[Spring] - Crawling with Jsoup

🛠️ Development environment

🍃 Spring : Spring Boot 3.1.3

🛠️ Java : Amazon Corretto 17

🛠️ Implementation

Applying Jsoup

First, let's look at what the official Jsoup documentation says!

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
In short: a Java library that fetches a URL and lets you extract and manipulate its HTML through DOM methods and CSS selectors.

As the introduction says, its core job here is to fetch the HTML at a given URL.

To use it in a Spring project, add the following dependency to build.gradle.

# build.gradle
implementation 'org.jsoup:jsoup:1.15.3'

Maven users can add the following instead.

# pom.xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>

Crawling test

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Test;

public class JsoupTest {
    // URL to test against
    private static final String URL = "https://jwhy-study.tistory.com/38";

    @Test
    void getHtmlTest() {
        try {
            Document document = Jsoup.connect(URL).get();
            System.out.println(document);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

}

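As in the code above, connecting to the URL with Jsoup.connect and printing the returned Document shows the page's HTML. This is the HTML the server returns, as parsed by Jsoup; Jsoup does not execute JavaScript, so content that scripts render in the browser will not appear.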

Now let's extract just the part we want!

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.jupiter.api.Test;

public class JsoupTest {

    private static final String URL = "https://jwhy-study.tistory.com/38";

    @Test
    void getHtmlTest() {
        try {
            int cnt = 0;
            Document document = Jsoup.connect(URL).get();
            // Alternative: select every p tag nested inside a blockquote tag.
            // Elements content = document.select("blockquote p");

            // Select every blockquote tag on the page.
            Elements content = document.select("blockquote");

            for (Element e : content) {
                // Print the text inside the tag
                String text = e.text();
                System.out.println(text);

                cnt++;
            }

            // The page contains 10 blockquotes
            assertEquals(10, cnt);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

}

The page has 10 blockquotes in total, and the test confirmed that all of them are fetched correctly!

To pull more specific data out of an Element such as e, you can use selectors like the following.

// Text of the first p tag inside the first td tag
element.select("td:eq(0) p:eq(0)").text();

// Value of the src attribute of the img tag inside the first td tag
element.select("td:eq(0) img").attr("src");

As the code above shows, Jsoup supports a wide range of CSS selector queries, so it is worth reading through them!
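
To try these selectors without hitting a real site, the sketch below parses a small inline HTML string with Jsoup.parse. The markup, the class name SelectorSketch, and the helper parse() are made up for illustration and are not taken from the page crawled above; it assumes the jsoup dependency declared earlier and Java 17.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {

    // Hypothetical markup, invented for this example only.
    static final String HTML = """
            <table>
              <tr>
                <td><p>first cell</p><img src="/images/a.png"></td>
                <td><p>second cell</p></td>
              </tr>
            </table>
            <blockquote><p>quote one</p></blockquote>
            <blockquote><p>quote two</p></blockquote>
            """;

    static Document parse() {
        // Jsoup.parse builds a Document from a string, so no network call is made.
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = parse();

        // Every blockquote element in the document.
        Elements quotes = document.select("blockquote");
        for (Element quote : quotes) {
            System.out.println(quote.text());
        }

        // :eq(0) matches elements whose sibling index is 0,
        // i.e. the first td in its row and the first p inside it.
        System.out.println(document.select("td:eq(0) p:eq(0)").text());

        // attr("src") returns the attribute value exactly as written in the markup.
        System.out.println(document.select("td:eq(0) img").attr("src"));
    }
}
```

Working against a fixed string like this also keeps selector tests stable, since a test that depends on a live page breaks whenever the page changes.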

🤔 Retrospective

The things to keep in mind when crawling are well summarized on LISTLY.

In summary:

  • ์ž๋ฃŒ์˜ ์ถœ์ฒ˜, DB, ์ €์ž‘๊ถŒ, ๊ฐœ์ธ ์‹ ์ƒ์— ๊ด€ํ•œ ์ž๋ฃŒ ์ ‘๊ทผ ๋“ฑ์— ๋Œ€ํ•ด์„œ ์‹ ์ค‘ํ•˜๊ฒŒ ์ž˜ ์‚ดํŽด๋ณธ ๋’ค ์‚ฌ์šฉํ•˜์ž.
  • Robots.txs ํŒŒ์ผ์„ ํ†ตํ•ด ํฌ๋กค๋งํ•  ์ˆ˜ ์žˆ๋Š” ๋ฒ”์œ„๋ฅผ ํ™•์ธํ•˜๊ณ , ๊ทธ ๋ฒ”์œ„ ๋‚ด์—์„œ๋งŒ ์ง„ํ–‰ํ•˜์ž.
  • ๋„ˆ๋ฌด ๋งŽ์€ ์š”์ฒญ์„ ๋ณด๋‚ด ํ•ด๋‹น ์„œ๋ฒ„์— ๋ถ€ํ•˜๋ฅผ ์ฃผ์ง€ ๋ง์ž.

Crawling itself is not illegal, but it should be used with care!