Scraping: urllib, BeautifulSoup

나도초딩 2022. 9. 8.

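Beautiful Soup is, alongside Scrapy, one of the most widely used tools in crawling projects.

It is a module that parses an HTML document and lets you pull out only the parts you need, which speeds up scraping work.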

# Adding BeautifulSoup

from bs4 import BeautifulSoup as bs
from urllib import request

url = 'https://www.example.com'
html = request.urlopen(url)       # HTTP response object; bs() can read it directly
soup = bs(html, 'html.parser')    # parse with Python's built-in HTML parser

# Pretty-printing

Use prettify().

from bs4 import BeautifulSoup as bs

# Using prettify()
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = bs(markup, 'html.parser')
soup.prettify()
# '<a href="http://example.com/">\n I linked to\n <i>\n...'
print(soup.prettify())
# Output (html.parser adds no <html>/<body> wrappers; lxml or html5lib would)
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>

# Using tag names

# The examples from here on use the "three sisters" sample document
# (html_doc) from the Beautiful Soup documentation, parsed into soup.
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>
soup.a      # attribute access returns only the first matching tag
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
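The tag-name lookups above need a parsed document to run against. As a minimal sketch, the block below builds soup from an abridged copy of the "three sisters" sample used in the Beautiful Soup documentation; the variable names first_link and all_links are only illustrative.

```python
from bs4 import BeautifulSoup as bs

# Abridged copy of the "three sisters" sample from the Beautiful Soup docs.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = bs(html_doc, 'html.parser')

first_link = soup.a              # attribute access: first <a> only
all_links = soup.find_all('a')   # every <a> in the document

print(soup.title)
print(soup.body.b)
print(first_link['href'])
print(len(all_links))
```

Attribute access such as soup.a is convenient for a quick look, while find_all() is the better fit when every match is needed.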

# Using contents / children

# Using .contents
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
soup.contents[0].name
# 'html'

# A string has no .contents; to walk a tag's children, loop over .children instead
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
for child in title_tag.children:
    print(child)
# The Dormouse's story

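# Using descendants

# .descendants walks every nested node (children, grandchildren, ...), not just direct children
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story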

# Using string

# .string works only when the tag has exactly one child
title_tag.string
# "The Dormouse's story"
head_tag.string     # head's only child is title, so title's string is returned
# "The Dormouse's story"
# When a tag contains more than one child, .string returns None.
print(soup.html.string)
# None
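Because .string gives None as soon as a tag has more than one child, get_text() is often the more practical choice when all of the text is wanted. The following is a small sketch on a made-up snippet (the markup and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup: <p> has three children, <i> has exactly one.
soup = bs('<p>Hello, <b>world</b>!</p><i>only child</i>', 'html.parser')

single = soup.i.string      # one child, so the string is returned
multiple = soup.p.string    # several children, so this is None
joined = soup.p.get_text()  # concatenates every string under <p>

print(single, multiple, joined)
```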

# Using stripped_strings

# Yields every string in the document with surrounding whitespace removed,
# skipping strings that are whitespace only.
for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'

# Using parents

A tool for walking upward from the selected element.

# .parents iterates over every ancestor above the selected element
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]

# Using next_sibling(s) / previous_sibling(s)

These fetch the nodes at the same level (under the same parent) as the selected element.

sibling_soup = bs("<a><b>text1</b><c>text2</c></a>", 'html.parser')
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>

# next_siblings / previous_siblings iterate over every node at the same level,
# including the text between tags
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# '; and they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'
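One thing that tends to surprise people with the sibling attributes is that the "next sibling" of a tag is frequently a whitespace string rather than the next tag. The sketch below uses a made-up list (markup and names are assumptions) to show the difference between next_sibling and find_next_sibling():

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup with a newline between the two <li> tags.
soup = bs('<ul><li id="a">A</li>\n<li id="b">B</li></ul>', 'html.parser')
first = soup.find(id='a')

raw_next = first.next_sibling             # the newline text node
tag_next = first.find_next_sibling('li')  # skips text, returns the next <li>

print(repr(raw_next))
print(tag_next)
```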

# Using a list

# Passing a list matches any of the listed tag names
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Using the limit argument

An argument that caps how many results find_all() returns; the search stops once that many matches are found.

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

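# Using the recursive argument

With recursive=False, find_all() looks only at the direct children of the tag instead of all of its descendants.

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []   (title sits under head, so it is not a direct child of html)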
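The effect of recursive=False is easier to see on a tag that has matches at two depths. This is a minimal sketch on made-up markup (the snippet and variable names are assumptions):

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup: one <p> directly under <div>, one nested deeper.
soup = bs('<div><p>direct</p><section><p>nested</p></section></div>', 'html.parser')

all_p = soup.div.find_all('p')                      # every descendant <p>
direct_p = soup.div.find_all('p', recursive=False)  # direct children only

print([p.string for p in all_p])
print([p.string for p in direct_p])
```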

# Using find_all()

find_all() searches all descendants of a tag and returns every match as a list.

# Using the name argument
soup.find_all("title")
# [<title>The Dormouse's story</title>]

# Using a CSS class (second positional argument)
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

# Using a tag name
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Keyword arguments
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Regular expressions
# (newer Beautiful Soup releases call this argument string=; text= is the older name)
import re
soup.find(text=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

# Matching the exact class attribute string
css_soup = bs('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

soup.find_all("a", "sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Using text
soup.find_all(text="Elsie")
# ['Elsie']
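The find_all() filters above can be tried side by side on one small document. The sketch below is only an illustration: the markup (including the "brother" link) and the variable names are made up, and it uses string=, the current name of the text= argument.

```python
import re
from bs4 import BeautifulSoup as bs

# Hypothetical markup for illustration.
html = """
<p class="title"><b>Heading</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="brother" href="http://example.org/tom" id="link3">Tom</a>
"""
soup = bs(html, 'html.parser')

by_class = soup.find_all('a', class_='sister')                # tag name + CSS class
by_id = soup.find_all(id='link2')                             # keyword argument
by_regex = soup.find_all(href=re.compile(r'example\.org'))    # regex on an attribute
by_text = soup.find_all(string='Elsie')                       # matches strings, not tags

print(len(by_class), by_id[0].string, by_regex[0]['id'], by_text)
```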

# Using find()

find() returns only the first match inside the tag, or None when nothing matches.

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>

# Navigating by tag name is the same as
# calling find() repeatedly.
soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>

# Using find_parent(s)()

These search upward through the ancestors of the selected node.

a_string = soup.find(string="Lacie")
a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>

# find_next_sibling(s)()

These search the following nodes at the same level for matching tags.

first_link = soup.a
first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>

# find_previous_sibling(s)()

These search the preceding nodes at the same level for matching tags.

last_link = soup.find("a", id="link3")
last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
# <p class="title"><b>The Dormouse's story</b></p>

# Using find_all_next() / find_next()

find_all_next(): returns every match that comes after the current element in the document, not only its siblings.

find_next(): returns only the first such match.

first_link.find_all_next(text=True)
# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
#  ';\nand they lived at the bottom of a well.', '\n\n', '...', '\n']
first_link.find_next("p")
# <p class="story">...</p>

# Using find_all_previous() / find_previous()

find_all_previous(): returns every match that comes before the current element in the document, not only its siblings.

find_previous(): returns only the first such match.

first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]
first_link.find_previous("title")
# <title>The Dormouse's story</title>
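The sibling methods and the next/previous methods are easy to mix up: the former stay under the same parent, while the latter follow document order across the whole tree. A small sketch on made-up markup (the snippet, ids, and variable names are assumptions):

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup: two links share a <div>, a third lives in a later <p>.
html = '<div><a id="first">one</a><a id="second">two</a></div><p><a id="third">three</a></p>'
soup = bs(html, 'html.parser')
first = soup.find(id='first')

# Siblings only: links under the same <div>.
sibling_ids = [a['id'] for a in first.find_next_siblings('a')]
# Document order: every later link, wherever it sits.
following_ids = [a['id'] for a in first.find_all_next('a')]
# Walking backward from the third link reaches the earlier <div>.
previous_div = soup.find(id='third').find_previous('div')

print(sibling_ids, following_ids, previous_div.name)
```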

# Using CSS selectors

# Search by tag
soup.select("title")
# [<title>The Dormouse's story</title>]

# Search for tags anywhere beneath another tag (descendants)
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Search for tags directly beneath another tag (children)
soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("body > a")
# []

# Search by CSS class
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Search by ID
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Test whether an attribute exists
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Find tags by attribute value
soup.select('a[href="http://example.com/elsie"]')    # exact match
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')        # starts with
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')                     # ends with
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')                    # contains
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
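The selector forms above can be combined in one place. The following sketch runs them against a made-up menu snippet (markup and variable names are assumptions) and also shows select_one(), which returns the first match rather than a list:

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup for illustration.
html = """
<div id="menu">
  <a class="item" href="/home">Home</a>
  <span><a class="item external" href="https://example.com/docs">Docs</a></span>
</div>
"""
soup = bs(html, 'html.parser')

descendants = soup.select('#menu a')            # any depth under #menu
children = soup.select('#menu > a')             # direct children only
external = soup.select('a[href^="https://"]')   # href starts with https://
first_item = soup.select_one('a.item')          # first match, not a list

print(len(descendants), len(children), external[0].get_text(), first_item['href'])
```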

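# Finding the href attribute

# Look up an attribute of a tag by its name.
tag.attrs['attribute name']

# (the href value in this sample is a placeholder)
html = """
<html>
<body>
<div>
<a href="https://www.example.com">This is Test</a>
</div>
</body>
</html>
"""
soup = bs(html, 'html.parser')
print(soup.a.string)
print(soup.a.attrs['href'])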
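Indexing attrs raises KeyError when the attribute is absent, so scraping code often uses get() or has_attr() instead. A minimal sketch, assuming a made-up link tag (the markup and variable names are illustrative only):

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup for illustration.
soup = bs('<a href="https://www.example.com" class="btn primary">This is Test</a>', 'html.parser')
link = soup.a

href = link['href']                # shorthand for link.attrs['href']
classes = link['class']            # class is multi-valued, so this is a list
missing = link.get('title')        # None instead of KeyError when absent
has_href = link.has_attr('href')   # explicit existence check

print(href, classes, missing, has_href)
```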
