Naver Webtoon (urllib: request, request.urlretrieve)

나도초딩 2022. 8. 29.

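Crawling and downloading webtoons with Python BeautifulSoup

*Please use this post for study purposes only. Python crawling series. Crawling and saving Naver Webtoon images: this post. Building an automated web game macro with Selenium: foxtrotin.tistory.com/179 Naver real-time search terms...

foxtrotin.tistory.com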

request.urlopen(): opens a URL and returns a response object that can be read like a file
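
As a minimal sketch of the `urlopen()` call pattern, the snippet below reads a response inside a `with` block so it is closed automatically. To stay runnable offline it opens a `data:` URL built from a made-up HTML string. `fetch_text` and `SAMPLE_HTML` are hypothetical names used only for illustration, and a real page would be requested with an `https://` URL instead.

```python
import urllib.request
from urllib.parse import quote

# Made-up page content standing in for a real web page (assumption for the demo).
SAMPLE_HTML = "<html><body><h2>Sample Title</h2></body></html>"


def fetch_text(url: str, encoding: str = "utf-8") -> str:
    """Open url with urlopen() and return the decoded response body."""
    # The with block closes the response even if read() raises.
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


# A data: URL carries its content inline, so no network access is needed here.
sample_url = "data:text/html;charset=utf-8," + quote(SAMPLE_HTML)
page = fetch_text(sample_url)
print(page)
```

The returned string is what would be handed to `BeautifulSoup(..., "html.parser")` in a real run.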

request.urlretrieve(url, filename): downloads the resource at url and saves it to a local file named filename
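
The sketch below shows the `urlretrieve(url, filename)` call on its own. So that it runs without a network connection, a small local file exposed through a `file://` URL stands in for a remote image. The file names and contents are made up for the demo.

```python
import tempfile
import urllib.request
from pathlib import Path

# Work in a throwaway folder so the demo leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp).resolve()

    # Stand-in for a remote image: a small local file exposed as a file:// URL.
    source = tmp_dir / "source.bin"
    source.write_bytes(b"fake image bytes")
    source_url = source.as_uri()

    # urlretrieve(url, filename) copies the resource to filename and
    # returns a (local_path, headers) tuple.
    target = tmp_dir / "1.png"
    saved_path, headers = urllib.request.urlretrieve(source_url, str(target))

    saved_bytes = Path(saved_path).read_bytes()
    print(saved_path, len(saved_bytes))
```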

os.chdir("..") # move up to the parent folder
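
To see what `os.chdir("..")` does, this small sketch moves into a subfolder and back inside a temporary directory. The folder name `episode` is made up, and the starting directory is restored at the end so the demo has no lasting effect.

```python
import os
import tempfile

start_dir = os.getcwd()  # remember where we began so we can return

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.realpath(tmp)
    os.chdir(base)

    os.mkdir("episode")   # hypothetical episode folder
    os.chdir("episode")   # move into it
    inside = os.getcwd()

    os.chdir("..")        # back up to the parent folder
    after_parent = os.getcwd()

    os.chdir(start_dir)   # leave the temp folder before it is deleted

print(inside)
print(after_parent)
```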

re.sub(): used here to strip tags, see https://wikidocs.net/4308
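
A short sketch of the tag-stripping pattern `'<.*?>'`: the `?` makes the match non-greedy, so each tag is removed on its own. The sample markup is made up. It stands in for what `str()` of a parsed `<h3>` element might look like.

```python
import re

# Made-up markup standing in for str() of a parsed <h3> element.
raw = '<h3 class="title">Episode 1. <b>The Beginning</b></h3>'

# Non-greedy: .*? stops at the first '>', so each tag is removed separately.
text_only = re.sub(r"<.*?>", "", raw)

# Greedy: .* runs to the last '>', so the text between the tags is lost too.
greedy = re.sub(r"<.*>", "", raw)

print(repr(text_only))
print(repr(greedy))
```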

from bs4 import BeautifulSoup
import urllib.request
import os, re  # re is used to strip tags


# Work around the Access Denied error by sending a browser-like User-Agent
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

# Request the list page of the webtoon to crawl
# html = urllib.request.urlopen("https://comic.naver.com/webtoon/list.nhn?titleId=725586&weekday=fri")
html = urllib.request.urlopen("https://comic.naver.com/webtoon/list?titleId=711422&page=1")
soup = BeautifulSoup(html.read(), "html.parser")  # parse the web page
html.close()  # close the response

comic_title = soup.find("div", {"class": "detail"}).find("h2").text.split()[0]  # webtoon title: first word of the <h2> (attrs passed as a dict)

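os.chdir("/Users/kang/Downloads/")  # downloads folder
comic_dir = comic_title  # renamed from dir to avoid shadowing the built-in dir()
if not os.path.isdir(comic_dir):
    os.mkdir(comic_dir)
    print(comic_title + " directory created")
else:
    print("A directory with the same name already exists")
os.chdir("/Users/kang/Downloads/" + comic_dir)  # move into the folder to download into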


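comic_list = []
tmp_list = soup.select('td>a')  # <a> tags directly inside <td>
for a_tag in tmp_list:
    if 'https' in a_tag['href']:  # skip the "read the next episode early" link
        continue
    comic_list.append(a_tag['href'])
comic_list = sorted(set(comic_list))  # drop duplicates; the hrefs are sorted as strings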

for href in comic_list:
    ep_url = "https://comic.naver.com" + href
    with urllib.request.urlopen(ep_url) as html:  # closed automatically after parsing
        soup2 = BeautifulSoup(html.read(), "html.parser")

    ep = soup2.find('h3')  # <h3>episode title</h3>
    ep_title = re.sub('<.*?>', '', str(ep)).strip()  # strip the tags, keep only the title text

    if not os.path.isdir(ep_title):
        os.mkdir(ep_title)
        print(ep_title + " directory created")
    else:
        print("A directory with the same name already exists")
    os.chdir(ep_title)  # move into the episode folder

    img_div = soup2.find("div", {"class": "wt_viewer"})
    img_all = img_div.find_all("img")

    for num, img in enumerate(img_all, start=1):
        img_path = img.get("src")
        img_num = str(num) + ".png"  # images are saved as 1.png, 2.png, ...
        urllib.request.urlretrieve(img_path, img_num)
    print(ep_title + " download complete")
    os.chdir("..")  # back to the parent folder

print("All downloads finished")
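
Two details of the script above may be worth a closer look: `sorted(set(comic_list))` compares the hrefs as strings, and every image is saved with a `.png` name whatever its URL says. The sketch below is one possible way to handle both, not the original author's method. It assumes each href carries the episode number in a `no` query parameter. The hrefs and image URL are made up, and `episode_number` and `image_filename` are hypothetical helper names.

```python
import os
from urllib.parse import parse_qs, urlparse


def episode_number(href: str) -> int:
    """Return the numeric value of the 'no' query parameter, or 0 if absent."""
    query = parse_qs(urlparse(href).query)
    return int(query.get("no", ["0"])[0])


def image_filename(index: int, img_url: str, default_ext: str = ".png") -> str:
    """Build a zero-padded file name that keeps the extension found in the URL."""
    ext = os.path.splitext(urlparse(img_url).path)[1]
    return f"{index:03d}{ext or default_ext}"


# Hypothetical hrefs in the shape assumed above.
hrefs = [
    "/webtoon/detail?titleId=711422&no=10",
    "/webtoon/detail?titleId=711422&no=2",
    "/webtoon/detail?titleId=711422&no=1",
]

by_string = sorted(set(hrefs))                      # string order: 1, 10, 2
by_number = sorted(set(hrefs), key=episode_number)  # numeric order: 1, 2, 10

print(by_string)
print(by_number)
print(image_filename(1, "https://example.com/images/a_1.jpg?type=q90"))
```

Zero-padded names such as `001.jpg` also keep the images in reading order when a file browser sorts them by name.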

Attribution, NonCommercial, NoDerivatives
