YouTube Crawling

This page collects material on the topic “유튜브 크롤링 – 유튜브제목과 댓글을 #크롤링! [ #R쓸신잡 R Selenium 제 2 편 ]” (crawling YouTube titles and comments, R Selenium part 2). Related posts are gathered at https://you.maxfit.vn/blog/. The video, by 에이림, has 9,724 views and 130 likes.

Details for the video “유튜브제목과 댓글을 #크롤링! [ #R쓸신잡 R Selenium 제 2 편 ]” (YouTube crawling)

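Hello! ※ The code is in the GitHub repository linked at the bottom 🙂
[R쓸신잡] (R아두면 쓸모있는 신비한 잡학코드, a series of handy R code) is back.
Many of you already use #crawling, right?
Today's video uses the R package RSelenium to collect YouTube video titles and comments 🙂
Selenium is really a tool for controlling web pages rather than one built for crawling.
That control is what lets us crawl more intelligently,
and with this package we will collect YouTube video titles and comments.
Part 3, on comment collection, will be uploaded soon, so please wait a little longer!
Thank you for watching, as always!
★ 에이림 GitHub ★
https://github.com/Leeyua-airim/shiny_repo/tree/master/RSelenium_Youtube
#유튜브크롤링 #댓글크롤링 #R쓸신잡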

유튜브 크롤링 주제에 대한 자세한 내용은 여기를 참조하세요.

유튜브 크롤링(3) 올인원 – 채널 제목, 댓글, 조회수, 자막까지

특정 유튜브 채널에서 동영상 목록의 링크를 가져오기 (채널명, 구독자수). – 제목, 조회수, 날짜, 좋아요 수, 싫어요 수, 댓글 개수. – 댓글 크롤링 …

+ 자세한 내용은 여기를 클릭하십시오

Source: 0goodmorning.tistory.com

Date Published: 4/14/2021

유튜브 크롤링 – velog

유튜브 크롤링 코드. … 검색어를 입력하면 해당 유튜브 영상의 정보를 csv파일로 저장하고 원하는 값을 입력하면 해당 url로 연결하여 영상을 재생 …

+ 더 읽기

Source: velog.io

Date Published: 2/9/2021

파이썬 유튜브 크롤링 셀레니움 1편 – 코딩하는 금융인

파이썬 Selenium 유튜브(Youtube) 크롤링 목표 : 파이썬 자동화 모듈 selenium의 webdriver를 사용하여 유튜브에 원하는 검색어를 던져 나오는 영상 …

+ 여기에 보기

Source: codingspooning.tistory.com

Date Published: 4/15/2022

[PYTHON] 파이썬 유튜브_크롤링 (COLDPLAY X BTS)

이번에는 유튜브 크롤링을 진행해보려고 합니다. . 신사업 구축, 경쟁사 분석, 시장 동향 등 다양한 목적으로 유튜브 데이터를 수집하여, …

+ 여기를 클릭

Source: hyunhp.tistory.com

Date Published: 5/29/2022

파이썬 유튜브 제목, 조회수 크롤링하기 – Dorulog

안녕하세요. 오랫만에 파이썬 포스팅을 하게 되었네요. 오늘은 유튜브 페이지의 제목과 조회수를 크롤링 해보겠습니다. 유튜브 페이지 크롤링 저번에 …

+ 여기에 보기

Source: dorudoru.tistory.com

Date Published: 4/26/2022

[Python] 유튜브 콘텐츠 크롤러 코드 Version 1.0 – Hey Tech

이처럼 Data Scraping을 하는 프로그램을 Data Sraper 또는 Web Scraper라고 부릅니다. 유튜브 내 특정 검색 결과의 콘텐츠를 자동 탐색하는 모습. (1) …

+ 더 읽기

Source: heytech.tistory.com

Date Published: 5/11/2022

[Python] 파이썬으로 유튜브 크롤링 – 1 – 개인용 복습공간

[Python] 파이썬으로 유튜브 크롤링 – 1 · 유튜브 채널에 들어가 우클릭 – 검사를 눌러서 가져올 텍스트를 확인한다. · 동영상의 제목, 유튜버, 동영상 …

+ 자세한 내용은 여기를 클릭하십시오

Source: taehwanis.tistory.com

Date Published: 11/23/2021

따라서 주제는 유튜버들중 한분을 정해서 데이터 수집부터 분석까지 진행해보고자 합니다. 또한, 본 포스팅에서는 유튜버에 대한 크롤링을 해보고자 …

+ 여기에 더 보기

Source: shinminyong.tistory.com

Date Published: 1/24/2022

유튜브 크롤링하기(제목, 주소, 조회수) – 잘 먹고 잘사는 법 159

유튜브 크롤링하기(제목, 주소, 조회수). by 박완밥 2020. 6. 1. 320×100. from selenium import webdriver from bs4 import BeautifulSoup as bs import pandas as pd …

+ 더 읽기

Source: winter-time.tistory.com

Date Published: 7/6/2022

주제에 대한 기사 평가 유튜브 크롤링

Author: 에이림
Views: 조회수 9,724회
Likes: 좋아요 130개
Date Published: 2019. 8. 16.
Video Url link: https://www.youtube.com/watch?v=P9uNay2atuQ

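YouTube Crawling (3) All-in-One: Channel Titles, Comments, View Counts, and Subtitles

I have a crawl running right now, so I'm using the time to write this up. Crawling is one thing, but what worries me more is how to clean the data afterwards. This post builds on my earlier ones, so adapt the code to your own purpose.

Features

– Get the links of the video list from a given YouTube channel (channel name, subscriber count)

– Title, view count, date, like count, dislike count, comment count

– Comment crawling (with a translation option)

– Extraction of auto-translated subtitles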

    from selenium import webdriver
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.keys import Keys
    import time

    options = webdriver.ChromeOptions()  # create the Chrome options object
    user_agent = "Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36 "
    options.add_argument('user-agent=' + user_agent)
    options.add_argument('headless')  # headless mode
    options.add_argument("window-size=1920x1080")  # window size (full screen)
    options.add_argument("disable-gpu")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    options.add_argument("--mute-audio")  # mute
    options.add_argument('--blink-settings=imagesEnabled=false')  # do not load images in the browser
    options.add_argument('incognito')  # run the browser in incognito mode
    options.add_argument("--start-maximized")

    #1
    prefs = {
        "translate_whitelists": {"en": "ko"},
        "translate": {"enabled": "true"}
    }
    options.add_experimental_option("prefs", prefs)

    #2
    prefs = {
        "translate_whitelists": {"your native language": "ko"},
        "translate": {"enabled": "True"}
    }
    options.add_experimental_option("prefs", prefs)

    #3
    options.add_experimental_option('prefs', {'intl.accept_languages': 'ko,ko_kr'})

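This is the basic Selenium webdriver setup. The prefs options are only needed for translating English, so you can leave them out. If you want to watch what the browser does on a first run, comment out the headless line: # options.add_argument('headless').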

    import os
    import re  # needed for re.sub below; missing from the original listing
    import pandas as pd
    import winsound

    ytb = pd.read_csv('youtube_link.csv')
    ytb_link = ytb.link.to_list()

    for i in ytb_link:
        driver = webdriver.Chrome('chromedriver.exe', options=options)
        driver.get(i)

        # scroll down
        time.sleep(1.5)
        endkey = 4  # 90~120 links / each extra press adds about 30
        while endkey:
            driver.find_element_by_tag_name('body').send_keys(Keys.END)
            time.sleep(0.3)
            endkey -= 1  # the original decremented an undefined name (endk)

        channel_name = driver.find_element_by_xpath('//*[@id="text-container"]').text
        subscribe = driver.find_element_by_css_selector('#subscriber-count').text
        # strip symbols that cause errors in folder names
        channel_name = re.sub('[=+,#/\?:^$.@*\"※~&%ㆍ!』\\‘|\[\]\<\>`\'…《\》]', '', channel_name)
        # print(channel_name, subscribe)

        # hand the page over to bs4
        html = driver.page_source
        soup = BeautifulSoup(html, 'lxml')
        video_list0 = soup.find('div', {'id': 'contents'})
        video_list2 = video_list0.find_all('ytd-grid-video-renderer', {'class': 'style-scope ytd-grid-renderer'})

        base_url = 'https://www.youtube.com'
        video_url = []

        # loop over the results and store each video's address in video_url
        for i in range(len(video_list2)):
            url = base_url + video_list2[i].find('a', {'id': 'thumbnail'})['href']
            video_url.append(url)

        driver.quit()

        if subscribe:
            channel = channel_name + ' - ' + subscribe
        else:
            channel = channel_name

        directory = f'data/{channel}/subtitle'
        if not os.path.exists(directory):
            os.makedirs(directory)

        print(channel, len(video_url))
        ytb_info(video_url, channel)
        print()
        winsound.PlaySound('sound.wav', winsound.SND_FILENAME)

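ytb_link: put the channels you want to collect into a list. I made a csv file with a column named 'link'.

channel: a folder is created from the channel name, so symbols that cause errors in folder names are stripped beforehand. The subtitle subfolder is created at the same time so that subtitle files have somewhere to be saved.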
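The folder-name cleanup described above can be sketched on its own. This is a minimal sketch, not the author's exact character list: it assumes the goal is a name that Windows accepts, and the helper name `safe_folder_name` and its `fallback` parameter are made up for illustration.

```python
import re

# Characters Windows refuses in file and folder names, plus control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_folder_name(name: str, fallback: str = "channel") -> str:
    """Return `name` without characters that are invalid in Windows paths.

    Hypothetical helper: trailing dots and spaces are stripped as well,
    and `fallback` is returned when nothing usable is left.
    """
    cleaned = _INVALID_CHARS.sub("", name).strip(" .")
    return cleaned or fallback


if __name__ == "__main__":
    print(safe_folder_name("News: A/B?"))
```

Keeping the cleanup in one function means the channel folder and the per-video csv names can share the same rule.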

# A Windows PlaySound call signals the end of each channel. Turn it off if you find it noisy.

    import time

    last_page_height = driver.execute_script("return document.documentElement.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(0.5)
        # re-read the height after scrolling (this line was missing from the listing)
        new_page_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_page_height == last_page_height:
            break
        last_page_height = new_page_height
        time.sleep(0.75)

endkey: sets how many links are collected from a channel. The current setting collects 90 to 120. With time.sleep(2) it crawls up to 180. Each increment of endkey adds about 30 more. If you would rather just crawl every link, use the code above instead.
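The "scroll until the page stops growing" idea can be written without tying it to a browser, which makes it easy to try out. The sketch below is an illustration under assumptions: `scroll_until_stable`, its callables, and the `max_rounds` safety cap are hypothetical names, not part of the original post.

```python
import time
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_to_bottom: Callable[[], None],
                        pause: float = 0.5,
                        max_rounds: int = 1000) -> int:
    """Scroll repeatedly until the page height stops changing.

    Returns the number of scroll rounds performed. `max_rounds` is a
    safety cap (an assumption) so an endless feed cannot loop forever.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # give the page time to load more items
        new_height = get_height()
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```

With Selenium, `get_height` could wrap `driver.execute_script("return document.documentElement.scrollHeight")` and `scroll_to_bottom` the `window.scrollTo` call used in the post.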

    # when you only want the basic information
    from bs4 import BeautifulSoup
    import pyautogui
    import pandas as pd
    import re

    def ytb_info2(video_url, channel):
        print(f'{channel}', ' 크롤링 시작')
        driver = webdriver.Chrome('C:/work/python/Asia_GAN/myproject/youtube/chromedriver.exe', options=options)

        # lists that hold the collected data
        date_list = []
        title_list = []
        view_list = []
        like_list = []
        dislike_list = []
        comment_list = []

        # crawl each video of the channel
        for i in range(len(video_url)):
            start_url = video_url[i]
            print(start_url, end=' / ')
            driver.get(start_url)
            driver.implicitly_wait(1.5)
            body = driver.find_element_by_tag_name('body')

            # page down so the comment count is not null
            num_of_pagedowns = 2
            while num_of_pagedowns:
                body.send_keys(Keys.PAGE_DOWN)
                time.sleep(0.5)
                num_of_pagedowns -= 1
            time.sleep(0.5)

            # elements to crawl
            try:
                info = driver.find_element_by_css_selector('.style-scope ytd-video-primary-info-renderer').text.split('\n')
                if '인기 급상승 동영상' in info[0]:
                    info.pop(0)
                elif '#' in info[0].split(' ')[0]:
                    info.pop(0)
                title = info[0]
                divide = info[1].replace('조회수 ', '').replace(',', '').split('회')
                view = divide[0]
                date = divide[1].replace(' ', '')
                like = info[2]
                dislike = info[3]

                driver.implicitly_wait(1)
                try:
                    comment = driver.find_element_by_css_selector('#count > yt-formatted-string > span:nth-child(2)').text.replace(',', '')
                except Exception:
                    comment = '댓글x'

                # append to the lists
                title_list.append(title)
                view_list.append(view)
                date_list.append(date)
                like_list.append(like)
                dislike_list.append(dislike)
                comment_list.append(comment)

                # save the crawled information
                new_data = {'date': date_list, 'title': title_list, 'view': view_list,
                            'comment': comment_list, 'like': like_list, 'dislike': dislike_list}
                df = pd.DataFrame(new_data)
                df.to_csv(f'data/{channel}/{channel}.csv', encoding='utf-8-sig')
            except Exception:
                continue

            # for checking
            print(title, view, date, like, dislike, comment)

        driver.quit()

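When you don't need subtitles or comments

This version crawls only the title, date, view count, like count, dislike count, and comment count. There isn't much data, so Selenium alone is enough; passing the page source to bs4 makes little difference in speed.

# print(title, view, date, like, dislike, comment): comment this line out if you don't need to check what is being collected.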
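The view-count and date split used in the function above can be isolated into a small parser. This is a sketch under an assumption: the info line looks like "조회수 9,724회 2019. 8. 16." with a full, unabbreviated count. The name `parse_view_line` is hypothetical.

```python
from typing import Tuple


def parse_view_line(line: str) -> Tuple[int, str]:
    """Split a line such as '조회수 9,724회 2019. 8. 16.' into (views, date).

    Assumes the count is written out in full; abbreviated counts
    would need separate handling.
    """
    body = line.replace("조회수", "", 1).strip()
    count, _, rest = body.partition("회")
    views = int(count.replace(",", "").strip())
    date = rest.replace(" ", "")
    return views, date


if __name__ == "__main__":
    print(parse_view_line("조회수 9,724회 2019. 8. 16."))
```

Returning the count as an int rather than a string makes later sorting and aggregation simpler.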

If you also need comments and subtitles, keep reading below.

    from youtube_transcript_api import YouTubeTranscriptApi
    from konlpy.tag import Kkma
    from pykospacing import Spacing

    def ytb_subtitle(start_url, title):
        try:
            code = start_url.split('=')[1]
            srt = YouTubeTranscriptApi.get_transcript(f"{code}", languages=['ko'])  # Korean, list of dicts

            text = ''
            for i in range(len(srt)):
                text += srt[i]['text'] + ' '
            text_ = text.replace(' ', '')

            # sentence splitting / kss works too
            kkma = Kkma()
            text_sentences = kkma.sentences(text_)

            # sentence-final endings
            lst = ['죠', '다', '요', '시오', '습니까', '십니까', '됩니까', '옵니까', '뭡니까',]
            df = pd.read_csv('not_verb.csv', encoding='utf-8')
            not_verb = df.stop.to_list()

            # split into word units
            text_all = ' '.join(text_sentences).split(' ')
            for n in range(len(text_all)):
                i = text_all[n]
                if len(i) == 1:  # single character: no further work
                    continue
                else:
                    for j in lst:  # sentence-final endings
                        # question
                        if j in lst[4:]:
                            i += '?'
                        # imperative
                        elif j == '시오':
                            i += '!'
                        # period
                        else:
                            if i in not_verb:  # skip listed words
                                continue
                            else:
                                if j == i[len(i) - 1]:  # sentence end
                                    text_all[n] += '.'

            spacing = Spacing()
            text_all_in_one = ' '.join(text_all)
            text_split = spacing(text_all_in_one.replace(' ', '')).split('.')
            text2one = []
            for t in text_split:
                text2one.append(t.lstrip())
            w = '. '.join(text2one)

            # `channel` is the module-level name set by the channel loop
            f = open(f'data/{channel}/subtitle/{title}.txt', 'w', encoding='utf-8')
            f.write(w)
            f.close()
            print('O')
        except:
            print('X')

For downloading YouTube subtitles, please refer to my earlier post. In not_verb.csv, add words that end in '다' or '요' but are nouns or adjectives rather than verbs, under a column named stop.
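The subtitle function takes the video id with `start_url.split('=')[1]`, which picks up extra text when the URL carries further parameters such as `&t=10s`. A more tolerant variant, offered as a sketch with a hypothetical helper name (`video_id_from_url`), parses the query string instead:

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url: str) -> Optional[str]:
    """Return the YouTube video id from a watch URL or a youtu.be short link.

    Returns None when no id can be found.
    """
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/") or None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


if __name__ == "__main__":
    print(video_id_from_url("https://www.youtube.com/watch?v=P9uNay2atuQ"))
```

The returned id could then be passed to the transcript call in place of `code`.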

    # without English translation
    import winsound as sd
    from bs4 import BeautifulSoup
    import pyautogui
    import pandas as pd
    import re

    def beepsound():
        fr = 2000  # range : 37 ~ 32767
        du = 1000  # 1000 ms == 1 second
        sd.Beep(fr, du)  # winsound.Beep(frequency, duration)

    def ytb_info(video_url, channel):
        print(f'{channel}', ' 크롤링 시작')
        driver = webdriver.Chrome('chromedriver.exe', options=options)
        # new_data = {'date': '', 'title': '', 'view': '', 'comment': '', 'like': '', 'dislike': ''}
        count = 1

        # lists that hold the collected data
        date_list = []
        title_list = []
        view_list = []
        like_list = []
        dislike_list = []
        comment_list = []

        try:
            # crawl each video of the channel
            for i in range(len(video_url)):
                start_url = video_url[i]
                print(start_url, end=' / ')
                driver.get(start_url)
                driver.implicitly_wait(1.5)
                body = driver.find_element_by_tag_name('body')

                # page down so the comment count is not null
                num_of_pagedowns = 1
                while num_of_pagedowns:
                    body.send_keys(Keys.PAGE_DOWN)
                    time.sleep(0.5)
                    num_of_pagedowns -= 1
                driver.implicitly_wait(1)

                # elements to crawl
                try:
                    info = driver.find_element_by_css_selector('.style-scope ytd-video-primary-info-renderer').text.split('\n')
                    if '인기 급상승 동영상' in info[0]:
                        info.pop(0)
                    elif '#' in info[0].split(' ')[0]:
                        info.pop(0)
                    title = info[0]
                    divide = info[1].replace('조회수 ', '').replace(',', '').split('회')
                    view = divide[0]
                    date = divide[1].replace(' ', '')
                    like = info[2]
                    dislike = info[3]

                    try:
                        comment = driver.find_element_by_css_selector('#count > yt-formatted-string > span:nth-child(2)').text.replace(',', '')
                    except:
                        comment = '댓글x'

                    # append to the lists
                    title_list.append(title)
                    view_list.append(view)
                    date_list.append(date)
                    like_list.append(like)
                    dislike_list.append(dislike)
                    comment_list.append(comment)

                    # save the crawled information
                    new_data = {'date': date_list, 'title': title_list, 'view': view_list,
                                'comment': comment_list, 'like': like_list, 'dislike': dislike_list}
                    df = pd.DataFrame(new_data)
                    df.to_csv(f'data/{channel}/-{channel}.csv', encoding='utf-8-sig')
                except:
                    continue
                # print(title, view, date, like, dislike, comment)

                num_of_pagedowns = 1
                while num_of_pagedowns:
                    body.send_keys(Keys.PAGE_DOWN)
                    time.sleep(0.5)
                    num_of_pagedowns -= 1

                # scroll to the bottom
                last_page_height = driver.execute_script("return document.documentElement.scrollHeight")
                while True:
                    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
                    # driver.implicitly_wait(2)  # caused errors
                    time.sleep(0.5)
                    new_page_height = driver.execute_script("return document.documentElement.scrollHeight")
                    if new_page_height == last_page_height:
                        break
                    last_page_height = new_page_height
                    # driver.implicitly_wait(1)
                    time.sleep(0.75)
                time.sleep(0.5)

                # comment crawling
                html = driver.page_source
                soup = BeautifulSoup(html, 'lxml')
                users = soup.select("div#header-author > h3 > #author-text > span")
                comments = soup.select("yt-formatted-string#content-text")

                user_list = []
                review_list = []
                for i in range(len(users)):
                    str_tmp = str(users[i].text)
                    str_tmp = str_tmp.replace('\n', '')
                    str_tmp = str_tmp.replace('\t', '')
                    str_tmp = str_tmp.replace(' ', '')
                    str_tmp = str_tmp.replace(' ', '')
                    user_list.append(str_tmp)

                    str_tmp = str(comments[i].text)
                    str_tmp = str_tmp.replace('\n', '')
                    str_tmp = str_tmp.replace('\t', '')
                    str_tmp = str_tmp.replace(' ', '')
                    review_list.append(str_tmp)

                # save the comments
                pd_data = {"ID": user_list, "Comment": review_list}
                youtube_pd = pd.DataFrame(pd_data)
                title = re.sub('[-=+,#/\?:^$.@*\"※~&%ㆍ!』\\‘|\[\]\<\>`\'…《\》]', '', title)
                youtube_pd.to_csv(f"data/{channel}/{title}.csv", encoding='utf-8-sig')  # ,index_col = False)
                print('ㅁ', end='')

                # subtitle extraction
                ytb_subtitle(start_url, title)

                # close the ad
                if count:
                    # time.sleep(1)
                    try:
                        driver.implicitly_wait(0.5)
                        driver.find_element_by_css_selector("#main > div > ytd-button-renderer").click()
                        count -= 1
                    except:
                        continue
        except:
            driver.quit()
            beepsound()

        driver.quit()
        beepsound()

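Basic information / comments / subtitles

Beyond the basic-information crawl, this version scrolls down and then passes the HTML page_source to bs4 to crawl the comments. Because the volume is large, I recommend bs4, which is lighter and faster than Selenium for this step.

Crawling every comment and fetching the subtitles took about 33 seconds per video. This will vary with your computer and connection. A beep sounds at the end of each channel; turn it off if you don't need it.

* Caution *

YouTube comments are sorted by top comments by default, so comments further down are more likely to have received few likes or little attention. I don't need every comment, so I tuned the sleep times to crawl as fast as possible while still gathering comment data. When a video has few comments, all of them are crawled; when it has many, only about 60 to 90% are.

If you need every comment, set time.sleep to 1 second or more. I used time.sleep because with driver.implicitly_wait the page sometimes scrolled without the comments loading.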
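The comment-collection step indexes the author list and the comment list with the same counter, which fails if the two lists end up with different lengths. A small variation, shown here as a sketch with hypothetical helper names (`clean`, `pair_comments`), pairs them with `zip` and collapses whitespace rather than deleting it:

```python
from typing import Iterable, List, Tuple


def clean(text: str) -> str:
    """Collapse newlines, tabs, and runs of spaces into single spaces."""
    return " ".join(text.split())


def pair_comments(users: Iterable[str],
                  comments: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each author with a comment, stopping at the shorter input."""
    return [(clean(u), clean(c)) for u, c in zip(users, comments)]


if __name__ == "__main__":
    print(pair_comments(["\n  user1 \n"], ["hello\n\tworld"]))
```

The inputs would be the `.text` values of the selected elements; whether to keep or drop spaces inside comments is a choice this sketch makes differently from the post.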

    # with English translation
    import pyautogui
    import pandas as pd
    import re

    def ytb_info(video_url, channel):
        print(f'{channel}', ' 크롤링 시작')
        driver = webdriver.Chrome('chromedriver.exe', options=options)
        df = pd.DataFrame()
        count = 1

        # crawl each video of the channel
        for i in range(len(video_url)):
            start_url = video_url[i]
            print(start_url, end='/ ')
            driver.implicitly_wait(1)
            driver.get(start_url)

            # translate the page through the browser context menu
            pyautogui.hotkey('shift', 'F10')
            for i in range(7):
                pyautogui.hotkey('down')
            pyautogui.hotkey('enter')

            body = driver.find_element_by_tag_name('body')

            # page down so the comment count is not null
            num_of_pagedowns = 1
            while num_of_pagedowns:
                body.send_keys(Keys.PAGE_DOWN)
                time.sleep(.75)
                num_of_pagedowns -= 1
            driver.implicitly_wait(1)

            # elements to crawl
            info = driver.find_element_by_css_selector('.style-scope ytd-video-primary-info-renderer').text.split('\n')
            if '인기 급상승 동영상' in info[0]:
                info.pop(0)
            elif '#' in info[0].split(' ')[0]:
                info.pop(0)
            title = info[0]
            divide = info[1].replace('조회수 ', '').replace(',', '').split('회')
            view = divide[0]
            date = divide[1].replace(' ', '')
            like = info[2]
            dislike = info[3]

            try:
                comment = driver.find_element_by_css_selector('#count > yt-formatted-string > span:nth-child(2)').text.replace(',', '')
            except:
                comment = '댓글x'

            # save the crawled information
            new_data = {'date': date, 'title': title, 'view': view,
                        'comment': comment, 'like': like, 'dislike': dislike}
            # the original used df.append, which was removed in pandas 2.0
            df = pd.concat([df, pd.DataFrame([new_data])], ignore_index=True)
            df.to_csv(f'data/{channel}/{channel}.csv', encoding='utf-8-sig')
            # print(title, view, date, like, dislike, comment)

            # scroll to the bottom
            last_page_height = driver.execute_script("return document.documentElement.scrollHeight")
            while True:
                driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
                time.sleep(1)
                new_page_height = driver.execute_script("return document.documentElement.scrollHeight")
                if new_page_height == last_page_height:
                    break
                last_page_height = new_page_height
                time.sleep(1)
            time.sleep(0.5)

            # comment crawling
            review_list = []
            user_list = []
            reviews = driver.find_elements_by_css_selector('#content-text')
            users = driver.find_elements_by_css_selector('h3.ytd-comment-renderer a span')
            num = 0
            for i in range(len(users)):
                review = reviews[i].text.replace('\n', ' ')
                review_list.append(review)
                user = users[i].text
                user_list.append(user)

            # save the comments
            pd_data = {"ID": user_list, "Comment": review_list}
            youtube_pd = pd.DataFrame(pd_data)
            title = re.sub('[-=+,#/\?:^$.@*\"※~&%ㆍ!』\\‘|\[\]\<\>`\'…《\》]', '', title)
            youtube_pd.to_csv(f"data/{channel}/{title}.csv", encoding='utf-8-sig')
            print('ㅁ', end='')

            # subtitle extraction
            ytb_subtitle(start_url, title)

            # close the ad
            if count:
                # time.sleep(1)
                try:
                    driver.implicitly_wait(0.5)
                    driver.find_element_by_css_selector("#main > div > ytd-button-renderer").click()
                    count -= 1
                except:
                    continue

        driver.quit()

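Translating foreign-language comments

Drawbacks: it cannot run headless, and you cannot use the mouse while it runs. It also takes a really, really long time. I don't think you need to go this far, but I'm leaving it here for anyone who does.

The biggest problem is that translated text does not carry over to bs4. Every comment and nickname has to be collected through Selenium, which is why it is so slow.

What I plan to do with this data is still a secret.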

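YouTube Crawling

I made this one out of boredom, too.

Enter a search term and the code saves information about the matching YouTube videos to a csv file;

enter the number of a video and it opens that video's URL and plays it.

I didn't bother implementing page-down scrolling.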

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import urllib.request
    import time
    from IPython.display import display
    import warnings
    warnings.filterwarnings(action='ignore')

    path = 'path 값은 이용자의 브라우저 드라이버가 설치된 장소로 지정'  # set this to where your browser driver is installed

    def get_video():
        feature = input('검색어를 입력하시오 : ')
        driver = webdriver.Chrome(path)
        driver.get('https://www.youtube.com')
        n = 3
        while n > 0:
            print('웹페이지를 불러오는 중입니다..' + '..' * n)
            time.sleep(1)
            n -= 1
        src = driver.find_element_by_xpath('//*[@id="search"]')
        src.send_keys(feature)
        src.send_keys(Keys.RETURN)
        n = 2
        while n > 0:
            print('검색 결과를 불러오는 중입니다..' + '..' * n)
            time.sleep(1)
            n -= 1
        print('데이터 수집 중입니다....')
        html = driver.page_source
        soup = BeautifulSoup(html)
        df_title = []
        df_link = []
        df_writer = []
        df_view = []
        df_date = []
        for i in range(len(soup.find_all('ytd-video-meta-block', 'style-scope ytd-video-renderer byline-separated'))):
            title = soup.find_all('a', {'id': 'video-title'})[i].text.replace('\n', '')
            link = 'https://www.youtube.com/' + soup.find_all('a', {'id': 'video-title'})[i]['href']
            writer = soup.find_all('ytd-channel-name', 'long-byline style-scope ytd-video-renderer')[i].text.replace('\n', '').split(' ')[0]
            view = soup.find_all('ytd-video-meta-block', 'style-scope ytd-video-renderer byline-separated')[i].text.split('•')[1].split('\n')[3]
            date = soup.find_all('ytd-video-meta-block', 'style-scope ytd-video-renderer byline-separated')[i].text.split('•')[1].split('\n')[4]
            df_title.append(title)
            df_link.append(link)
            df_writer.append(writer)
            df_view.append(view)
            df_date.append(date)
        df_just_video = pd.DataFrame(columns=['영상제목', '채널명', '영상url', '조회수', '영상등록날짜'])
        df_just_video['영상제목'] = df_title
        df_just_video['채널명'] = df_writer
        df_just_video['영상url'] = df_link
        df_just_video['조회수'] = df_view
        df_just_video['영상등록날짜'] = df_date
        df_just_video.to_csv('../data/df_just_video.csv', encoding='utf-8-sig', index=False)
        driver.close()
        result = input('데이터프레임 저장이 완료되었습니다! 데이터프레임을 조회하시겠습니까? (y/n)')
        if result == 'y':
            display(df_just_video)
            question = input('원하는 영상을 재생하시겠습니까? (y/n)')
            if question == 'y':
                button = int(input('재생하고자 하는 영상의 번호(출력된 표 가장 왼쪽의 번호)를 입력해주세요.'))
                driver = webdriver.Chrome(path)
                driver.get(df_just_video['영상url'][button])
            else:
                return '프로그램을 종료합니다.'
        else:
            return '프로그램을 종료합니다.'

Result

It runs roughly like this.

Once you enter a number, the browser opens the video URL automatically.
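One detail in the code above: the video link is built by joining 'https://www.youtube.com/' and an href with plain string concatenation. If the href already starts with a slash, as the '/watch?v=...' form does, the result contains a double slash. A sketch using the standard library's `urljoin` avoids that; `absolute_video_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.youtube.com/"


def absolute_video_url(href: str) -> str:
    """Turn a site-relative href such as '/watch?v=...' into a full URL."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    print(absolute_video_url("/watch?v=P9uNay2atuQ"))
```

`urljoin` also leaves an href that is already absolute untouched, so the helper is safe to apply to every link.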

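Python YouTube Crawling with Selenium, Part 1

Python Selenium YouTube crawling

Goal: use the webdriver from selenium, Python's browser-automation module, to send search terms to YouTube and collect the resulting video data automatically and in bulk.

The thumbnails, titles, view counts, and other data shown depend on the search term and are loaded dynamically, so selenium is the tool for the job.

▶ Search list

Sample data: the top 10 KOSPI stocks by market capitalization

cd_idx    cop_youtube_search    price
1         삼성전자               80,500
2         SK하이닉스             127,000
3         NAVER                 387,000
4         카카오                 142,500
5         삼성전자우             74,200
6         LG화학                 827,000
7         삼성바이오로직스        853,000
8         현대차                 238,000
9         삼성SDI                639,000
10        셀트리온               281,000

※ This data is only an example for demonstrating YouTube crawling.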
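To go from a list of search terms like the one above to pages the driver can open, each term has to be URL-encoded. The sketch below assumes YouTube's `results?search_query=` address form; `search_url` is a hypothetical helper, and the keywords are the sample company names from the table.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.youtube.com/results"


def search_url(keyword: str) -> str:
    """Build a YouTube search-results URL for one keyword."""
    return SEARCH_ENDPOINT + "?" + urlencode({"search_query": keyword})


if __name__ == "__main__":
    for keyword in ["삼성전자", "SK하이닉스", "NAVER"]:
        print(search_url(keyword))
```

Each URL could then be passed to `driver.get(...)` before the scroll loop runs.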

▶ Importing modules

    # data handling
    import pandas as pd
    from pandas import DataFrame
    import openpyxl

    # selenium
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager  # requires the webdriver-manager package
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.keys import Keys

    # progress and timing
    from tqdm import tqdm
    import time

    # error inspection
    import traceback

Install each module with pip install in a terminal, or through the package manager of whichever Python tool you use.

* Selenium needs a webdriver, so the webdriver program has to be installed separately.

– webdriver download site

▶ Setting webdriver options

    ## webdriver option settings
    options = webdriver.ChromeOptions()
    # options.add_argument('headless')  # hide the Chrome window
    options.add_argument('window-size=1920x1080')  # Chrome driver window size
    options.add_argument("disable-gpu")  # turn off GPU acceleration to speed crawling up slightly
    options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36")  # user agent (network setting)
    options.add_argument("lang=ko_KR")  # primary site language
    driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=options)

Each webdriver option is explained in the comments.

▶ A look at the YouTube site

YouTube search screen

On YouTube, video information appears only after you send a search term and scroll down

(the data is dynamic, which is why Selenium is used).

So we need code that scrolls down automatically.

    # webdriver scroll down
    driver.get(youtubeUrl)
    time.sleep(0.1)
    # driver.execute_script("window.scrollTo(0, 80000)")
    no_of_pagedowns = 10
    elem = driver.find_element_by_tag_name("body")
    # print("Scrolling Down!")
    while no_of_pagedowns:
        # print(10 - no_of_pagedowns, "th Scroll")
        elem.send_keys(Keys.PAGE_DOWN)
        time.sleep(0.5)
        no_of_pagedowns -= 1

▶ Loading the search-term list

    # pull the search terms from Excel
    df = pd.read_excel("c:/sample data.xlsx", sheet_name="기업정보")  # change the path to your own
    data_list = []

The data is loaded with pandas read_excel.

Next part

2021.06.22 – [Programming & Data Analysis/Python] – 파이썬 유튜브 크롤링 셀레니움 2편

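[PYTHON] Python YouTube Crawling (COLDPLAY X BTS)

Hello. If you need more information related to Python,

you can find it at DATA101.

Thank you.

Hello,

This time we will crawl YouTube.

YouTube data can be collected and used for many purposes, such as building a new business, analyzing competitors, and tracking market trends.

– Building a database of service users from e-mail addresses and similar details left in comments

– Reviewing the highlights subscribers point to, using the video timestamps in comments

– Gauging how favorably a video is received from comment reactions

– Using comment text as training data for machine learning or deep learning

The video used for data collection is Coldplay X BTS – My Universe, uploaded on September 30, 2021.

kmong.com/gig/341599

If you need separate YouTube crawling data, contact me through the link on the image and I will reply.

Page outline

1. LIBRARY IMPORT

2. Setup before crawling

3. Source code

4. Saving to CSV

1. LIBRARY IMPORT

The libraries are as follows.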

    import pandas as pd
    import time
    from tqdm.auto import tqdm
    from selenium.webdriver import Chrome
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

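a. Installing Chrome driver

To automate the web with Selenium, install the Chrome web driver.

Download the driver from the "ChromeDriver – WebDriver for Chrome" site.

b. from selenium.webdriver.common.by import By

Rather than using a separate method for each kind of locator, By keeps things tidy.

Use it as driver.find_element(By.<attribute>, '<attribute value>');

to find several items at once, use find_elements.

c. from selenium.webdriver.common.keys import Keys

Sends keyboard input, for scrolling up and down, typing text, and so on.

Keyboard input is sent with the send_keys(*value) function.

d. from selenium.webdriver.support.ui import WebDriverWait

0. Wait till Load Webpage

The browser takes time to load the elements of a web page.

To avoid errors saying an element does not exist, wait until the elements you need are ready.

1. Implicit Waits

driver.implicitly_wait(time_to_wait=5)

Waits up to the given time for the element being searched for to load.

2. Explicit Waits

WebDriverWait(driver, timeout).until(condition)

Polls until the given condition holds or the timeout expires. The original post lists time.sleep(secs) under this heading, but time.sleep is a fixed pause: it waits the given time whether or not the element exists.

e. from selenium.webdriver.support import expected_conditions as EC

Each expected condition (EC) is a callable that WebDriverWait polls repeatedly; once the condition is met it returns a truthy value, typically the located element or elements.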
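To make the difference between a fixed sleep and a condition-based wait concrete, here is a browser-free sketch of the polling idea. It is an illustration, not Selenium's implementation: `wait_until` and its parameters are hypothetical, and the 0.5-second default only mirrors the polling interval WebDriverWait uses by default.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T],
               timeout: float = 10.0,
               poll: float = 0.5) -> T:
    """Call `condition` until it returns something truthy, then return it.

    Raises TimeoutError when `timeout` seconds pass without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)
```

Unlike `time.sleep(3)`, this returns as soon as the condition holds, so fast pages are not slowed down and slow pages still get the full timeout.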

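2. Basic setup before crawling

    # setup before crawling
    chrome_driver_path = r'(path where the Chrome driver is installed)'

    # URL to crawl
    # change only this URL to crawl a different video later
    url_path = 'https://www.youtube.com/watch?v=3YqPKLZF_WU'

    # number of crawl repetitions
    repeat = 3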

3. Source code

    # comment authors
    commenter_lst = []
    # comments
    comment_lst = []
    # like counts
    like_count_lst = []
    # whether the creator gave a heart
    heart_exist_lst = []

    with Chrome(executable_path=chrome_driver_path) as driver:
        # Wait up to the given time for the target to load.
        # The argument is in seconds; the default is 0 seconds.
        wait = WebDriverWait(driver, 20)
        driver.get(url_path)  # video url
        time.sleep(3)

        # If the video autoplays, YouTube moves on to the next video when it ends.
        # To prevent that, pause the video before crawling.
        if driver.find_element_by_class_name("ytp-play-button").get_attribute('aria-label') == '일시중지(k)':
            driver.find_element_by_class_name("ytp-play-button").click()
        else:
            pass

        # first PAGE_DOWN, once
        wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.PAGE_DOWN)
        time.sleep(3)

        # press END repeatedly, with a progress bar
        for item in tqdm(range(repeat)):
            # each END press loads about 20 more comments
            wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
            time.sleep(1)  # wait 1 second after END, then press END again

        # collect the data
        # authors
        # for verified users and official artist channels, .text comes back blank
        try:
            for commenter in tqdm(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#author-text')))):
                # when the author name is blank (verified user or official
                # artist channel), fall back to innerText
                if commenter.text != '':
                    commenter_lst.append(commenter.text)
                else:
                    commenter_temp = commenter.get_attribute("innerText").strip().replace('\n', '')
                    commenter_lst.append(commenter_temp)
        except:
            # nothing was crawled
            commenter_lst.append('')

        # comments
        try:
            for comment in tqdm(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#content-text')))):
                if comment.text != '':
                    comment_temp = comment.text.replace('\n', ' ')
                    comment_lst.append(comment_temp)
                else:
                    comment_lst.append(' ')
        except:
            # nothing was crawled
            comment_lst.append('')

        # like counts
        try:
            for like_count in tqdm(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#vote-count-middle')))):
                if like_count.text != '':
                    like_count_lst.append(like_count.text)
                else:
                    like_count_lst.append('0')
        except:
            # no like count
            like_count_lst.append('0')

        # creator heart
        for creater_heart in tqdm(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#creator-heart')))):
            # check by whether the creator-heart html exists
            try:
                if creater_heart.find_element_by_css_selector('#creator-heart-button'):
                    heart_exist_lst.append('하트')
                else:
                    heart_exist_lst.append('없음')
            except:
                # nothing was crawled
                heart_exist_lst.append('없음')

    print('done')

4. Saving to CSV

    # save location
    save_path = r'(location to save the file)'

    df = pd.DataFrame({'댓글 작성자': commenter_lst,
                       '댓글': comment_lst,
                       '좋아요 개수': like_count_lst,
                       '하트 유/무': heart_exist_lst})

    # start the index at 1
    df.index = df.index + 1

    # save with to_csv
    df.to_csv(save_path + '유튜브 댓글 크롤링 ' + str((repeat + 1) * 20) + '개 크롤링.csv', encoding='utf-8-sig')
    print('save done')

Saving this way produces a csv file like the one below.
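Building a DataFrame from a dict of lists requires every list to have the same length, and the four lists above are filled by separate loops, so their lengths may differ. One way to guard against that, sketched here with a hypothetical helper (`align_columns`), is to pad the shorter lists before building the table. Note that padding only prevents the length error; it cannot tell which comment a missing value belonged to.

```python
from itertools import zip_longest
from typing import Any, Dict, List, Sequence


def align_columns(columns: Dict[str, Sequence[Any]],
                  fill: Any = "") -> List[Dict[str, Any]]:
    """Return one dict per row, padding shorter columns with `fill`."""
    names = list(columns)
    rows = zip_longest(*columns.values(), fillvalue=fill)
    return [dict(zip(names, row)) for row in rows]


if __name__ == "__main__":
    print(align_columns({"author": ["a", "b"], "comment": ["hello"]}))
```

The resulting list of row dicts can be handed to `pd.DataFrame(...)` directly.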

Final result

■ Wrap-up

I hope this post helps with school projects, data building, work, and other purposes.

This post is also uploaded to Kakao Tistory and Naver Blog.

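Crawling YouTube Titles and View Counts with Python

Hello. It has been a while since my last Python post.

Today we will crawl the title and view count of a YouTube page.

Crawling a YouTube page

As in my previous post, crawling starts with checking the page structure.

For the YouTube player page, the information is included in the meta tags.

2021.02.17 – [Tip & Tech/Python] – 파이썬으로 네이버 스포츠 농구 일정 크롤링 하기

The title is there, like this.

itemprop="duration" holds the playback time, and

itemprop="interactionCount" holds the view count.

Finding the elements

I simply parsed the page with BeautifulSoup and located the content with select_one.

select_one returns only the first element that matches the selector.

See the official documentation for detailed usage.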

    import requests
    from bs4 import BeautifulSoup

    doru = requests.get('https://youtu.be/UoRqHy07w8Q')
    # the original called bs4.BeautifulSoup, but only BeautifulSoup is imported
    doru_text = BeautifulSoup(doru.text, 'lxml')
    title = doru_text.select_one('meta[itemprop="name"][content]')['content']
    view = doru_text.select_one('meta[itemprop="interactionCount"][content]')['content']
    print(title)
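The same meta-tag lookup can be tried offline with only the standard library, which is handy for checking the selector logic without a network request. This is a sketch, not a replacement for BeautifulSoup; the class and function names are hypothetical, and the sample HTML in the usage line is made up.

```python
from html.parser import HTMLParser
from typing import Dict


class MetaItempropParser(HTMLParser):
    """Collect the content of <meta itemprop="..." content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.props: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "itemprop" in attributes and "content" in attributes:
            # keep the first value seen, like select_one
            self.props.setdefault(attributes["itemprop"], attributes["content"])


def meta_itemprops(html: str) -> Dict[str, str]:
    """Return a mapping of itemprop name to content for the given HTML."""
    parser = MetaItempropParser()
    parser.feed(html)
    return parser.props


if __name__ == "__main__":
    sample = '<meta itemprop="name" content="Sample title">'
    print(meta_itemprops(sample).get("name"))
```

Using `.get("interactionCount")` on the result returns None for a missing tag instead of raising, which removes the need for a try/except around each lookup.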

To extend this, I changed the code to read from and write to CSV files.

First, list the addresses you want in url.csv and read the file.

Then store them in a list called data.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import time
    import csv

    data = list()
    f = open("c:/python/url.csv", 'r')
    rea = csv.reader(f)
    for row in rea:
        data.append(row)
    f.close()  # the original wrote f.close without parentheses, which never closed the file

    yt_title = []
    yt_view = []
    time.sleep(0)
    for i in data:
        doru = requests.get(i[0])
        doru_text = BeautifulSoup(doru.text, 'html.parser')
        try:
            title = doru_text.select_one('meta[itemprop="name"][content]')['content']
            yt_title.append(title)
        except Exception:
            # fall back to the URL when no title is found
            title = i[0]
            yt_title.append(title)
        try:
            view = doru_text.select_one('meta[itemprop="interactionCount"][content]')['content']
            yt_view.append(view)
        except Exception:
            view = '없음'
            yt_view.append(view)

    result = pd.DataFrame([yt_title, yt_view])
    result.to_csv('c:/python/result.csv', encoding='euc-kr')

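The program then extracts each YouTube title and view count, stores them as lists in yt_title and yt_view,

and finally writes them out to result.csv with to_csv.

Some of the YouTube links were broken, so I handled those cases with try and except.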
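In the listing above, `pd.DataFrame([yt_title, yt_view])` produces one row per list, so the output has two long rows rather than one row per video. If one row per video is what you want, a sketch with the standard `csv` module looks like this; the function names and the "title"/"view" header are assumptions for illustration.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple


def to_rows(titles: Iterable[str], views: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each title with its view count, one tuple per video."""
    return list(zip(titles, views))


def write_results(rows: Iterable[Tuple[str, str]], out: TextIO) -> None:
    """Write a header line followed by one line per video."""
    writer = csv.writer(out)
    writer.writerow(["title", "view"])
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_results(to_rows(["video A", "video B"], ["100", "없음"]), buffer)
    print(buffer.getvalue())
```

To write a real file, open it with `open(path, "w", newline="", encoding="utf-8-sig")` and pass the file object as `out`; utf-8-sig is the encoding the other posts on this page use so that Excel reads Korean text correctly.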

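The same thing can also be done with Selenium.

Installing Selenium

To use Selenium you first have to install it.

Installation is a single command: pip install selenium.

For the Selenium version I adapted code found online

so that it loads the YouTube addresses from url.csv and fetches the title and view count.

It helps that so many people have explained this well online.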

    import requests
    import pandas as pd
    import urllib.request
    from bs4 import BeautifulSoup
    from selenium.webdriver import Chrome
    from selenium.webdriver.common.keys import Keys
    import time

    # header=None keeps the first line of url.csv as data instead of a header
    url = pd.read_csv("c:/down/url.csv", encoding='utf-8', header=None)
    print(url)

    delay = 1
    browser = Chrome(r'c:\down\chromedriver.exe')  # raw string so backslashes are kept as-is
    browser.implicitly_wait(delay)
    browser.get(url.iloc[0, 0])  # first address in the file
    browser.maximize_window()  # enlarge the window

    body = browser.find_element_by_tag_name('body')
    pages = 2
    while pages:
        body.send_keys(Keys.PAGE_DOWN)
        time.sleep(1)
        pages -= 1

    soup = BeautifulSoup(browser.page_source, 'html.parser')

    # look up the title and view count
    title = soup.select_one('meta[itemprop="name"][content]')['content']
    print(title)
    try:
        view = soup.select_one('meta[itemprop="interactionCount"][content]')['content']
        print(view)
    except Exception:
        view = '없음'
        print(view)

    browser.close()

I hope you find it useful.

[Python] YouTube Content Crawler Code Version 1.0

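This post shares the code for a Python-based YouTube content scraper.

📝 Table of contents

1. Key features

2. Installing Chrome

3. Full code

4. Installing packages

5. Code walkthrough

1. Key features

1) Automatically scrape content information from YouTube search results

We use all sorts of keywords to find particular content on YouTube. In this post, I build a scraper that collects information on all the content returned by a keyword search on YouTube.

※ What is scraping?

Scraping means pulling specific data out of a web page. A program that does this kind of data scraping is called a data scraper or web scraper.

[Image: the scraper automatically browsing the content of a YouTube search result]

(1) Data collected

– Content title

– Content link

2) Collected data is formatted as a data frame

[Image: data scraping result]

2. Installing Chrome

This code works only with Chrome. If the Chrome browser is not installed, please download and install it first.

3. Full code

The full code is uploaded to GitHub at the link below; I worked in the Jupyter notebook file in the src folder. Please download the full code.

https://github.com/park-gb/youtube-content-scaper.git

4. Installing packages

A few packages must be installed before running the code. The packages I used are as follows.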

bs4==0.0.1
selenium==4.1.2
webdriver-manager==3.5.3
pandas==1.4.1
numpy==1.22.2

Here are two ways to install the packages.

Method 1. Using pipenv (recommended)

I worked with pipenv, a tool that the Python Packaging User Guide has recommended for managing application dependencies. In my experience pipenv is more powerful and more convenient than virtualenv or venv, so if you use a virtual environment tool, I recommend it.

The Pipfile in the folder downloaded from GitHub records all the package information. If you use pipenv, the single command below installs every required package, pinned to the right version.

pipenv install

I have written up how to use pipenv in detail in a separate post, so please refer to that.

Method 2. Using pip

You can also install the packages one by one.

$ pip install bs4==0.0.1
$ pip install selenium==4.1.2
$ pip install webdriver-manager==3.5.3
$ pip install pandas==1.4.1
$ pip install numpy==1.22.2

5. Code walkthrough

The full code is on GitHub, so anyone can download it. This section is for readers who want more explanation of the workflow or the code.

1) Importing packages

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
import random
import pandas as pd

Install the required packages and import them.

# Use the latest Chrome driver: install a driver matching the Chrome version on this OS into the cache
from selenium.webdriver.chrome.service import Service
service = Service(ChromeDriverManager().install())

Controlling Chrome from a program requires a Chrome driver. You could download the driver file that matches the Chrome version on your PC yourself, but checking the Chrome version and then finding and downloading the matching driver is tedious.

The ChromeDriverManager class in the webdriver-manager package downloads a Chrome driver that matches the Chrome version in use on the current PC and keeps it in a cache; Selenium's Service object then points at that cached driver. Besides taking care of version compatibility, the big advantage is that you never have to install the Chrome driver by hand.

2) Infinite scroll function

def scroll():
    try:
        # Get the current scroll height of the page
        last_page_height = driver.execute_script("return document.documentElement.scrollHeight")
        while True:
            # Random pause for page loading
            # Tuning this to your PC can shorten the scraping time
            pause_time = random.uniform(1, 2)
            # Scroll to the bottom of the page
            driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
            # Wait for the page to load
            time.sleep(pause_time)
            # Scroll up slightly so infinite scroll triggers (i.e., nudge the page up and back down)
            driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight-50)")
            time.sleep(pause_time)
            # Get the new scroll height
            new_page_height = driver.execute_script("return document.documentElement.scrollHeight")
            # Done scrolling (the page height no longer changes)
            if new_page_height == last_page_height:
                print("Scrolling finished")
                break
            # Not done yet: keep scrolling to the bottom
            else:
                last_page_height = new_page_height
    except Exception as e:
        print("Error: ", e)

YouTube pages load new content only as you scroll, so seeing every search result requires an infinite scroll routine. This function waits a short time for the page to load and keeps scrolling until the page height stops changing.
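The stopping rule (scroll, wait, compare heights) can be exercised without a browser. Below is a minimal sketch under stated assumptions: scroll_until_stable, FakePage, and the max_rounds safety cap are hypothetical names and additions for illustration, and the browser is replaced by a stand-in object whose height grows a few times and then stops.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=0.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll rounds performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # wait for new content to load
        new_height = get_height()
        if new_height == last_height:
            return rounds  # nothing new was loaded
        last_height = new_height
    return max_rounds


# Stand-in for a browser: the "page" grows three times, then stops
class FakePage:
    def __init__(self):
        self.heights = [1000, 2000, 3000, 4000]
        self.position = 0

    def height(self):
        return self.heights[self.position]

    def scroll(self):
        self.position = min(self.position + 1, len(self.heights) - 1)


page = FakePage()
rounds = scroll_until_stable(page.height, page.scroll)
print(rounds, page.height())
```

With Selenium, get_height could be a small function that calls driver.execute_script("return document.documentElement.scrollHeight"), and scroll_to_bottom one that calls window.scrollTo as in the function above. The cap on rounds is a design choice for very long result lists, where an unbounded loop could run for a long time.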

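3) Scraping the data

Setting the search keyword

# Search keyword: spaces in the keyword appear as '+' in the URL, so convert them accordingly
SEARCH_KEYWORD = '잭 니콜라스 GC'.replace(' ', '+')

Enter the keyword to search for on YouTube. It becomes part of the URL the driver will open. YouTube represents spaces in a search term as + in the URL, so the replace function converts each space to + automatically. As an example, I searched for content related to 잭 니콜라스 Golf Club.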
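As an alternative to replacing spaces by hand, the standard library can build the query string. This is a sketch, not the post's method: build_search_url and BASE_URL are hypothetical names, and urllib.parse.urlencode is used because it turns spaces into + and also percent-encodes characters such as & that would otherwise break the URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.youtube.com/results"


def build_search_url(keyword):
    """Build a search URL; urlencode turns spaces into '+' and escapes other characters."""
    return BASE_URL + "?" + urlencode({"search_query": keyword})


url = build_search_url("잭 니콜라스 GC")
print(url)

# Round trip: decoding the query string gives back the original keyword
decoded = parse_qs(urlsplit(url).query)["search_query"]
print(decoded)
```

The round trip at the end is only a sanity check that nothing in the keyword was lost when it was encoded.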

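Setting up and running the driver

driver = webdriver.Chrome(service=service)
# Set the URL to scrape
URL = "https://www.youtube.com/results?search_query=" + SEARCH_KEYWORD
# Open the page at that URL through the Chrome driver
driver.get(URL)
# Wait for the page to load
time.sleep(3)
# Run the infinite scroll function
scroll()

Assign the Chrome driver to a variable, then combine the search keyword with the URL structure YouTube uses for keyword searches. The driver opens the combined URL, and once the page has had a few seconds to load, the infinite scroll function runs.

Extracting the page source

# Extract the page source
html_source = driver.page_source
soup_source = BeautifulSoup(html_source, 'html.parser')

This extracts the page source and parses it with BeautifulSoup.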

4) Extracting the data

# All content entries
content_total = soup_source.find_all(class_='yt-simple-endpoint style-scope ytd-video-renderer')
# Extract only the titles
content_total_title = list(map(lambda data: data.get_text().replace("\n", ""), content_total))
# Extract only the links
content_total_link = list(map(lambda data: "https://youtube.com" + data["href"], content_total))
# Format as a dictionary
content_total_dict = {'title': content_total_title, 'link': content_total_link}

First, the content-related elements are extracted from the page source and stored in content_total. The title and link of each entry are then pulled out of content_total and formatted as a dictionary.
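The title and link extraction can be tried on a small hand-written snippet. The sketch below is an illustration only: it uses the standard library's html.parser in place of BeautifulSoup so it runs without extra packages, VideoLinkParser is a hypothetical name, and the sample HTML is made up to mimic the class string used above rather than copied from YouTube.

```python
from html.parser import HTMLParser

TARGET_CLASSES = {"yt-simple-endpoint", "style-scope", "ytd-video-renderer"}


class VideoLinkParser(HTMLParser):
    """Collect title and link from anchors that carry all of TARGET_CLASSES."""

    def __init__(self, base_url="https://youtube.com"):
        super().__init__()
        self.base_url = base_url
        self.results = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if tag == "a" and TARGET_CLASSES <= classes and attr.get("href"):
            self._href = attr["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse newlines and repeated spaces inside the title
            title = " ".join("".join(self._text).split())
            self.results.append({"title": title, "link": self.base_url + self._href})
            self._href = None


sample = """
<a class="yt-simple-endpoint style-scope ytd-video-renderer" href="/watch?v=abc">
  First video
</a>
<a class="other" href="/ignored">Skip me</a>
<a class="yt-simple-endpoint style-scope ytd-video-renderer" href="/watch?v=def">Second
video</a>
"""

parser = VideoLinkParser()
parser.feed(sample)
print(parser.results)
```

Collapsing whitespace with split and join is a slightly broader cleanup than removing newline characters, since it also trims the indentation around the title text.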

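5) Formatting as a data frame

df = pd.DataFrame(content_total_dict)
df

The dictionary is converted into a two-dimensional data frame, and displaying df lets you check the data.

[Image: data scraping result]

6) Saving the data

df.to_csv("../data/content_total.csv", encoding='utf-8-sig')

The data was extracted correctly, so it is saved to the local data folder.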
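The utf-8-sig encoding matters when titles contain Korean text or emoji. The following sketch shows the effect with the standard library's csv module instead of pandas, so it runs without extra packages; the rows and the temporary file path are made-up sample data.

```python
import csv
import os
import tempfile

rows = [
    {"title": "Sample video 😀", "link": "https://www.youtube.com/watch?v=abc"},
    {"title": "두 번째 영상", "link": "https://www.youtube.com/watch?v=def"},
]

path = os.path.join(tempfile.mkdtemp(), "content_total.csv")

# utf-8-sig writes a byte order mark so spreadsheet tools can detect UTF-8
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(rows)

# Reading with the same encoding strips the byte order mark again
with open(path, newline="", encoding="utf-8-sig") as f:
    loaded = list(csv.DictReader(f))

print(loaded == rows)
```

A legacy encoding such as euc-kr cannot represent emoji, which is one reason a UTF-8 variant is the safer choice for scraped titles.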

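Today I shared Python code for a scraper that extracts content information from YouTube search results.

If you find an error in this post, or have a code review, feedback, or questions, they are all very welcome. Please leave a comment below 👇👇👇 🙂

Have a happy and healthy day.

Thank you 😀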

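YouTube Crawling

Hello. An idea came to me at work, and it led me to try crawling YouTube.

I understand YouTubers have YouTube's own analytics solution, but I found myself wondering how much it really helps, and whether I could collect the data myself, analyze it, and derive insights from it.

So I will pick one YouTuber and go from data collection through to analysis.

This post covers the crawling part.

I crawled the YouTuber 이사배.

I did the crawling with Selenium. Of the various ways to crawl, it is the most intuitive, but it has the drawbacks of being slow, complicated, and error-prone.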

1. First, import the modules needed for crawling.

import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from selenium.webdriver import Chrome  # needed for Chrome() below
import re
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import datetime as dt

2. Do the setup needed before loading a URL.

Put simply, think of Selenium as the computer acting as you: it drives the browser the way you would by hand.

delay = 3
browser = Chrome()
browser.implicitly_wait(delay)

2-1. Our target is a YouTuber, so open the YouTube URL.

start_url = 'https://www.youtube.com'
browser.get(start_url)
browser.maximize_window()

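3. Once YouTube is open, open the developer tools (F12). You will see the page source of the window you are looking at.

[Image: developer tools]

Click the arrow icon shown there, or press Ctrl+Shift+C.

We need to search for the YouTuber 이사배, so find the search box in the page source.

[Image: page source for the search box]

4. Now click the search box, type the YouTuber you want to search for, and press Enter.

With Selenium or BeautifulSoup, you can locate elements in the HTML in several ways, such as CSS selectors and XPath. Here I use XPath to find the HTML tags and attributes I want. (To save space and cover them in more detail, I will write about XPath and CSS selectors in a separate post.)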

# 'xpath' is the locator strategy string (the value of By.XPATH)
search_box = browser.find_element('xpath', '//*[@id="search-form"]/div/div/div/div[2]/input')
search_box.click()  # click the search box
search_box.send_keys('이사배')  # type the YouTuber to search for
search_box.send_keys(Keys.RETURN)  # press Enter

4-1. On the results page, click the YouTuber 이사배.

browser.find_element('xpath', '//*[@class="yt-simple-endpoint style-scope ytd-channel-renderer"]/div[2]/h3/span').click()

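What I want to extract is 이사배's video information and comments, so I start by collecting the information for each video.

5. On the channel page you will see several tabs; click the Videos tab.

[Image: clicking the Videos tab on the channel page]

browser.find_element('xpath', '//*[@class="scrollable style-scope paper-tabs"]/paper-tab[2]').click()

5-1. On YouTube, videos do not all appear in the page source until you scroll down, so scroll first. (To save time, I scroll only about 20 times rather than all the way to the bottom.)

# 'tag name' is the locator strategy string (the value of By.TAG_NAME)
body = browser.find_element('tag name', 'body')  # element that receives the scroll keys
num_of_pagedowns = 20  # press Page Down 20 times
while num_of_pagedowns:
    body.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    num_of_pagedowns -= 1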

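5-2. After these steps, the source for almost all of the videos is visible. Now let's collect the information we want!

From the video information I collected most of what is needed per video, such as the title, thumbnail, and view count.

The important point is that Selenium is very slow, so use it only as far as you need it and do the rest with BeautifulSoup and requests. That saves a great deal of time and effort.

Extracting the page source of the page you are currently on is very easy.

html0 = browser.page_source
html = BeautifulSoup(html0, 'html.parser')

After this step, you can continue crawling with BeautifulSoup.

6. Here are the extracted results. The extraction code is too long to include in full here, so I have put it on GitHub.

https://github.com/minyong-shin/Bloging/tree/master/

[Image: data extracted from the video information]

That covers how to crawl a YouTuber's video information. What I plan to do with this data is visualization: tokenize 이사배's video titles, extract the nouns and adjectives, and visualize which nouns and adjectives in a title go with higher view, comment, and like counts.

Which words go with more views, comments, and likes? See you in the next post.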

Crawling YouTube (Title, URL, View Count)

import time

import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

keyword = '오마이걸'
url = 'https://www.youtube.com/results?search_query={}'.format(keyword)

# Selenium 4 style: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service('./chromedriver.exe'))
driver.get(url)
time.sleep(3)  # give the search results time to render
soup = bs(driver.page_source, 'html.parser')
driver.close()

# Each a#video-title element carries the title, URL, and view count
videos = soup.select('a#video-title')

name_list = []
url_list = []
view_list = []

for video in videos:
    name_list.append(video.text.strip())
    url_list.append('{}{}'.format('https://www.youtube.com', video.get('href')))
    # The view count is the last whitespace-separated token of aria-label
    view_list.append(video.get('aria-label').split()[-1])

youtubeDic = {
    '제목': name_list,    # title
    '주소': url_list,     # URL
    '조회수': view_list   # view count
}

youtubeDf = pd.DataFrame(youtubeDic)
# utf-8-sig can represent emoji in titles and opens cleanly in spreadsheet tools
youtubeDf.to_csv('오마이걸유튜브.csv', encoding='utf-8-sig', index=False)

– Set keyword so that the crawl follows whatever search term you want.

– The a element with id='video-title' holds the title, URL, and view count together.

– The get function pulls out the attribute values.

– For the view count, split() breaks the aria-label attribute into tokens, and the last one, [-1], is used as the view count.

– Emoji in the titles made saving fail, so the CSV is written as UTF-8 (utf-8-sig), which can represent them.

Ta-da, done!!
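The last token of aria-label is still a string such as 1,234회. If a number is needed for sorting or plotting, it can be converted as in the sketch below. The label text is a made-up example of the assumed format, parse_view_count is a hypothetical helper, and abbreviated counts (for example, ones written with 만) are not handled.

```python
import re


def parse_view_count(aria_label):
    """Return the last number in the label as an int, or None if there is none."""
    numbers = re.findall(r"\d[\d,]*", aria_label)
    if not numbers:
        return None
    return int(numbers[-1].replace(",", ""))


# Hypothetical label text; the real format depends on YouTube's markup and locale
label = "샘플 영상 게시자: 샘플 채널 1년 전 3분 20초 조회수 1,234,567회"
print(label.split()[-1])        # last whitespace-separated token
print(parse_view_count(label))  # the same value as an integer
```

Taking the last number rather than the first matters here, because the label also contains the upload age and the video length before the view count.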
