[Python] Collecting Naver News Data with a Python Web Crawler

This post uses Python to fetch the titles and article bodies of Naver News articles.

<Finished result>

1. Prerequisites

VS Code, Python

2. Installation

2-1. Installing the web driver (Chrome)

Chrome: https://sites.google.com/chromium.org/driver/

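After downloading, move the driver to a location of your choice. The code below expects chromedriver.exe on the Desktop.

※ The driver version must match the version of the Chrome browser you are using!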
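
The version requirement above can be checked by comparing the two version strings. The following is a minimal sketch, assuming both versions are dotted strings and that agreement on the first three components (major.minor.build) is the check you want; pass parts=4 for an exact match. The helper name versions_match and the sample version numbers are hypothetical.

```python
def versions_match(browser_version: str, driver_version: str, parts: int = 3) -> bool:
    """Return True when the first `parts` dotted components of both versions agree."""
    browser = browser_version.strip().split(".")[:parts]
    driver = driver_version.strip().split(".")[:parts]
    return browser == driver


# Sample values for illustration only.
print(versions_match("99.0.4844.51", "99.0.4844.35"))   # True: same major.minor.build
print(versions_match("99.0.4844.51", "98.0.4758.102"))  # False: different major version
```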

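2-2. Installing packages

The code below was written for selenium 3.

The latest selenium is version 4, and the code below may not work with it.

To use the code below as-is, install version 3 with pip install selenium==3. BeautifulSoup, which the code imports as bs4, is installed with pip install beautifulsoup4.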

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import sys
import getpass

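# Current Windows user name, used to build the Desktop paths below
username = getpass.getuser()

# Redirect print() output so that results are appended to a text file on the Desktop
sys.stdout = open('C:\\Users\\'+username+'\\Desktop\\news_data.txt','a', encoding='UTF-8')

# Start Chrome through the downloaded driver (selenium 3 style: the path is passed positionally)
path = 'C:\\Users\\'+username+'\\Desktop\\chromedriver.exe'
driver = webdriver.Chrome(path)
driver.implicitly_wait(3)  # wait up to 3 seconds when locating elements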
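
The setup code passes the driver path positionally, which is the selenium 3 style. For readers on selenium 4, the following is a rough sketch of the equivalent setup using the Service class from selenium.webdriver.chrome.service. The helper names chromedriver_path and make_driver are hypothetical, and the Desktop location is taken from the setup code.

```python
from pathlib import Path


def chromedriver_path(home: Path) -> Path:
    """Return the Desktop location the setup code expects for chromedriver.exe."""
    return home / "Desktop" / "chromedriver.exe"


def make_driver(driver_path: Path):
    """Start Chrome with the selenium 4 Service-based constructor."""
    # Imported here so the helpers can be loaded even where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=str(driver_path)))
    driver.implicitly_wait(3)  # same 3-second implicit wait as the setup code
    return driver
```

Calling make_driver(chromedriver_path(Path.home())) would then take the place of the webdriver.Chrome(path) line. The rest of the tutorial only uses driver.get, driver.page_source, and driver.quit.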

3. Crawling news titles

def title_search(url, date, page):
    # Build the list-page URL for the given date and page number
    url = url + date + '&page=' + str(page)
    driver.get(url)
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Container that holds the article list
    body = soup.find('div',{'class':'list_body newsflash_body'})
    if body is None:    # page failed to load or the layout differs
        return {}

    # Pair each title with its href per list item, so the two cannot drift out of step
    data_dic = {}
    for li in body.find_all('li'):
        links = li.find_all('a', href=True)
        # Links without text (such as thumbnails) are skipped when picking the title
        titles = [a.get_text(strip=True) for a in links if a.get_text(strip=True)]
        if not titles:
            continue
        data_dic[titles[0]] = [links[0]['href']]    # {title: [href]}

    return data_dic
 
 
# ------------------------------------------------------------------------------------------
# URL
url_date = '20220309'
url_page = 6    # last page to fetch (enter the number of pages)
url = 'https://news.naver.com/main/list.naver?mode=LS2D&sid2=259&mid=shm&sid1=101&date='
# ------------------------------------------------------------------------------------------


data_dic_all = dict()
# main
# Collect titles and hrefs page by page
for page in range(1, url_page + 1):
    result = title_search(url, url_date, page)
    for title, value in result.items():
        # Keep the first entry seen for a title; later duplicates are ignored
        data_dic_all.setdefault(title, value)
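
The two ideas in this section, building the list-page URL and merging page results without overwriting titles already collected, can be tried without a browser. The following is a minimal sketch under those assumptions. The helper names list_page_url and merge_new are hypothetical, and the query parameters are copied from the url variable above.

```python
from urllib.parse import urlencode

BASE_URL = "https://news.naver.com/main/list.naver"


def list_page_url(date: str, page: int) -> str:
    """Build the list-page URL for one date (YYYYMMDD) and page number."""
    params = {
        "mode": "LS2D",
        "sid2": "259",
        "mid": "shm",
        "sid1": "101",
        "date": date,
        "page": page,
    }
    return BASE_URL + "?" + urlencode(params)


def merge_new(collected: dict, page_result: dict) -> dict:
    """Add titles from page_result that are not yet in collected."""
    for title, value in page_result.items():
        collected.setdefault(title, value)  # existing titles keep their first value
    return collected


print(list_page_url("20220309", 1))

# Made-up titles and links, for illustration only.
collected = {"Title A": ["link-1"]}
merge_new(collected, {"Title A": ["link-2"], "Title B": ["link-3"]})
print(collected)  # {'Title A': ['link-1'], 'Title B': ['link-3']}
```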

4. Crawling article content

def content_search(url):
    driver.get(url)
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Container that holds the article body
    body = soup.find('div',{'id':'articleBodyContents'})
    if body is None:    # page failed to load or the layout differs
        return ''

    content_list = []
    for i in body:      # direct children of the body container
        content_text = i.get_text()
        content_text = content_text.strip('\n\t ')      # trim newlines, tabs, and spaces at both ends
        content_text = content_text.replace('\'','"')   # replace single quotes with double quotes
        content_list.append(content_text)
    content_list = list(filter(None, content_list))    # drop empty strings
    content_list = content_list[1:]                    # discard the first non-empty chunk

    # Join the remaining chunks with single spaces
    content = ' '.join(content_list)

    return content
 
# Fetch the article body for each collected title
for i in data_dic_all:
    content = content_search(data_dic_all[i][0])
    data_dic_all[i].append(content)
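
The text cleanup inside content_search can be looked at on its own, without loading a page. The following is a small sketch that mirrors those steps on a plain list of strings: trim each chunk, swap single quotes for double quotes, drop empty chunks, discard the first remaining chunk, and join the rest. The function name join_chunks and the sample chunks are hypothetical.

```python
def join_chunks(chunks):
    """Clean raw text chunks and join them into one string."""
    cleaned = []
    for chunk in chunks:
        text = chunk.strip("\n\t ")    # trim newlines, tabs, and spaces at both ends
        text = text.replace("'", '"')  # single quotes become double quotes
        if text:                       # skip chunks that are empty after trimming
            cleaned.append(text)
    # The first non-empty chunk is discarded, as in content_search.
    return " ".join(cleaned[1:])


# Made-up chunks, for illustration only.
sample = ["\n", "// header script", "\tFirst 'quoted' sentence. ", "", "Second sentence.\n"]
print(join_chunks(sample))  # First "quoted" sentence. Second sentence.
```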

5. Printing the results

The results are saved to a text file on the Desktop:

C:\Users\<username>\Desktop\news_data.txt

# Results
for i in data_dic_all:      # {title: [url, content]}
    print('Title: ' + str(i))
    print('URL: ' + str(data_dic_all[i][0]))
    print('Content: ' + str(data_dic_all[i][1]))
    print('')

# Close the webdriver
driver.quit()
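
The tutorial saves its output by redirecting sys.stdout to a file. As an alternative, the following is a minimal sketch that formats the same {title: [url, content]} dictionary and appends it to a file explicitly, which leaves print() free for progress messages. The function names format_results and append_results and the sample entry are hypothetical.

```python
from pathlib import Path


def format_results(data: dict) -> str:
    """Render {title: [url, content]} entries, with a blank line after each entry."""
    lines = []
    for title, (address, content) in data.items():
        lines.append("Title: " + str(title))
        lines.append("URL: " + str(address))
        lines.append("Content: " + str(content))
        lines.append("")
    return "".join(line + "\n" for line in lines)


def append_results(data: dict, path: Path) -> None:
    """Append the formatted results to a UTF-8 text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(format_results(data))


# Made-up entry, for illustration only.
sample = {"Sample title": ["https://example.com/article", "Sample body text."]}
print(format_results(sample), end="")
```

With this approach, append_results(data_dic_all, some_path) would replace both the sys.stdout redirection and the final print loop.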

sjblog

[Python] 파이썬, 네이버 뉴스 웹 크롤링 데이터 수집

1. 준비물

2. 설치

2-1. 웹 드라이버 설치(크롬)

2-2. 패키지 설치

3. 뉴스 제목 크롤링

4. 기사 내용 크롤링

5. 결과 출력

'컴퓨터 > Python' 카테고리의 다른 글

티스토리툴바

[Python] 파이썬, 네이버 뉴스 웹 크롤링 데이터 수집

1. 준비물

2. 설치

2-1. 웹 드라이버 설치(크롬)

2-2. 패키지 설치

3. 뉴스 제목 크롤링

4. 기사 내용 크롤링

5. 결과 출력

'컴퓨터 > Python' 카테고리의 다른 글

관련글

티스토리툴바