크롤러?

크롤러는 웹상의 데이터를 긁는 프로그램을 이야기한다.
파이썬에서는 크롤러를 지원해주는 beautifulsoup이라는 아주 훌륭한 라이브러리가 있다.
어떻게 사용하는지 알아보자.

설치

pip3 install beautifulsoup4

간단히 pip install을 이용해 설치한다.

페이지를 크롤링 해보자.

크롤러는 웹데이터를 긁는것이고, 간단히 html을 가져와서 내가 필요한부분을 알맞게 파싱하는 것이다.
기본적으로 홈페이지 html을 파싱해보자.

headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)     Chrome/37.0.2049.0 Safari/537.36'
}
r = requests.get("naver.com", headers=headers)
soup = BeautifulSoup(r.content, 'html.parser', from_encoding='utf-8')

가장 중요한 head의 설정이다. userAgency를 설정하게되는데 일반적으로 데스크탑의 웹과같이 같은 html 데이터를 받기위해서 필수적인 작업이된다. 만일 헤더를 설정하지 않는다면 기대했던 화면이 나오지 않을 수있다.
requests를 통해 get으로 해당 페이지에 요청을 보내고 해당 컨테츠내용을 BeautifulSoup에 넣고 htmlparser옵션을 준다. 이제 우리는 크롤링 할 준비가 되었다.

sample1

크롤러는 사실 크롤러 라이브러리가 얼마나 쉽게 되어있는지에 따라 난이도가 갈리게되는데 beautfiulSoup는 굉장히 beautiful하다.
크롤링 데이터는 단계적으로 찾아가는게 가장 쉽다.

<div class="area_newsstand">
<div class="an_menulist">
<h3 class="an_tit">
<a href="http://newsstand.naver.com/" class="an_ta" target="_blank" data-clk="nsd.title"><span class="blind">뉴스스탠드</span></a>
</h3>
<div class="an_menulist_section1">
<div class="an_sort" role="tablist">
<a class="as_btn_press _PM_newsstand_total_type is_selected" href="#" role="tab" aria-selected="true" data-clk="nsd.all"><span class="blind">전체 언론사</span></a>
<span class="as_bar" role="presentation"></span>
<a class="as_btn_my _PM_newsstand_my_type" href="#" role="tab" aria-selected="false" data-clk="nsd.my"><span class="blind">MY 뉴스</span></a>
</div>
</div>
<div class="an_menulist_section2">
<div class="an_sort2" role="tablist">
<a class="as2_btn _PM_newsstand_thumb_type is_selected" href="#" role="tab" aria-selected="true" data-clk="nsd.pressview"><i class="as2_btn_ico ico_image"></i><span class="blind">이미지형</span></a>
<a class="as2_btn _PM_newsstand_list_type" href="#" role="tab" aria-selected="false" data-clk="nsd.articleview"><i class="as2_btn_ico ico_list"></i><span class="blind">리스트형</span></a>
<a class="as2_btn" href="http://newsstand.naver.com/config.html" data-clk="nsd.set" target="_blank"><i class="as2_btn_ico ico_setting"></i><span class="blind">설정</span></a>
</div>
<ul class="an_paging">
<li class="ap_list"><a class="ap_btn _PM_newsstand_prev" href="#" data-clk="nsd.prev"><i class="ap_btn_ico ico_left"></i><span class="blind">이전 페이지</span></a></li>
<li class="ap_list"><a class="ap_btn _PM_newsstand_next" href="#" data-clk="nsd.next"><i class="ap_btn_ico ico_right"></i><span class="blind">다음 페이지</span></a></li>
</ul>
</div>
</div>

우리가 naver.com의 html에서 위와같은 정보만을 가져오려면 어떻게 해야할까?
beatuifulsoup에서 제공하는 아래와같은 함수를 이용하면 된다.

news_stand = soup.findAll("div", {"class":"area_newsstand"})

이와같이 tag와 property로 간단히 해당 블락을 파싱해 가져올 수 있다.
같은식의 반복으로 세부 정보도 파싱이 가능할 것이다.

sample2

tag내의 값 자체를 가져오려면 .text를 붙여주면된다.

<span class="blind">언론사 목록</span>

위와같은 상황에서 ₩언론사 목록₩을 가져오려면 아래와같이 코딩해주면된다.

soup.findAll("span", {"class":"blind"}).text

sample3

1,2 번에서 가장 자주쓰는 함수를 보았다.
그 다음 유용한 함수는 아래와 같이 href 혹 src의 정보를 가져오는 함수이다.

<a class="api_link" href="http://newsstand.naver.com/?list=ct1&pcode=015" aria-haspopup="true" target="_blank">
<img src="https://s.pstatic.net/static/newsstand/up/2017/0424/nsd172736175.png" height="24" alt="한국경제" class="api_logo">
</a>

*위 와같은 정보에서 href의 실제 value를 가져오려면 어떻게 할까?

soup.find("a").get('href')

간단하다. get을 이용해 해당 href를 가져올 수 있다.
이외에도 수많은 편리한 함수들을 제공해주고 있다.

마치며

데이터가 넘치는 세상에서 크롤러는 매우 중요한 의미를 가진다.
빅데이터 분석, 서비스 필요지표 등을 산정할 때 등 그 활용도가 높고 기본적으로 알고있으면 매우 좋은 내용이다.

참고 & 출처 beautfiSoupDoc

python - 크롤러

크롤러?

설치

페이지를 크롤링 해보자.

sample1

sample2

sample3

마치며

CalyFactory

Comments

크롤러?

설치

페이지를 크롤링 해보자.

sample1

sample2

sample3

마치며

Share

CalyFactory

Comments