Crawler?


Installation

pip3 install beautifulsoup4 requests

Let's crawl a page.

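import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'
}
# requests needs a full URL, including the scheme
r = requests.get("https://www.naver.com", headers=headers)
soup = BeautifulSoup(r.content, 'html.parser', from_encoding='utf-8')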
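
The request above can hang or return an error page without any warning, so it may be worth wrapping it in a small helper. The following is only a sketch: the names `make_soup` and `fetch_soup` and the 10-second timeout are assumptions for illustration, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
}


def make_soup(content):
    """Parse raw response bytes with Python's built-in html.parser."""
    return BeautifulSoup(content, "html.parser", from_encoding="utf-8")


def fetch_soup(url, timeout=10):
    """Hypothetical helper: download `url` and return the parsed document."""
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return make_soup(r.content)
```

Calling `fetch_soup("https://www.naver.com")` would then give the same kind of `soup` object as above, but a timeout or an HTTP error status raises an exception instead of passing unnoticed.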



Sample 1

<div class="area_newsstand">
<div class="an_menulist">
<h3 class="an_tit">
<a href="http://newsstand.naver.com/" class="an_ta" target="_blank" data-clk="nsd.title"><span class="blind">뉴스스탠드</span></a>
</h3>
<div class="an_menulist_section1">
<div class="an_sort" role="tablist">
<a class="as_btn_press _PM_newsstand_total_type is_selected" href="#" role="tab" aria-selected="true" data-clk="nsd.all"><span class="blind">전체 언론사</span></a>
<span class="as_bar" role="presentation"></span>
<a class="as_btn_my _PM_newsstand_my_type" href="#" role="tab" aria-selected="false" data-clk="nsd.my"><span class="blind">MY 뉴스</span></a>
</div>
</div>
<div class="an_menulist_section2">
<div class="an_sort2" role="tablist">
<a class="as2_btn _PM_newsstand_thumb_type is_selected" href="#" role="tab" aria-selected="true" data-clk="nsd.pressview"><i class="as2_btn_ico ico_image"></i><span class="blind">이미지형</span></a>
<a class="as2_btn _PM_newsstand_list_type" href="#" role="tab" aria-selected="false" data-clk="nsd.articleview"><i class="as2_btn_ico ico_list"></i><span class="blind">리스트형</span></a>
<a class="as2_btn" href="http://newsstand.naver.com/config.html" data-clk="nsd.set" target="_blank"><i class="as2_btn_ico ico_setting"></i><span class="blind">설정</span></a>
</div>
<ul class="an_paging">
<li class="ap_list"><a class="ap_btn _PM_newsstand_prev" href="#" data-clk="nsd.prev"><i class="ap_btn_ico ico_left"></i><span class="blind">이전 페이지</span></a></li>
<li class="ap_list"><a class="ap_btn _PM_newsstand_next" href="#" data-clk="nsd.next"><i class="ap_btn_ico ico_right"></i><span class="blind">다음 페이지</span></a></li>
</ul>
</div>
</div>

The enclosing div can be selected by its class:

news_stand = soup.find_all("div", {"class": "area_newsstand"})
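
Since the search returns a list-like result rather than a single tag, a typical next step is to pick one element and search inside it. Here is a minimal sketch using a shortened copy of the markup above (the trimmed HTML and the `tab_labels` variable are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Shortened version of the sample markup, kept small for readability
html = """
<div class="area_newsstand">
<div class="an_sort" role="tablist">
<a class="as_btn_press" href="#" role="tab"><span class="blind">전체 언론사</span></a>
<a class="as_btn_my" href="#" role="tab"><span class="blind">MY 뉴스</span></a>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all always returns a list-like ResultSet, even for a single match
news_stand = soup.find_all("div", {"class": "area_newsstand"})
first = news_stand[0]

# Search again inside the matched div and collect the visible tab labels
tab_labels = [a.get_text(strip=True) for a in first.find_all("a", {"role": "tab"})]
print(len(news_stand), tab_labels)
```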


Sample 2

<span class="blind">언론사 목록</span>

To extract the text "언론사 목록" (list of press outlets) from markup like the above, write the following.

soup.find("span", {"class": "blind"}).text
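
One detail worth keeping in mind: `.text` is available on a single tag, not on the list that a multi-match search returns. The sketch below contrasts the two cases. The second `span` is added here purely for illustration.

```python
from bs4 import BeautifulSoup

# Two spans with the same class; the second one is added for illustration
html = '<span class="blind">언론사 목록</span><span class="blind">뉴스스탠드</span>'
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, so .text works directly
first_text = soup.find("span", {"class": "blind"}).text

# find_all returns every match, so read .text from each tag in turn
all_texts = [span.text for span in soup.find_all("span", {"class": "blind"})]

print(first_text)
print(all_texts)
```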


Sample 3

<a class="api_link" href="http://newsstand.naver.com/?list=ct1&pcode=015" aria-haspopup="true" target="_blank">
<img src="https://s.pstatic.net/static/newsstand/up/2017/0424/nsd172736175.png" height="24" alt="한국경제" class="api_logo">
</a>

How do we get the actual value of href from markup like the above?

soup.find("a").get('href')
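
On a full page the first `a` tag is rarely the one you want, so narrowing the search by class may be safer. A small sketch on the sample markup above, which also reads the `alt` text of the nested image (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<a class="api_link" href="http://newsstand.naver.com/?list=ct1&pcode=015" aria-haspopup="true" target="_blank">
<img src="https://s.pstatic.net/static/newsstand/up/2017/0424/nsd172736175.png" height="24" alt="한국경제" class="api_logo">
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the link with the expected class
link = soup.find("a", {"class": "api_link"})

href = link.get("href")          # attribute value as a string
missing = link.get("title")      # .get returns None when the attribute is absent
press_name = link.find("img")["alt"]  # [] raises KeyError if the attribute is absent

print(href, press_name)
```

`.get()` is the more forgiving choice when an attribute might be missing, while `tag["attr"]` fails loudly, which can be useful when the attribute is required.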



Wrapping up



References & sources: Beautiful Soup documentation