
Python Web Crawling 2. Scraping HTML / Searching Tags

mcdn 2020. 8. 24. 17:00

https://webnautes.tistory.com/779

 

Python Web Crawling Course - 1. Scraping a Web Page

This covers how to build a simple web crawler with Beautiful Soup. The code was written for Python 3.6. The required modules may differ between versions. A web crawler is a web...

webnautes.tistory.com

 

1. Scraping the entire HTML, part 1: urlopen / BeautifulSoup(html, "html.parser")
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.naver.com")

bsObject = BeautifulSoup(html, "html.parser")


print(bsObject)

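Running this code prints the entire HTML source of the page.

The urlopen function fetches the web page at the given address, and the response is then converted into a BeautifulSoup object.

The BeautifulSoup object holds the parsed document: the page is broken down tag by tag into a tree of tags.

A tag that contains another tag is the parent, and the contained tag is the child.

For example, the html tag has the head and body tags beneath it, and head and body each have their own child tags.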
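The parent/child relationship described above can be checked on a small document. This is only a minimal sketch: the HTML string is a hand-written stand-in (an assumption, not the real naver.com source), so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical, not taken from naver.com)
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The html tag is the parent of head and body
top_level = [child.name for child in soup.html.children]
print(top_level)

# head and body each have their own child tags
head_children = [child.name for child in soup.head.children]
body_children = [child.name for child in soup.body.children]
print(head_children, body_children)

# Every child also points back to the tag that contains it
title_parent = soup.title.parent.name
print(title_parent)
```

The sample string has no whitespace between tags, so every child here is a tag; on a real page the children usually include newline strings as well.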

 

Result
C:\Users\user\PycharmProjects\untitled4\venv\Scripts\python.exe C:/Users/user/PycharmProjects/untitled4/next.py

<!DOCTYPE html>
 <html data-dark="false" lang="ko"> <head> <meta charset="utf-8"/> <title>NAVER</title> <meta content="IE=edge" http-equiv="X-UA-Compatible"/> <meta content="width=1190" name="viewport"/> <meta content="NAVER" name="apple-mobile-web-app-title"> <meta content="index,nofollow" name="robots"> <meta content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요" name="description"> <meta content="네이버" property="og:title"/> <meta content="https://www.naver.com/" property="og:url"/> <meta content="https://s.pstatic.net/static/www/mobile/edit/2016/0705/mobile_212852414260.png" property="og:image"/> <meta content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요" property="og:description"> <meta content="summary" name="twitter:card"/> <meta content="" name="twitter:title"/> <meta content="https://www.naver.com/" name="twitter:url"/> <meta content="https://s.pstatic.net/static/www/mobile/edit/2016/0705/mobile_212852414260.png" name="twitter:image"/> <meta content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요" name="twitter:description"> <link href="https://pm.pstatic.net/dist/css/nmain.20200806.css" rel="stylesheet"/> <link href="https://ssl.pstatic.net/sstatic/search/pc/css/api_atcmp_200709.css" rel="stylesheet"/> <link href="/favicon.ico?1" rel="shortcut icon" type="image/x-icon"> <script defer="defer" src="https://pm.pstatic.net/dist/lib/nelo.20200617.js" type="text/javascript"></script> <script>document.domain="naver.com",window.nmain=window.nmain||{},window.nmain.supportFlicking=!1;var nsc="navertop.v4",ua=navigator.userAgent;window.nmain.isIE=navigator.appName&&0<navigator.appName.indexOf("Explorer")&&ua.toLocaleLowerCase().indexOf("msie 10.0")<0,document.getElementsByTagName("html")[0].setAttribute("data-useragent",ua),window.nmain.isIE&&(Object.create=function(n){function a(){}return a.prototype=n,new a})</script> <script>var darkmode= false;window.naver_corp_da=window.naver_corp_da||{main:{}},window.naver_corp_da.main=window.naver_corp_da.main||{},window.naver_corp_da.main.darkmode=darkmode</script> <script> 
window.nmain.gv = {  isLogin: false,
useId: null,   daInfo: {"ANIMAL":{"menu":"ANIMAL","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000161","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_animal_1","tb":"ANIMAL_1","unit":"SU10567","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000162","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_animal_2","tb":"ANIMAL_1","unit":"SU10568","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"BEAUTY":{"menu":"BEAUTY","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000163","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_beauty_1","tb":"BEAUTY_1","unit":"SU10595","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000164","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_beauty_2","tb":"BEAUTY_1","unit":"SU10596","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"BUSINESS":{"menu":"BUSINESS","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000165","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_business_1","tb":"BUSINESS_1","unit":"SU10577","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000166","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_business_2","tb":"BUSINESS_1","unit":"SU10578","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"CARGAME":{"menu":"CARGAME","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000167","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_cargam
e_1","tb":"CARGAME_1","unit":"SU10587","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000168","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_cargame_2","tb":"CARGAME_1","unit":"SU10588","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"CHINA":{"menu":"CHINA","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000169","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_china_1","tb":"CHINA_1","unit":"SU10591","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000170","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_china_2","tb":"CHINA_1","unit":"SU10592","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"DESIGN":{"menu":"DESIGN","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000171","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_design_1","tb":"DESIGN_1","unit":"SU10569","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000172","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_design_2","tb":"DESIGN_1","unit":"SU10570","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"FARM":{"menu":"FARM","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000173","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_farm_1","tb":"FARM_1","unit":"SU10561","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000174","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_farm_2","tb":"FARM_1","unit":"SU10562",
"calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"FINANCE":{"menu":"FINANCE","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000175","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_finance_1","tb":"FINANCE_1","unit":"SU10563","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000176","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_finance_2","tb":"FINANCE_1","unit":"SU10564","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"ITTECH":{"menu":"ITTECH","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000177","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_ittech_1","tb":"ITTECH_1","unit":"SU10593","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000178","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_ittech_2","tb":"ITTECH_1","unit":"SU10594","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"JOB":{"menu":"JOB","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000179","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_job_1","tb":"JOB_1","unit":"SU10589","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000180","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_job_2","tb":"JOB_1","unit":"SU10590","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"LAW":{"menu":"LAW","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000181","singleDomAdUrl":"https://nv.veta.naver.com/fxshow",
"param":{"da_dom_id":"p_main_law_1","tb":"LAW_1","unit":"SU10573","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000182","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_law_2","tb":"LAW_1","unit":"SU10574","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"LIVING":{"menu":"LIVING","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000183","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_living_1","tb":"LIVING_1","unit":"SU10597","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000184","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_living_2","tb":"LIVING_1","unit":"SU10606","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"LIVINGHOME":{"menu":"LIVINGHOME","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000185","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_livinghome_1","tb":"LIVINGHOME_1","unit":"SU10571","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000186","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_livinghome_2","tb":"LIVINGHOME_1","unit":"SU10572","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"MOMKIDS":{"menu":"MOMKIDS","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000187","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_momkids_1","tb":"MOMKIDS_1","unit":"SU10575","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000188","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param"
:{"da_dom_id":"p_main_momkids_2","tb":"MOMKIDS_1","unit":"SU10576","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"MOVIE":{"menu":"MOVIE","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000189","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_movie_1","tb":"MOVIE_1","unit":"SU10585","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000190","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_movie_2","tb":"MOVIE_1","unit":"SU10586","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"SCHOOL":{"menu":"SCHOOL","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000191","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_school_1","tb":"SCHOOL_1","unit":"SU10579","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000192","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_school_2","tb":"SCHOOL_1","unit":"SU10580","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"SHOW":{"menu":"SHOW","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000193","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_show_1","tb":"SHOW_1","unit":"SU10565","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000194","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_show_2","tb":"SHOW_1","unit":"SU10566","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"TRAVEL":{"menu":"TRAVEL","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adpo
sId":"1000195","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_travel_1","tb":"TRAVEL_1","unit":"SU10581","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000196","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_travel_2","tb":"TRAVEL_1","unit":"SU10582","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]},"WEDDING":{"menu":"WEDDING","childMenu":"","adType":"singleDom","multiDomAdUrl":"","multiDomUnit":"","infoList":[{"adposId":"1000197","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_wedding_1","tb":"WEDDING_1","unit":"SU10583","calp":"-"},"type":{"position":"abs","positionIndex":4,"subject":"contents"},"dom":null},{"adposId":"1000198","singleDomAdUrl":"https://nv.veta.naver.com/fxshow","param":{"da_dom_id":"p_main_wedding_2","tb":"WEDDING_1","unit":"SU10584","calp":"-"},"type":{"position":"abs","positionIndex":8,"subject":"contents"},"dom":null}]}},
svt: 20200824163254,
}; </script> <script> window.nmain.newsstand = {
rcode: '02285104',
newsCastSubsInfo: '',
newsStandSubsInfo: ''
};
window.etc = {  };
window.svr = "<!--cweb401-->"; </script> <script defer="defer" src="https://ssl.pstatic.net/tveta/libs/assets/js/pc/main/min/pc.veta.core.min.js"></script> <script defer="defer" src="https://ssl.pstatic.net/tveta/libs/assets/js/common/min/probe.min.js"></script> <script crossorigin="anonymous" defer="defer" src="https://pm.pstatic.net/dist/js/nmain.e99a35fe.js?o=www" type="text/javascript"></script> <script crossorigin="anonymous" defer="defer" src="https://pm.pstatic.net/dist/lib/search.jindo.20200326.js?o=www" type="text/javascript"></script> <style>:root{color-scheme:light}#_nx_kbd .setkorhelp a{display:none}</style> </link></meta></meta></meta></meta></meta></head> <body> <div id="u_skip"> <a href="#newsstand"><span>뉴스스탠드 바로가기</span></a> <a href="#themecast"><span>주제별캐스트 바로가기</span></a> <a href="#timesquare"><span>타임스퀘어 바로가기</span></a> <a href="#shopcast"><span>쇼핑캐스트 바로가기</span></a> <a href="#account"><span>로그인 바로가기</span></a> </div> <div id="wrap"> <style type="text/css">._4Jo99iys{position:absolute;top:0;left:0;font-size:14px;line-height:0;letter-spacing:-.25px;color:#000}._4Jo99iys a{vertical-align:top;display:inline-block}._4Jo99iys span,._4Jo99iys strong{line-height:49px}._4Jo99iys strong{text-decoration:underline}._4Jo99iys:before{display:inline-block;content:"";vertical-align:top;background-image:url(https://static-whale.pstatic.net/main/sprite-20200709@2x.png);background-repeat:no-repeat;background-size:98px 83px;width:20px;height:20px;margin:15px 8px 0 0;background-position:-26px -42px}[data-useragent*="MSIE 8"] ._4Jo99iys:before{background-image:url(https://static-whale.pstatic.net/main/sprite-20200709.png)}._4Jo99iys.bRBWJSdg{color:#fff}._4Jo99iys._2UEeAc-c{font-size:17px}._4Jo99iys._2UEeAc-c 
strong{text-decoration:none}._4Jo99iys._2UEeAc-c:before{content:none}._1syGnXOL{padding-right:18px}._1syGnXOL._3di88A4c{padding-right:12px}._2aeXMlrb{font-size:12px;height:49px;width:78px;text-decoration:none;color:#fff;font-weight:700;letter-spacing:-.5px}._2aeXMlrb span{text-align:center;margin:9px 0;height:31px;display:block;line-height:31px;border-radius:15px}._2aeXMlrb span:before{display:inline-block;content:"";vertical-align:top;background-image:url(https://static-whale.pstatic.net/main/sprite-20200709@2x.png);background-repeat:no-repeat;background-size:98px 83px}[data-useragent*="MSIE 8"] ._2aeXMlrb span:before{background-image:url(https://static-whale.pstatic.net/main/sprite-20200709.png)}._2aeXMlrb.BMgpjddw{font-size:11px;width:94px}._2aeXMlrb.BMgpjddw span:before{margin:9px 3px 0 0;width:17px;height:13px;background-position:-46px -57px}._3h-N8T9V{display:block;height:49px}._3h-N8T9V img{position:absolute;top:0}._1KncATpM{display:inline-block;content:"";vertical-align:top;background-image:url(https://static-whale.pstatic.net/main/sprite-20200709@2x.png);background-repeat:no-repeat;background-size:98px 83px;margin-top:14px;float:left;width:98px;height:21px;background-position:0 -21px}[data-useragent*="MSIE 8"] ._1KncATpM{background-image:url(https://static-whale.pstatic.net/main/sprite-20200709.png)}._1KncATpM._1emt9DIY{background-position:0 0}._20PYt6lT{font-size:11px;height:49px;cursor:pointer;position:absolute;top:0;right:0;color:#666;opacity:.7}._20PYt6lT:after{width:15px;height:15px;margin-left:4px;background-position:0 -68px;display:inline-block;content:"";vertical-align:top;background-image:url(https://static-whale.pstatic.net/main/sprite-20200709@2x.png);background-repeat:no-repeat;background-size:98px 83px}[data-useragent*="MSIE 8"] ._20PYt6lT:after{background-image:url(https://static-whale.pstatic.net/main/sprite-20200709.png)}._20PYt6lT._39oMCV2N:after{background-position:-46px 
-42px}._20PYt6lT._3wm5EzmJ{color:#fff}._20PYt6lT._3wm5EzmJ:after{background-position:-26px -62px}._1hiMWemA{height:49px}._1hiMWemA .tY_u8r23{position:relative;width:1130px;margin:0 auto}._1hiMWemA .tY_u8r23 a{text-decoration:none}._1hiMWemA._23U_6TM_{position:relative}._1hiMWemA._23U_6TM_:after{position:absolute;z-index:1;content:"";display:block;width:100%;height:1px;bottom:0;background-color:rgba(0,0,0,.05)}</style> <div class="_1hiMWemA _23U_6TM_" data-clk-prefix="top" id="NM_TOP_BANNER" style="background-color:#f9fafe"> <div class="tY_u8r23"> <p class="_4Jo99iys _2UEeAc-c" style="left:432px"> <a class="_1syGnXOL _3di88A4c" data-clk="dropbanner1" href="https://whale.naver.com/details/quicksearch?=main&amp;wpid=RydDy7" tabindex="-1"><span>환율부터 모르는 단어까지, </span><strong>드래그 한 번에 해결!</strong></a> <a class="_2aeXMlrb BMgpjddw" data-clk="dropdownload1" href="http://update.whale.naver.net/downloads/banner/RydDy7/WhaleSetup.exe" id="NM_whale_download_btn"><span style="background-color:#3154b8">다운로드</span></a> </p> <a class="_3h-N8T9V" data-clk="dropbanner1" href="https://whale.naver.com/details/quicksearch?=main&amp;wpid=RydDy7"><i class="_1KncATpM"><span class="blind">NAVER whale</span></i><img alt="" height="49" src="https://static-whale.pstatic.net/main/img_quicksearch@2x.png" style="left:210px" width="210"/></a> <button class="_20PYt6lT _39oMCV2N" data-clk="dropclose1" data-ui-cookie-exp-days="3" data-ui-cookie-key="NM_TOP_PROMOTION" data-ui-cookie-value="1" data-ui-hide-target="#NM_TOP_BANNER" style="display:none" type="button"> 3일 동안 보지 않기 </button> </div> </div> <div id="header" role="banner">
<div class="special_bg">
<div class="group_flex">
<div class="logo_area">
<h1 class="logo_default">
<a class="logo_naver" data-clk="top.logo" href="/"><span class="blind">네이버</span></a>
</h1>
</div>
<div class="service_area">

... (the output is very long)

 

 

2. Finding only the title in the HTML
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.naver.com")
bsObject = BeautifulSoup(html, "html.parser")

print(bsObject.head.title)

This prints only the title tag from the tree of tags.

C:\Users\user\PycharmProjects\untitled4\venv\Scripts\python.exe C:/Users/user/PycharmProjects/untitled4/next.py
<title>NAVER</title>

Process finished with exit code 0

 

That is what the output looks like.
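As a small offline sketch of the same lookup, the snippet below uses a hand-written HTML string (an assumption standing in for the real page) and shows that the tag, its name, and its text are separate things.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the fetched page (hypothetical)
html = "<html><head><title>NAVER</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.head.title   # walk the tree: html -> head -> title
print(title_tag)              # the whole tag, markup included
print(title_tag.name)         # just the tag name
print(title_tag.string)       # just the text inside the tag

# soup.title is a shortcut that finds the first title tag anywhere in the tree
shortcut = soup.title
print(shortcut == title_tag)
```

Going through soup.head first only matters when a tag name could appear in more than one place; for title the shortcut gives the same tag.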

 

The next site I referenced:

http://hleecaster.com/python-web-crawling-with-beautifulsoup/

 

Python Web Crawling Basics (How to Use BeautifulSoup) - 아무튼 워라밸

This post introduces a web crawling method in Python that anyone(?) can follow.

hleecaster.com

 

3. Scraping HTML, part 2: the requests.get function
import requests

webpage = requests.get("https://www.daangn.com/hot_articles")
print(webpage.text)

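So we need to send a request to get the contents of the HTML document. In Python this is convenient with the requests library. It is a third-party package, so if it is not installed yet, install it with pip first.

You pass the URL to requests.get(), as in the code above. For example, this is how to fetch Daangn Market's popular secondhand listings.

Running the code scrapes and prints the entire HTML document of Daangn Market's hot articles page.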

    인기 중고 매물
  </h1>
  <div class="title-line-divider"></div>

  <nav id="hot-articles-navigation">
    <select name="region1" id="region1" onchange="changeRegion(&#39;r1&#39;, this.value)" class="hot-articles-nav-select"><option value="">지역을 선택하세요</option><option value="서울특별시">서울특별시</option>
<option value="부산광역시">부산광역시</option>
<option value="대구광역시">대구광역시</option>
<option value="인천광역시">인천광역시</option>
<option value="광주광역시">광주광역시</option>
<option value="대전광역시">대전광역시</option>
<option value="울산광역시">울산광역시</option>
<option value="세종특별자치시">세종특별자치시</option>
<option value="경기도">경기도</option>
<option value="강원도">강원도</option>
<option value="충청북도">충청북도</option>
<opti
... (rest omitted)
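The request above assumes everything goes well. A slightly more defensive sketch is shown here; fetch_html and its timeout default are hypothetical names chosen for illustration, not part of requests, while requests.get, raise_for_status and RequestException are real.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails.

    fetch_html and the 10-second default are assumptions made for this
    sketch; only the requests calls inside are library API.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx responses into an exception instead of
        # quietly returning an error page
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

It would be called as fetch_html("https://www.daangn.com/hot_articles"). A URL without a scheme, a timeout, or an HTTP error status all end up in the except branch and give None.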

 

 

4. Getting started with BeautifulSoup
import requests
from bs4 import BeautifulSoup

webpage = requests.get("https://www.daangn.com/hot_articles")
soup = BeautifulSoup(webpage.content, "html.parser")

print(soup)

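Unlike the previous section, this one uses BeautifulSoup.

First, import the library with from bs4 import BeautifulSoup.

Then request the web page, pass the response body to BeautifulSoup via .content, and store the result in an object named soup.

The second argument here is "html.parser". Other options such as "lxml" and "html5lib" are also available, each with its own trade-offs, but I will skip the details (honestly, I don't know them well either) and just use html.parser for now.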

    인기 중고 매물
  </h1>
  <div class="title-line-divider"></div>

  <nav id="hot-articles-navigation">
    <select name="region1" id="region1" onchange="changeRegion(&#39;r1&#39;, this.value)" class="hot-articles-nav-select"><option value="">지역을 선택하세요</option><option value="서울특별시">서울특별시</option>
<option value="부산광역시">부산광역시</option>
<option value="대구광역시">대구광역시</option>
<option value="인천광역시">인천광역시</option>
<option value="광주광역시">광주광역시</option>
<option value="대전광역시">대전광역시</option>
<option value="울산광역시">울산광역시</option>
<option value="세종특별자치시">세종특별자치시</option>
<option value="경기도">경기도</option>
<option value="강원도">강원도</option>
<option value="충청북도">충청북도</option>
<opti
... (rest omitted)
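On the parser choice mentioned above: one visible difference between parsers is how they repair broken markup. The sketch below is an illustration under assumptions: the broken HTML string is made up, and lxml and html5lib are separate packages that may not be installed, so they are tried and skipped when missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately broken markup (hypothetical): the <p> tag is never closed
broken_html = "<div><p>unclosed paragraph</div>"

results = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser_name] = str(BeautifulSoup(broken_html, parser_name))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        results[parser_name] = None

for parser_name, repaired in results.items():
    print(parser_name, "->", repaired if repaired is not None else "not installed")
```

html.parser ships with Python, which is one reason to start with it; the others need a pip install before they show up in the comparison.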

 

5. Searching for tags: using soup.p
import requests
from bs4 import BeautifulSoup

webpage = requests.get("https://www.daangn.com/hot_articles")
soup = BeautifulSoup(webpage.content, "html.parser")

print(soup.p)  # first <p> tag in the document

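Appending .p to soup returns the first <p> tag in the document.

<p>당근마켓 앱에서 따뜻한 거래를 직접 경험해보세요!</p>

This is how you search for a tag.

(The two lines starting with C:\Users and Process are always printed in the terminal window, before and after the actual output.)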

 

C:\Users\user\PycharmProjects\untitled4\venv\Scripts\python.exe C:/Users/user/PycharmProjects/untitled4/next.py
<p>당근마켓 앱에서 따뜻한 거래를 직접 경험해보세요!</p>

Process finished with exit code 0

 

The <p> tag as it appears on the actual web page

If you use soup.p.string instead, the <p> tags are dropped and only the text inside is printed.

C:\Users\user\PycharmProjects\untitled4\venv\Scripts\python.exe C:/Users/user/PycharmProjects/untitled4/next.py
당근마켓 앱에서 따뜻한 거래를 직접 경험해보세요!

Process finished with exit code 0
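Two details are easy to miss here: soup.p gives only the first <p> tag, and .string gives text only when the tag has a single child. The sketch below shows both on a made-up HTML string (an assumption, not the Daangn Market page).

```python
from bs4 import BeautifulSoup

# Hand-written sample (hypothetical)
html = "<div><p>First paragraph</p><p>Second <b>bold</b> paragraph</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p              # only the first <p> in the document
all_p = soup.find_all("p")    # every <p>, as a list

first_text = first_p.string       # text of a tag with a single text child
mixed_text = all_p[1].string      # None: this <p> holds text plus a <b> tag
joined_text = all_p[1].get_text() # get_text() joins all nested text

print(first_text)
print(mixed_text)
print(joined_text)
```

When a tag may contain nested tags, get_text() is the safer way to pull out the text.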

 

6. Searching for tags, part 2: trying another tag
import requests
from bs4 import BeautifulSoup

webpage = requests.get("https://www.daangn.com/hot_articles")
soup = BeautifulSoup(webpage.content, "html.parser")

print(soup.h1)

This time .h1 is appended to soup in place of .p, so the first <h1> tag is printed.

 

C:\Users\user\PycharmProjects\untitled4\venv\Scripts\python.exe C:/Users/user/PycharmProjects/untitled4/next.py
<h1 id="fixed-bar-logo-title">
<a href="https://www.daangn.com/">
<span class="sr-only">당근마켓</span>
<img alt="당근마켓" class="fixed-logo" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/logo-basic-00b7e471b721ce9db8b0758c05a84684413c8aef1ad54caa0f3fcbe7328c947f.svg"/>
</a> </h1>

Process finished with exit code 0

 

 

7. Exploring the lower levels of the tree: children
import requests
from bs4 import BeautifulSoup

webpage = requests.get("https://www.daangn.com/hot_articles")
soup = BeautifulSoup(webpage.content, "html.parser")

for child in soup.ul.children:
    print(child)

Tags usually form a hierarchical tree, so to pull out every item directly beneath a tag you can use .children. For example, if there is a list inside a ul tag, it looks like this.

 

 

C:\Users\user\PycharmProjects\untitled4\venv\Scripts\python.exe C:/Users/user/PycharmProjects/untitled4/next.py


<li class="footer-list-item"><a class="trust-link" href="/trust">믿을 수 있는 중고거래</a></li>


<li class="footer-list-item"><a class="trust-link" href="/wv/faqs">자주 묻는 질문</a></li>



Process finished with exit code 0

The <li class="footer-list-item"> elements are what is visible on the web page.
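The blank lines in the output above come from the newline strings between the <li> tags: .children yields those strings as well as the tags. A minimal sketch, assuming a hand-written list that borrows the footer-list-item class name from the output (the item texts are made up):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-written sample; the newlines between tags are intentional
html = """<ul>
<li class="footer-list-item">first</li>
<li class="footer-list-item">second</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# .children yields the newline strings too
all_children = list(soup.ul.children)

# Keep only real tags and drop the whitespace strings
tag_children = [child for child in all_children if isinstance(child, Tag)]

# Equivalent here: direct <li> children only, without descending further
direct_items = soup.ul.find_all("li", recursive=False)

print(len(all_children), len(tag_children), len(direct_items))
```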

 

 

8. Exploring the upper levels of the tree: parents
# soup is the BeautifulSoup object built in the previous example
for parent in soup.ul.parents:
    print(parent)

Naturally, you can also get the tags above a given tag, using .parents. On this page it prints the body tag above ul and then additionally the entire html. Think of it as climbing upward through the tree one level at a time.
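Printing each parent in full produces a lot of output, since every level contains the one below it. Printing only the tag names makes the climb easier to follow. A minimal sketch on a hand-written document (an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Hand-written sample (hypothetical)
html = "<html><body><div><ul><li>item</li></ul></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents walks from the immediate parent up to the document itself
parent_names = [parent.name for parent in soup.ul.parents]
print(parent_names)  # ['div', 'body', 'html', '[document]']
```

The last entry, [document], is the BeautifulSoup object itself, which sits above the html tag.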

 

 

<div class="card-photo">
<img alt="이케아 수납장(1세트 3쪽)" src="https://dnvefa72aowie.cloudfront.net/origin/article/202008/705ab2fd5a4fdceefdc1777a0d0f8a1b4ce143319ea620d381f224cec878ee3a.webp?q=82&amp;s=300x300&amp;t=crop"/>
</div>
<div class="card-desc">
<h2 class="card-title">이케아 수납장(1세트 3쪽)</h2>
<div class="card-region-name">
        대구 수성구 만촌2동
      </div>
<div class="card-price">
        10,000원
      </div>
<div class="card-counts">
<span>
            관심 15
          </span>
          ∙
          <span>
            채팅 37
          </span>
</div>
# Same soup object as above: iterate over the direct children of the first div
for d in soup.div.children:
    print(d)
C:\Users\user\PycharmProjects\untitled4\venv\Scripts\python.exe C:/Users/user/PycharmProjects/untitled4/next.py


<h1 id="fixed-bar-logo-title">
<a href="https://www.daangn.com/">
<span class="sr-only">당근마켓</span>
<img alt="당근마켓" class="fixed-logo" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/logo-basic-00b7e471b721ce9db8b0758c05a84684413c8aef1ad54caa0f3fcbe7328c947f.svg"/>
</a> </h1>


<section id="fixed-bar-search">
<div class="search-input-wrap">
<span class="sr-only">검색</span>
<input class="fixed-search-input" id="header-search-input" name="header-search-input" placeholder="지역, 상품 등을 검색해보세요." type="text"/>
<button id="header-search-button">
<img alt="Search" class="fixed-search-icon" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/search-icon-db20a2e9e6b0fc922b44982d451cf1c967c86e8e8df270e71c300832a6f31f1a.svg"/>
</button>
</div>
</section>


<section id="fixed-bar-download">
<h3 class="hide">다운로드</h3>
<a class="fixed-download-button" href="https://itunes.apple.com/kr/app/pangyojangteo/id1018769995?l=ko&amp;ls=1&amp;mt=8" id="header-download-button-ios" target="_blank">
<img alt="App Store" class="fixed-apple-store" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/apple-store-790f526e762ae58ec39274857ea3434959b7bd40fc23ec1f33a21983f3d024ff.svg"/>
<div class="fixed-download-text">App Store</div>
</a> <a class="fixed-download-button" href="https://play.google.com/store/apps/details?id=com.towneers.www" id="header-download-button-android" target="_blank">
<img alt="Google Play" class="fixed-google-play" src="https://d1unjqcospf8gs.cloudfront.net/assets/home/base/header/google-play-3c1802269ac6bedde598de4f2885286c18492748e5b58bd358254b26ee61e008.svg"/>
<div class="fixed-download-text">Google Play</div>
</a> </section>



Process finished with exit code 0

 
