Python Web Scraping: Three Practical Tips for Parsing HTML with BeautifulSoup

LaoYuan Python blog index: https://blog.csdn.net/LaoYuanPython

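1. Introduction to BeautifulSoup

BeautifulSoup is a powerful tool for parsing HTML in Python crawlers. It is a class provided by the third-party bs4 module and can be thought of as an HTML parsing toolbox that tolerates malformed tags reasonably well. lxml is an HTML parser; when constructing a BeautifulSoup object you should specify which parser to use, and lxml is the recommended choice.

Commands to install BeautifulSoup and lxml: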

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple beautifulsoup4
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple lxml

Importing BeautifulSoup:

from bs4 import BeautifulSoup

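Common BeautifulSoup features for parsing HTML:

Through a BeautifulSoup object you can access the HTML element for a given tag and, from there, the tag's name, its attributes, and the content between its opening and closing tags.
Example: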

from bs4 import BeautifulSoup
import urllib.request


def getURLinf(url):
    # Send a browser-like User-Agent; some sites may reject urllib's default one.
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
    req = urllib.request.Request(url=url, headers=header)
    resp = urllib.request.urlopen(req, timeout=5)
    # decode() with no argument assumes the page is UTF-8 encoded.
    html = resp.read().decode()

    soup = BeautifulSoup(html, 'lxml')
    return (soup, req, resp)


soup, req, resp = getURLinf(r'https://blog.csdn.net/LaoYuanPython/article/details/111303395')

print(soup.p)            # first <p> tag in the document
print(soup.link)         # first <link> tag
print(soup.title)        # the <title> tag
print(soup.link.attrs)   # all attributes of that <link> as a dict
print(soup.link['rel'])  # value of its rel attribute

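The contents attribute of a tag gives access to the HTML elements nested directly beneath it; these child nodes are stored in the list that contents refers to,
for example: print(soup.body.contents)
You can access a tag's parent, children, siblings, and ancestors;
The strings attribute iterates over all text content, skipping the tags themselves;
Methods such as find, find_all, find_parent, and find_parents look up tags that match given conditions;
select locates tags through CSS selectors.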
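
The features listed above can be tried on a small in-memory document. The sketch below uses made-up sample markup (the ids, classes, and text are hypothetical) and Python's built-in html.parser so that it runs without lxml; pass 'lxml' instead to follow the recommendation above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, invented for this illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
<div id="main" class="post">
<p class="intro">Hello <b>world</b></p>
<p>Second paragraph</p>
</div>
</body></html>
"""

# html.parser is an assumption made to avoid the lxml dependency.
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="main")

# contents: direct children, including whitespace text nodes; keep only tags.
child_tags = [c.name for c in div.contents if isinstance(c, Tag)]

# Parent and sibling navigation.
first_p = div.p
parent_name = first_p.parent.name
next_p = first_p.find_next_sibling("p")

# stripped_strings: the text nodes only, with surrounding whitespace removed.
texts = list(div.stripped_strings)

# find_all searches downward, find_parent searches upward.
all_p = div.find_all("p")
b_parent = soup.b.find_parent("p")

# select takes a CSS selector.
intro = soup.select("div#main > p.intro")

print(child_tags, parent_name, next_p.get_text())
print(texts)
print(len(all_p), b_parent["class"], intro[0].get_text())
```
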

For details, see the related posts in LaoYuan's free column "Crawlers" (https://blog.csdn.net/laoyuanpython/category_9103810.html) or the paid column "Getting Started with Python Crawlers" (https://blog.csdn.net/laoyuanpython/category_10762553.html).

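2. Some Parsing Tips

When parsing HTML, the simplest case is locating content with a single search or select on a plain tag name, a single tag attribute (such as id or class), or text. Some cases, however, call for combining methods.

2.1 Locating or searching by a combination of attributes

Often many tags match the tag name, and many still match a single attribute, so several attributes have to be combined. For example: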

<div id="article_content" class="article_content clearfix">
......
</div>
<div id="article_content" class="article_view">
......
</div>
<div id="article_view" class="article_view">
......
</div>

The HTML above contains more than one div tag whose id is article_content. If you use:

>>> text="""
<div id="article_content" class="article_content clearfix">
......
</div>
<div id="article_content" class="article_view">
......
</div>
<div id="article_view" class="article_view">
......
</div>"""
>>> s = BeautifulSoup(text,'lxml')
>>> s.select('div#article_content')
[<div class="article_content clearfix" id="article_content">......</div>, 
<div class="article_view" id="article_content">......</div>]
>>>

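two records are returned. In that case, any of the following four statements, each combining several tag attributes, narrows the result to one:

>>> s.select('div#article_content[class="article_content clearfix"]')
[<div class="article_content clearfix" id="article_content">......</div>]
>>> s.select('div[id="article_content"][class="article_content clearfix"]')
[<div class="article_content clearfix" id="article_content">......</div>]
>>> s.find_all("div", id="article_content", class_='article_content clearfix')
[<div class="article_content clearfix" id="article_content">......</div>]
>>> s.find_all("div", attrs={"id": "article_content", "class": "article_content clearfix"})
[<div class="article_content clearfix" id="article_content">......</div>]

These four forms are equivalent. In a select CSS selector, id can be abbreviated with #. In find_all, the class keyword argument is written class_ to avoid clashing with the Python keyword class, or the attributes can be passed together as an attrs dict. The # shorthand is CSS syntax and only works in select: a string passed as the second positional argument of find_all is treated as a class filter, not as an id. Note that in select each attribute condition must be enclosed in square brackets, and there must be no space between the brackets of different attributes. With a space, the conditions no longer apply to the same tag: the attribute after the space is matched against descendants of the tag matched by the attribute before it.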
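
To make the effect of that space concrete, here is a minimal sketch. The markup is a shortened, hypothetical variant of the sample above (a p tag with class note was added so that the descendant form has something to match), and html.parser is used only to keep the example free of the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <p class="note"> child was added for illustration.
html = """
<div id="article_content" class="article_content clearfix">
  <p class="note">inner</p>
</div>
<div id="article_content" class="article_view">other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# No space between the brackets: both conditions apply to the same <div>.
same_tag = soup.select('div[id="article_content"][class="article_content clearfix"]')

# A space between the brackets is the CSS descendant combinator:
# the second condition is matched against tags nested inside the first.
descendant = soup.select('div[id="article_content"] [class="note"]')

# The .class shorthand matches one class name at a time, whatever else
# is listed in the class attribute.
by_class = soup.select("div#article_content.clearfix")

print([t.name for t in same_tag])    # tags matched by the compound selector
print([t.name for t in descendant])  # tags matched by the descendant selector
print(len(by_class))
```

The [class="..."] form compares the whole attribute string, so it is sensitive to the order of the class names, while the .clearfix form is not.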

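2.2 Locating content through tag relationships

Tag relationships include parent-child, sibling, and ancestor relationships. Sometimes the content to find is hard to pin down on its own, but becomes uniquely identifiable when combined with other tag relationships (mainly parent-child and ancestor relationships).
Example:
The following is part of the HTML from a CSDN blog post that shows the blogger's profile information: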

<div class="data-info d-flex item-tiling">
               <dl class="text-center" title="1055">
                   <a href="https://blog.csdn.net/LaoYuanPython" data-report-click='{"mod":"1598321000_001","spm":"1001.2101.3001.4310"}' data-report-query="t=1">
                       <dt><span class="count">1055</span></dt>
                       <dd class="font">原創</dd>
                   </a>
               </dl>
               <dl class="text-center" data-report-click='{"mod":"1598321000_002","spm":"1001.2101.3001.4311"}' title="22">
                   <a href="https://blog.csdn.net/rank/writing_rank" target="_blank">
                       <dt><span class="count">22</span></dt>
                       <dd class="font">周排名</dd>
                   </a>
               </dl>
           </div>

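In this HTML, suppose you want the blogger's number of original articles and weekly rank. The tags holding the two values are identical: both are span tags with the same attribute and value. The only distinguishing feature is the Chinese text of the dd tag that is a sibling of the span's parent dt tag: 原創 (original) or 周排名 (weekly rank). In this situation, first locate the ancestor tag <div class="data-info d-flex item-tiling">, then within it find the dd tags that carry the Chinese labels, go from each dd to its parent tag, and from that parent reach the span inside the dt child to read the value.

Sample code: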

>>> text="""
<div class="data-info d-flex item-tiling">
               <dl class="text-center" title="1055">
                   <a href="https://blog.csdn.net/LaoYuanPython" data-report-click='{"mod":"1598321000_001","spm":"1001.2101.3001.4310"}' data-report-query="t=1">
                       <dt><span class="count">1055</span></dt>
                       <dd class="font">原創</dd>
                   </a>
               </dl>
               <dl class="text-center" data-report-click='{"mod":"1598321000_002","spm":"1001.2101.3001.4311"}' title="22">
                   <a href="https://blog.csdn.net/rank/writing_rank" target="_blank">
                       <dt><span class="count">22</span></dt>
                       <dd class="font">周排名</dd>
                   </a>
               </dl>
           </div>"""
>>> s = BeautifulSoup(text,'lxml')
>>> subSoup = s.select('[class="data-info d-flex item-tiling"] [class="font"]')
>>> for item in subSoup:
...     parent = item.parent
...     if item.string == '原創':
...         originalNum = int(parent.select('.count')[0].string)
...     elif item.string == '周排名':
...         weekRank = int(parent.select('.count')[0].string)
...
>>> print(originalNum, weekRank)
1055 22
>>>

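Note: the select above also locates tags by attribute, and there is a space between the two bracketed conditions, which means the tag matched by the second condition is searched for among the descendants of the tag matched by the first.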
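
As a variant of the same idea, the sketch below walks the sibling relationship directly and collects every label into a dict instead of testing for each label by name. The function name collect_counts is hypothetical, the markup is trimmed from the sample above, and html.parser is an assumption made so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the profile markup shown above.
html = """
<div class="data-info d-flex item-tiling">
  <dl class="text-center" title="1055">
    <a href="https://blog.csdn.net/LaoYuanPython">
      <dt><span class="count">1055</span></dt>
      <dd class="font">原創</dd>
    </a>
  </dl>
  <dl class="text-center" title="22">
    <a href="https://blog.csdn.net/rank/writing_rank" target="_blank">
      <dt><span class="count">22</span></dt>
      <dd class="font">周排名</dd>
    </a>
  </dl>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def collect_counts(soup):
    """Map each <dd class="font"> label to the integer in its sibling <dt>."""
    counts = {}
    # Ancestor first, then the labelled <dd> tags inside it.
    for label in soup.select("div.data-info dd.font"):
        # The value lives in the <dt> that precedes the label as a sibling.
        dt = label.find_previous_sibling("dt")
        if dt is None:
            continue
        span = dt.find("span", class_="count")
        if span is not None and span.string is not None:
            counts[label.get_text(strip=True)] = int(span.string)
    return counts


stats = collect_counts(soup)
print(stats)
```

Looking values up afterwards, for example stats['原創'], keeps the label strings out of the traversal logic, which may help if the page adds more counters with the same structure.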

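2.3 Removing code before analysis to avoid interference

When parsing HTML, in most cases what needs analyzing is the useful tag information. Technical posts, however, usually contain code, and that code can interfere with the analysis. The code in this article, for instance, contains the HTML samples being analyzed, and if you fetch the full HTML of this article, the same markup also appears in its non-code parts. To exclude the influence of code, remove it from the content first and then analyze what is left.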

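Most technical blogging platforms' editors support marking up code; editors such as Markdown emit code in code tags. If another editor uses a different tag, the approach described below works in the same way once the tag name is confirmed.
The steps are:

Fetch the HTML;
Build the BeautifulSoup object soup;
Call soup.code.extract() or soup.code.decompose() to remove code from the soup object. Each call removes only the first code tag, so repeat it (for example by looping over soup.find_all('code')) until none remain. The difference between the two methods is that decompose simply destroys the tag and its data, while extract removes the tag and returns it as a separate object.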
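
The steps above can be sketched as follows. The markup is a made-up stand-in for a fetched blog page, and html.parser is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with one code block and one inline code span.
html = """
<div id="post">
<p>Intro text</p>
<pre><code>&lt;div id="fake"&gt;sample one&lt;/div&gt;</code></pre>
<p>More text with <code>inline()</code> code</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the first <code> tag from the tree and returns it,
# so its content is still available if needed.
first_code = soup.code.extract()

# soup.code only ever refers to the first <code> tag, so loop over
# find_all to remove the rest; decompose() destroys each tag in place.
for tag in soup.find_all("code"):
    tag.decompose()

remaining = soup.find_all("code")
text = soup.get_text(" ", strip=True)

print(first_code.get_text())
print(remaining)
print(text)
```

find_all returns its results as a list built before the loop starts, so removing tags inside the loop does not disturb the iteration.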

For a worked example, see "N Lines of Python series: separating program code from HTML in four lines" (https://blog.csdn.net/LaoYuanPython/article/details/114729045).

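3. Summary

This article introduced three practical tips for parsing HTML with BeautifulSoup: finding or locating tags by a combination of attributes, locating tags by combining relationships between tags, and removing code tags from the HTML so that code does not affect parsing.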


