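I want to extract each link and title from http://career.cic.tsinghua.edu.cn/xsglxt/f/jyxt/anony/xxfb. The page source is fetched correctly, but both re.findall calls return an empty list.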
import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
url = 'http://career.cic.tsinghua.edu.cn/xsglxt/f/jyxt/anony/xxfb'
res = requests.get(url, headers=headers).text
print(res)
p_href = '<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void(0)" style="color:.*? ;" fbfw="外">.*?</a>.*?</li> '
href = re.findall(p_href, res, re.S)
p_title = 'href=".*? " fbfw="外">(.*?)</a>'
title = re.findall(p_title, res, re.S)
print(href)
print(title)
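Reply from a uj5u.com user:
1. For HTML, matching with regular expressions (re) is not really recommended. Use BeautifulSoup instead; see the tutorial "Python Topic Tutorial: BeautifulSoup Explained".
2. If you do want to use regex, I have debugged a working version for you: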
# Function: Python regex extraction returns an empty list (CSDN forum thread)
# https://bbs.csdn.net/topics/395845984
# Author: Crifan Li
# Update: 20200215
import codecs
import requests
import re

# First run: fetch the page and save it to a file (uncomment to use)
# headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
# url = 'http://career.cic.tsinghua.edu.cn/xsglxt/f/jyxt/anony/xxfb'
# res = requests.get(url, headers=headers).text
# print(res)
respHtmlFile = "responseHtml.html"
# with open(respHtmlFile, "w", encoding="utf-8") as htmlFp:
#     htmlFp.write(res)

def loadTextFromFile(fullFilename, fileEncoding="utf-8"):
    """load file text content from file"""
    with codecs.open(fullFilename, 'r', encoding=fileEncoding) as fp:
        allText = fp.read()
        # logging.debug("Complete load text from %s", fullFilename)
    return allText

# Later runs: read the saved HTML back from the file, which makes debugging easier
res = loadTextFromFile(respHtmlFile)
"""
<li class="clearfix"><span>2020-12-31</span>
<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
</li>
"""
# Earlier attempts, kept for reference:
# p_href = '<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void(0)" style="color:.*? ;" fbfw="外">.*?</a>.*?</li> '
# p_href = '<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void(0)" style="color:.*?;" fbfw="外">.*?</a>.*?</li>'
# p_href = r'<li class="clearfix"><span>.+?</span>\s*<a ahref="(.*?)" href="javascript:void(0)" style="color:.*?;" fbfw="外">.*?</a>\s*</li>'
# p_href = r'href="javascript:void\(0\)" style="color:.*?;" fbfw="外">.*?</a>\s*</li>'
# p_href = r'<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*? ;" fbfw="外">.*?</a>.*?</li> '
# p_href = r'<li class="clearfix"><span>.+?</span>\s*<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*?;" fbfw="外">.*?</a>\s*</li>'
# Working pattern: parentheses escaped, no space before the semicolon, no trailing space
p_href = r'<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*?;" fbfw="外">.*?</a>.*?</li>'
href = re.findall(p_href, res, re.S)
# Same fix applied here: the space before the closing quote is removed,
# otherwise this pattern cannot match style="color:#ff0000;" fbfw="外"> either
p_title = r'href=".*?" fbfw="外">(.*?)</a>'
title = re.findall(p_title, res, re.S)
print(href)
print(title)
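Notes:
The body of your regex is fine, but it contains two small mistakes:
(1)
style="color:.*? ;"
has an extra space before the semicolon. The page has no space there (style="color:#ff0000;"), so it should be:
style="color:.*?;"
(2)
href="javascript:void(0)"
does not escape the parentheses. Unescaped, (0) is a capture group that matches only the character 0, so the pattern looks for void0 instead of void(0). It should be:
href="javascript:void\(0\)"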
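The effect of the two fixes can be checked in isolation. The following is a minimal sketch, not part of the original answer: it runs four variants of the pattern against a single list item modeled on the page's markup (the sample string and the variant names are made up for illustration) and shows that only the variant with both fixes matches.

```python
import re

# One list item modeled on the page's markup (sample data, not a live fetch).
html = (
    '<li class="clearfix"><span>2020-12-31</span>\n'
    '<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=41174064&type=" '
    'href="javascript:void(0)" style="color:#ff0000;" fbfw="外">'
    '學術就業相關資訊————清華大學學生職業發展指導中心</a>\n'
    '</li>'
)

# Hypothetical variant names; each pattern differs only in the two spots discussed above.
patterns = {
    # unescaped parentheses and an extra space before the semicolon
    "original": r'<a ahref="(.*?)" href="javascript:void(0)" style="color:.*? ;" fbfw="外">(.*?)</a>',
    # space removed, parentheses still unescaped: (0) is a group matching "0"
    "space_removed_only": r'<a ahref="(.*?)" href="javascript:void(0)" style="color:.*?;" fbfw="外">(.*?)</a>',
    # parentheses escaped, extra space still present
    "parens_escaped_only": r'<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*? ;" fbfw="外">(.*?)</a>',
    # both fixes applied
    "both_fixed": r'<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*?;" fbfw="外">(.*?)</a>',
}

results = {name: re.findall(pattern, html, re.S) for name, pattern in patterns.items()}
for name, found in results.items():
    print(name, found)
```

Either mistake alone is enough to make re.findall return an empty list, which is why fixing just one of them still appears to do nothing. For longer literal fragments, re.escape can produce the escaped form instead of escaping by hand.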
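Reply from a uj5u.com user:
On regular expressions, see the tutorials I wrote: "Widely Used Powerful Search: Regular Expressions" and
"Regular Expressions in Python: the re Module Explained"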
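As a small illustration of what the re module offers beyond positional groups, here is a sketch that uses named groups with re.finditer on the same kind of markup. The sample string and the names link_pattern and links are assumptions made for this example; the group names ahref and contentStr are free choices, not anything the page or the library prescribes.

```python
import re

# Two list items modeled on the page's markup (sample data, not a live fetch).
html = '''<li class="clearfix"><span>2020-12-31</span>
<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
</li>
<li class="clearfix"><span>2020-12-31</span>
<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=41174064&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">學術就業相關資訊————清華大學學生職業發展指導中心</a>
</li>'''

# (?P<name>...) gives a group a name, so matches can be read by name
# instead of by position.
link_pattern = re.compile(
    r'<a\s+ahref="(?P<ahref>.*?)"\s+href="javascript:void\(0\)"'
    r'\s+style="color:.*?;"\s+fbfw="外">(?P<contentStr>.*?)</a>',
    re.S,
)

# re.finditer yields one Match object per hit; group("name") reads a named group.
links = [
    (match.group("ahref"), match.group("contentStr"))
    for match in link_pattern.finditer(html)
]
for ahref, content in links:
    print("ahref=%s, contentStr=%s" % (ahref, content))
```

Reading groups by name keeps the extraction code unchanged if another capture group is later added in front of these two.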
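Reply from a uj5u.com user:
I have since written a complete BeautifulSoup tutorial: "HTML Parsing Tool: BeautifulSoup".
It implements your requirement in full twice, once with regex (re) and once with bs4.
The code is as follows: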
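# Function: show by comparison how to extract content from HTML with BeautifulSoup and with regex (re)
# The example requirement comes from this thread:
# Python regex extraction returns an empty list (CSDN forum)
# https://bbs.csdn.net/topics/395845984
# Since written up in the tutorial:
# HTML Parsing Tool: BeautifulSoup
# http://book.crifan.com/books/html_parse_tool_beautifulsoup/website
# Author: Crifan Li
# Update: 20200216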
import codecs
from bs4 import BeautifulSoup
import re

respHtmlFile = "responseHtml.html"

# First run: fetch the page and save the HTML to a file
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
url = 'http://career.cic.tsinghua.edu.cn/xsglxt/f/jyxt/anony/xxfb'
respHtml = requests.get(url, headers=headers).text
# save html to file for later debug
# (explicit encoding so the file can be read back as UTF-8 on any platform;
# the with block closes the file, so no separate close() is needed)
with open(respHtmlFile, "w", encoding="utf-8") as htmlFp:
    htmlFp.write(respHtml)

# # Later debugging: read the HTML back from the file instead of fetching it again
# def loadTextFromFile(fullFilename, fileEncoding="utf-8"):
#     """load file text content from file"""
#     with codecs.open(fullFilename, 'r', encoding=fileEncoding) as fp:
#         allText = fp.read()
#         # logging.debug("Complete load text from %s", fullFilename)
#     return allText
# respHtml = loadTextFromFile(respHtmlFile)
"""
[HTML source to process]
<li class="clearfix"><span>2020-12-31</span>
<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
</li>
<li class="clearfix"><span>2020-12-31</span>
<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=41174064&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">學術就業相關資訊————清華大學學生職業發展指導中心</a>
</li>
...
[Background]
The HTML element structure above is:
    li
        span
        a
            Chinese text
[Requirement]
From the a inside each li, extract:
    the link address in ahref
    the Chinese text
"""
################################################################################
# Extract HTML content with regex (re)
################################################################################
print("="*20, "Extract HTML content with regex (re)", "="*20)
print("="*10, "Method 1: use re.findall to get both values at once", "="*10)
# Option 1: match the whole li
wholeLiP = r'<li\s+class="clearfix"><span>.*?</span>\s*<a\s+ahref="(.*?)"\s+href="javascript:void\(0\)"\s+style="color:.*?;"\s+fbfw="外">(.*?)</a>\s*</li>'
print("wholeLiP=%s" % wholeLiP)
foundAllMatchedTupleList = re.findall(wholeLiP, respHtml, re.S)
# # Option 2: match only the a
# onlyAP = r'<a\s+ahref="(.*?)"\s+href="javascript:void\(0\)"\s+style="color:.*?;"\s+fbfw="外">(.*?)</a>'
# print("onlyAP=%s" % onlyAP)
# foundAllMatchedTupleList = re.findall(onlyAP, respHtml, re.S)
# print("foundAllMatchedTupleList=%s" % foundAllMatchedTupleList)
for curIdx, eachTuple in enumerate(foundAllMatchedTupleList):
    # The regex has 2 pairs of parentheses, i.e. 2 groups: ahref="(.*?)" and >(.*?)</a>
    # -> each match is a tuple of 2 elements, one per group
    print("-"*10, "[%d]" % curIdx, "-"*10)
    print("type(eachTuple)=%s" % type(eachTuple))
    # type(eachTuple)=<class 'tuple'>
    (matchedFirstGroupStr, matchedSecondGroupStr) = eachTuple
    ahrefValue = matchedFirstGroupStr
    contentStrValue = matchedSecondGroupStr
    print("ahrefValue=%s, contentStrValue=%s" % (ahrefValue, contentStrValue))
    # ahrefValue=/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=, contentStrValue=2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總

print("="*10, "Method 2: use re.finditer, which allows more flexibility", "="*10)
foundAllMatchObjectList = re.finditer(wholeLiP, respHtml, re.S)
# foundAllMatchObjectList=<callable_iterator object at 0x10f4274e0>
print("foundAllMatchObjectList=%s" % foundAllMatchObjectList)
for curIdx, eachMatchObject in enumerate(foundAllMatchObjectList):
    print("-"*10, "[%d]" % curIdx, "-"*10)
    # re.finditer returns an iterator of Match objects (not a list)
    print("type(eachMatchObject)=%s" % type(eachMatchObject))
    # type(eachMatchObject)=<class 're.Match'>
    group1Value = eachMatchObject.group(1)
    group2Value = eachMatchObject.group(2)
    ahrefValue = group1Value
    contentStrValue = group2Value
    print("ahrefValue=%s, contentStrValue=%s" % (ahrefValue, contentStrValue))
    # ahrefValue=/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=, contentStrValue=2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總
    # Additional note:
    # if the regex uses named groups, for example:
    # ... ahref="(?P<ahref>.*?)" ... >(?P<contentStr>.*?)</a>
    # then the values can also be read by group name:
    # ahrefValue = eachMatchObject.group("ahref")
    # contentStrValue = eachMatchObject.group("contentStr")
################################################################################
# Extract HTML content with BeautifulSoup
################################################################################
print("="*20, "Extract HTML content with BeautifulSoup", "="*20)
soup = BeautifulSoup(respHtml, 'html.parser')
# print("soup=%s" % soup)

print("="*10, "Method 1: find the outer li first, then the a inside it", "="*10)
# Way 1 to find the li: specify attributes through attrs
# foundLiList = soup.find_all('li', attrs={"class": "clearfix"})
# Way 2 to find the li: class is a reserved word in Python -> since BeautifulSoup 4.1.2, use class_ to give the CSS class name
foundLiList = soup.find_all('li', class_="clearfix")
# print("foundLiList=%s" % foundLiList)
for curIdx, eachLi in enumerate(foundLiList):
    print("-"*10, "[%d]" % curIdx, "-"*10)
    print("type(eachLi)=%s" % type(eachLi))
    # type(eachLi)=<class 'bs4.element.Tag'>
    foundA = eachLi.find("a", attrs={"fbfw":"外"})
    print("foundA=%s" % foundA)
    # foundA=<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" fbfw="外" href="javascript:void(0)" style="color:#ff0000;">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
    if foundA:
        ahref = foundA["ahref"]
        print("ahref=%s" % ahref)
        # ahref=/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=
        contentStr = foundA.string
        print("contentStr=%s" % contentStr)
        # contentStr=2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總

print("="*10, "Method 2: find the a directly, with extra conditions", "="*10)
# foundAList = soup.find_all('a', attrs={"fbfw":"外"}) # a single condition on fbfw is also enough here
ahrefNonEmptyP = re.compile(r"\S+") # ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type="
print("ahrefNonEmptyP=%s" % ahrefNonEmptyP)
# foundAList = soup.find_all('a', attrs={"fbfw":"外", "ahref": ahrefNonEmptyP})
styleColorP = re.compile(r"color:#[a-zA-Z0-9]+;") # style="color:#ff0000;"
print("styleColorP=%s" % styleColorP)
foundAList = soup.find_all('a', attrs={"fbfw":"外", "ahref": ahrefNonEmptyP, "style": styleColorP})
# print("foundAList=%s" % foundAList)
for curIdx, eachA in enumerate(foundAList):
    print("-"*10, "[%d]" % curIdx, "-"*10)
    print("type(eachA)=%s" % type(eachA))
    # type(eachA)=<class 'bs4.element.Tag'>
    ahref = eachA["ahref"]
    print("ahref=%s" % ahref)
    # ahref=/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=
    contentStr = eachA.string
    print("contentStr=%s" % contentStr)
    # contentStr=2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總
################################################################################
# Comparison: re vs BeautifulSoup
################################################################################
print("="*20, "Comparison: re vs BeautifulSoup", "="*20)
# raw string, so the backslashes in the regex examples are printed as written
reVsBeautifulSoup = r"""
Drawback of regex (re):
    If the HTML source changes, even slightly, an existing regex stops matching
    Example:
        only the order of the attributes of a changes a little
        from
            <a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
        to:
            <a href="javascript:void(0)" ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" fbfw="外" style="color:#ff0000;">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
        the earlier regex:
            '<a\s+ahref="(.*?)"\s+href="javascript:void\(0\)"\s+style="color:.*?;"\s+fbfw="外">(.*?)</a>'
        no longer matches and has to be changed to:
            '<a\s+href="javascript:void\(0\)"\s+ahref="(.*?)"\s+fbfw="外"\s+style="color:.*?;">(.*?)</a>'
        before it matches again.
    Larger changes to the HTML make this worse
    For partly malformed HTML, a regex becomes too complex to write at all

Advantage of BeautifulSoup:
    By contrast, small changes to the HTML, such as attributes appearing in a different order,
    larger changes, such as additional attributes,
    and even partly malformed HTML are all handled internally by BeautifulSoup
    The earlier code, for example:
        soup.find_all('a', attrs={"fbfw":"外", "ahref": ahrefNonEmptyP})
    keeps working without any change.

Summary:
    re
        Performance: good
        HTML support: limited
            only suitable for fairly regular HTML that is not too complex
    BeautifulSoup
        Performance: moderate
        HTML support: very good
            handles complex HTML, and tolerates changes to elements and their positions
            also copes well with malformed HTML
"""
print(reVsBeautifulSoup)
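The attribute-order point in the comparison can be reproduced with a short, dependency-free sketch. It is an illustration, not the answer's code: to keep it runnable without installing bs4, it uses the standard library's html.parser.HTMLParser as a stand-in for BeautifulSoup. The principle is the same (a parser exposes attributes by name, so their order does not matter), but the class LinkCollector and the function collect_links are hypothetical names written for this example.

```python
import re
from html.parser import HTMLParser

# The same link with its attributes in two different orders (sample data).
original = '<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=41174064&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">學術就業相關資訊————清華大學學生職業發展指導中心</a>'
reordered = '<a href="javascript:void(0)" ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=41174064&type=" fbfw="外" style="color:#ff0000;">學術就業相關資訊————清華大學學生職業發展指導中心</a>'

# Regex that hard-codes the attribute order of the original markup.
order_dependent = re.compile(
    r'<a\s+ahref="(.*?)"\s+href="javascript:void\(0\)"'
    r'\s+style="color:.*?;"\s+fbfw="外">(.*?)</a>',
    re.S,
)


class LinkCollector(HTMLParser):
    """Collect (ahref, text) for every <a> that has fbfw="外" and a non-empty ahref."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # [ahref, text parts] while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # lookup by name, independent of attribute order
        if tag == "a" and attr_map.get("fbfw") == "外" and attr_map.get("ahref"):
            self._current = [attr_map["ahref"], []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            ahref, parts = self._current
            self.links.append((ahref, "".join(parts)))
            self._current = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


regex_original = order_dependent.findall(original)
regex_reordered = order_dependent.findall(reordered)
parsed_original = collect_links(original)
parsed_reordered = collect_links(reordered)

print("regex, original order: ", regex_original)
print("regex, reordered:      ", regex_reordered)
print("parser, original order:", parsed_original)
print("parser, reordered:     ", parsed_reordered)
```

The regex finds the link only in the original ordering, while the parser-based lookup returns the same result for both. With BeautifulSoup installed, the find_all call from the script above plays the role of collect_links here.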
The related tutorial is:
"Detailed Comparison of BeautifulSoup and re"
