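Python regular expression extraction returns an empty list - 有解無憂

I want to extract every URL and title from http://career.cic.tsinghua.edu.cn/xsglxt/f/jyxt/anony/xxfb. The page source is fetched correctly, but the result is an empty list.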
import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
url = 'http://career.cic.tsinghua.edu.cn/xsglxt/f/jyxt/anony/xxfb'
res = requests.get(url, headers=headers).text
print(res)
p_href = '<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void(0)" style="color:.*? ;" fbfw="外">.*?</a>.*?</li> '
href = re.findall(p_href, res, re.S)
p_title = 'href=".*? " fbfw="外">(.*?)</a>'
title = re.findall(p_title, res, re.S)
print(href)
print(title)

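uj5u.com helpful user replied:

1. For HTML, matching with re regular expressions is not really recommended. BeautifulSoup is the better tool.
Python Tutorial Series: BeautifulSoup in Detail
2. If you do have to use a regex, I have debugged a working version for you: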



# Function: Python regex extraction returns an empty list - CSDN forum
#   https://bbs.csdn.net/topics/395845984
# Author: Crifan Li
# Update: 20200215

import codecs

import requests
import re

# headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
# url = 'http://career.cic.tsinghua.edu.cn/xsglxt/f/jyxt/anony/xxfb'
# res = requests.get(url, headers=headers).text
# print(res)

respHtmlFile = "responseHtml.html"

# with open(respHtmlFile, "w") as htmlFp:
#   htmlFp.write(res)
#   htmlFp.close()

def loadTextFromFile(fullFilename, fileEncoding="utf-8"):
    """load file text content from file"""
    with codecs.open(fullFilename, 'r', encoding=fileEncoding) as fp:
        allText = fp.read()
        # logging.debug("Complete load text from %s", fullFilename)
        return allText

res = loadTextFromFile(respHtmlFile)



"""

<li class="clearfix"><span>2020-12-31</span>

      <a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>

    </li>

"""



# Earlier attempts, kept for reference:
# p_href = '<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void(0)" style="color:.*? ;" fbfw="外">.*?</a>.*?</li> '
# p_href = '<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void(0)" style="color:.*?;" fbfw="外">.*?</a>.*?</li>'
# p_href = '<li class="clearfix"><span>.+?</span>\s*<a ahref="(.*?)" href="javascript:void(0)" style="color:.*?;" fbfw="外">.*?</a>\s*</li>'
# p_href = 'href="javascript:void\(0\)" style="color:.*?;" fbfw="外">.*?</a>\s*</li>'
# p_href = '<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*? ;" fbfw="外">.*?</a>.*?</li> '
# p_href = '<li class="clearfix"><span>.+?</span>\s*<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*?;" fbfw="外">.*?</a>\s*</li>'

# Working pattern (raw string, so the backslashes reach the regex engine unchanged)
p_href = r'<li class="clearfix"><span>.*?</span>.*?<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*?;" fbfw="外">.*?</a>.*?</li>'
href = re.findall(p_href, res, re.S)

# Note: the stray space before the closing quote is removed here too;
# with it, p_title matches nothing in the sample above
p_title = r'href=".*?" fbfw="外">(.*?)</a>'
title = re.findall(p_title, res, re.S)

print(href)
print(title)

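Notes:
Your regex is fine overall, but it has two small mistakes:
(1)

style="color:.*? ;"

has an extra space. The page has no space before the semicolon (style="color:#ff0000;"), so it should be:

style="color:.*?;"

(2)

href="javascript:void(0)"

does not escape the parentheses. Unescaped, (0) is a capture group that matches the single character 0, so the pattern looks for void0 instead of the literal void(0). It should be:

href="javascript:void\(0\)"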
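
The two fixes above can be checked in isolation. The following is a minimal sketch, not the answerer's script: the `html` string is a hand-made sample modeled on one list item from the page (a relative `ahref` path is assumed), and the names `broken`, `space_fixed`, and `fixed` are made up for illustration.

```python
import re

# Hand-made sample modeled on one <li> from the page (assumption: relative ahref path)
html = '''<li class="clearfix"><span>2020-12-31</span>
      <a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=41174064&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">學術就業相關資訊————清華大學學生職業發展指導中心</a>
    </li>'''

# Pattern with both mistakes: stray space in 'color:.*? ;' and unescaped parentheses
broken = r'<a ahref="(.*?)" href="javascript:void(0)" style="color:.*? ;" fbfw="外">(.*?)</a>'

# Only the space removed: still no match, because (0) is a group that
# matches the character 0, so the regex expects "void0"
space_fixed = r'<a ahref="(.*?)" href="javascript:void(0)" style="color:.*?;" fbfw="外">(.*?)</a>'

# Both fixes applied: parentheses escaped and the space removed
fixed = r'<a ahref="(.*?)" href="javascript:void\(0\)" style="color:.*?;" fbfw="外">(.*?)</a>'

for name, pattern in [("broken", broken), ("space_fixed", space_fixed), ("fixed", fixed)]:
    # re.S lets '.' match newlines, as in the original code
    print(name, re.findall(pattern, html, re.S))
```

Only the last pattern returns a match, which suggests that both mistakes have to be fixed before the list stops being empty.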

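uj5u.com helpful user replied:

For more on regular expressions, see the tutorials I wrote:

Widely Used Powerful Search: Regular Expressions

Regular Expressions in Python: the re Module in Detail

uj5u.com helpful user replied:

I have since put together a complete BeautifulSoup tutorial:
A Powerful Tool for Web Page Parsing: BeautifulSoup

It includes a complete implementation of your requirement here with both re and bs4.

The code is as follows: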

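# Function: Show by comparison how to extract content from HTML with BeautifulSoup and with re
#   The example requirement comes from this post:
#     Python regex extraction returns an empty list - CSDN forum
#     https://bbs.csdn.net/topics/395845984
#   It has since been compiled into the tutorial:
#     A Powerful Tool for Web Page Parsing: BeautifulSoup
#     http://book.crifan.com/books/html_parse_tool_beautifulsoup/website
# Author: Crifan Li
# Update: 20200216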



import codecs
from bs4 import BeautifulSoup
import re

respHtmlFile = "responseHtml.html"

# First run: fetch the page and save the HTML to a file
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
url = 'http://career.cic.tsinghua.edu.cn/xsglxt/f/jyxt/anony/xxfb'
respHtml = requests.get(url, headers=headers).text
# save html to file for later debug
# (explicit utf-8 so the file can be read back with loadTextFromFile's default encoding;
#  the with block closes the file, so no close() call is needed)
with open(respHtmlFile, "w", encoding="utf-8") as htmlFp:
  htmlFp.write(respHtml)

# # Later debugging: read the HTML back from the file, which is more convenient
# def loadTextFromFile(fullFilename, fileEncoding="utf-8"):
#   """load file text content from file"""
#   with codecs.open(fullFilename, 'r', encoding=fileEncoding) as fp:
#     allText = fp.read()
#     # logging.debug("Complete load text from %s", fullFilename)
#     return allText
# respHtml = loadTextFromFile(respHtmlFile)



"""

[HTML source to process]

    <li class="clearfix"><span>2020-12-31</span>

        <a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>

    </li>

    <li class="clearfix"><span>2020-12-31</span>

        <a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=41174064&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">學術就業相關資訊————清華大學學生職業發展指導中心</a>

    </li>
  ...

[Background]
The structure of the HTML elements above is:
li
  span
  a
    Chinese text

[Requirement]
Suppose we want to extract, for the a in each li:
  the link address in ahref
  the Chinese text

"""



################################################################################
# Extract HTML content with re
################################################################################
print("="*20, "Extract HTML content with re", "="*20)

print("="*10, "Method 1: use re.findall to get both values at once", "="*10)

# Pattern option 1: match the whole li
wholeLiP = r'<li\s+class="clearfix"><span>.*?</span>\s*<a\s+ahref="(.*?)"\s+href="javascript:void\(0\)"\s+style="color:.*?;"\s+fbfw="外">(.*?)</a>\s*</li>'
print("wholeLiP=%s" % wholeLiP)

foundAllMatchedTupleList = re.findall(wholeLiP, respHtml, re.S)

# # Pattern option 2: match only the a
# onlyAP = r'<a\s+ahref="(.*?)"\s+href="javascript:void\(0\)"\s+style="color:.*?;"\s+fbfw="外">(.*?)</a>'
# print("onlyAP=%s" % onlyAP)
# foundAllMatchedTupleList = re.findall(onlyAP, respHtml, re.S)

# print("foundAllMatchedTupleList=%s" % foundAllMatchedTupleList)
for curIdx, eachTuple in enumerate(foundAllMatchedTupleList):
  # The regex has 2 pairs of parentheses, i.e. 2 groups: ahref="(.*?)" and >(.*?)</a>
  # -> each match is therefore a tuple of 2 elements, one per group
  print("-"*10, "[%d]" % curIdx, "-"*10)
  print("type(eachTuple)=%s" % type(eachTuple))
  # type(eachTuple)=<class 'tuple'>
  (matchedFirstGroupStr, matchedSecondGroupStr) = eachTuple
  ahrefValue = matchedFirstGroupStr
  contentStrValue = matchedSecondGroupStr
  print("ahrefValue=%s, contentStrValue=%s" % (ahrefValue, contentStrValue))
  # ahrefValue=/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=, contentStrValue=2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總



print("="*10, "Method 2: use re.finditer, which allows more possibilities", "="*10)

foundAllMatchObjectList = re.finditer(wholeLiP, respHtml, re.S)
# foundAllMatchObjectList=<callable_iterator object at 0x10f4274e0>
print("foundAllMatchObjectList=%s" % foundAllMatchObjectList)
for curIdx, eachMatchObject in enumerate(foundAllMatchObjectList):
  print("-"*10, "[%d]" % curIdx, "-"*10)
  # re.finditer returns an iterator that yields Match objects (not a list)
  print("type(eachMatchObject)=%s" % type(eachMatchObject))
  # type(eachMatchObject)=<class 're.Match'>
  group1Value = eachMatchObject.group(1)
  group2Value = eachMatchObject.group(2)
  ahrefValue = group1Value
  contentStrValue = group2Value
  print("ahrefValue=%s, contentStrValue=%s" % (ahrefValue, contentStrValue))
  # ahrefValue=/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=, contentStrValue=2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總

  # Additional note:
  # If the regex uses named groups, for example:
  # ... ahref="(?P<ahref>.*?)" ... >(?P<contentStr>.*?)</a>
  # then the values can also be fetched by group name:
  # ahrefValue = eachMatchObject.group("ahref")
  # contentStrValue = eachMatchObject.group("contentStr")



################################################################################
# Extract HTML content with BeautifulSoup
################################################################################
print("="*20, "Extract HTML content with BeautifulSoup", "="*20)

soup = BeautifulSoup(respHtml, 'html.parser')
# print("soup=%s" % soup)

print("="*10, "Method 1: find the outer li first, then find the a inside it", "="*10)

# Way 1 to find li: specify the attribute via attrs
# foundLiList = soup.find_all('li', attrs={"class": "clearfix"})
# Way 2 to find li: class is a reserved word in Python -> in BeautifulSoup versions after 4.1.1, use class_ to specify the CSS class name
foundLiList = soup.find_all('li', class_="clearfix")
# print("foundLiList=%s" % foundLiList)
for curIdx, eachLi in enumerate(foundLiList):
  print("-"*10, "[%d]" % curIdx, "-"*10)
  print("type(eachLi)=%s" % type(eachLi))
  # type(eachLi)=<class 'bs4.element.Tag'>
  foundA = eachLi.find("a", attrs={"fbfw":"外"})
  print("foundA=%s" % foundA)
  # foundA=<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" fbfw="外" href="javascript:void(0)" style="color:#ff0000;">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
  if foundA:
    ahref = foundA["ahref"]
    print("ahref=%s" % ahref)
    # ahref=/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=
    contentStr = foundA.string
    print("contentStr=%s" % contentStr)
    # contentStr=2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總



print("="*10, "Method 2: find the a directly, with extra conditions", "="*10)

# foundAList = soup.find_all('a', attrs={"fbfw":"外"}) # the fbfw condition alone is also enough here
ahrefNonEmptyP = re.compile(r"\S+") # ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type="
print("ahrefNonEmptyP=%s" % ahrefNonEmptyP)
# foundAList = soup.find_all('a', attrs={"fbfw":"外", "ahref": ahrefNonEmptyP})
styleColorP = re.compile("color:#[a-zA-Z0-9]+;") # style="color:#ff0000;"
print("styleColorP=%s" % styleColorP)
foundAList = soup.find_all('a', attrs={"fbfw":"外", "ahref": ahrefNonEmptyP, "style": styleColorP})
# print("foundAList=%s" % foundAList)
for curIdx, eachA in enumerate(foundAList):
  print("-"*10, "[%d]" % curIdx, "-"*10)
  print("type(eachA)=%s" % type(eachA))
  # type(eachA)=<class 'bs4.element.Tag'>
  ahref = eachA["ahref"]
  print("ahref=%s" % ahref)
  # ahref=/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=
  contentStr = eachA.string
  print("contentStr=%s" % contentStr)
  # contentStr=2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總



################################################################################
# Comparison: re vs BeautifulSoup
################################################################################
print("="*20, "Comparison: re vs BeautifulSoup", "="*20)

# Raw string, so the backslashes in the quoted regexes are printed as written
reVsBeautifulSoup = r"""
Drawbacks of re:
If the HTML source changes, even slightly, an existing regular expression can stop matching.
Example:
The order of the attributes of a changes just a little
From
  <a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
to:
  <a href="javascript:void(0)" ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" fbfw="外" style="color:#ff0000;">2019-2020年度全國各地選調生招錄、事業單位人才引進資訊匯總————全國各地選調生資訊匯總</a>
The previous regex:
  '<a\s+ahref="(.*?)"\s+href="javascript:void\(0\)"\s+style="color:.*?;"\s+fbfw="外">(.*?)</a>'
no longer matches and has to be changed to:
  '<a\s+href="javascript:void\(0\)"\s+ahref="(.*?)"\s+fbfw="外"\s+style="color:.*?;">(.*?)</a>'
before it matches again.

Not to mention larger changes to the HTML.
For partly malformed HTML, a regex may be impractical to write at all, because it becomes too complex.

Advantages of BeautifulSoup:
By contrast, small changes to the HTML, such as attributes appearing in a different order,
or somewhat larger ones, such as extra attributes,
or even partly malformed HTML, are all handled internally by BeautifulSoup.
The earlier code, for example:
  soup.find_all('a', attrs={"fbfw":"外", "ahref": ahrefNonEmptyP})
keeps working without modification.

In summary:

re
  Performance: good
  HTML support: limited
    Only for fairly regular HTML that is not very complex
BeautifulSoup
  Performance: moderate
  HTML support: very good
    Supports complex HTML as well as changes in the order and position of elements
    Also handles malformed HTML well
"""

print(reVsBeautifulSoup)
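
The script's note on named groups (in the re.finditer section) can be tried on its own. This is a minimal sketch under stated assumptions: the `html` string and its "Sample title" text are made up for illustration, and the pattern is the a-only regex from the script with its two groups given the names `ahref` and `contentStr` mentioned in that note.

```python
import re

# Hand-made sample (assumption): one <a> tag shaped like those on the page
html = '<a ahref="/xsglxt/f/jyxt/anony/showZwxx?zpxxid=104719161&type=" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">Sample title</a>'

# Same shape as the a-only pattern, but with named groups
named_pattern = re.compile(
    r'<a\s+ahref="(?P<ahref>.*?)"\s+href="javascript:void\(0\)"'
    r'\s+style="color:.*?;"\s+fbfw="外">(?P<contentStr>.*?)</a>',
    re.S,
)

# finditer returns an iterator; list() is used here only so it can be reused
matches = list(named_pattern.finditer(html))
for m in matches:
    # Values can be read by group name instead of by position
    print("ahref=%s" % m.group("ahref"))
    print("contentStr=%s" % m.group("contentStr"))
    # groupdict() returns every named group as a dict
    print(m.groupdict())
```

Named groups keep the extraction code readable when a pattern gains or loses groups, since positions such as group(1) and group(2) no longer need to be renumbered.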

The related tutorial is:
Detailed Comparison of BeautifulSoup and re
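
The attribute-order fragility described in the comparison can be reproduced in a few lines. This is only a sketch: to stay runnable without bs4 installed, it swaps BeautifulSoup for Python's built-in `html.parser.HTMLParser` (the same parser the script selects with `'html.parser'`). The `AhrefCollector` class, the `parse_links` helper, and the shortened `ahref` values are hypothetical and not part of the original thread.

```python
import re
from html.parser import HTMLParser

# Hand-made samples (assumption): the same <a> tag with its attributes in two orders
original = '<a ahref="/showZwxx?zpxxid=1" href="javascript:void(0)" style="color:#ff0000;" fbfw="外">Title</a>'
reordered = '<a href="javascript:void(0)" ahref="/showZwxx?zpxxid=1" fbfw="外" style="color:#ff0000;">Title</a>'

# Order-dependent regex, same shape as the a-only pattern in the thread
a_pattern = r'<a\s+ahref="(.*?)"\s+href="javascript:void\(0\)"\s+style="color:.*?;"\s+fbfw="外">(.*?)</a>'


class AhrefCollector(HTMLParser):
    """Hypothetical helper: collects (ahref, text) pairs for <a fbfw="外"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # ahref of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # attributes become a dict, so their order is irrelevant
        if tag == "a" and attr_map.get("fbfw") == "外" and attr_map.get("ahref"):
            self._current = attr_map["ahref"]
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append((self._current, "".join(self._text)))
            self._current = None


def parse_links(html):
    """Run the collector over one HTML string and return the pairs it found."""
    collector = AhrefCollector()
    collector.feed(html)
    collector.close()
    return collector.links


for label, html in [("original", original), ("reordered", reordered)]:
    print(label, "regex :", re.findall(a_pattern, html, re.S))
    print(label, "parser:", parse_links(html))
```

The regex finds the link only in the original ordering, while the parser-based lookup returns the same pair for both, which is the behaviour the comparison attributes to BeautifulSoup.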

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/92308.html

Tags: Scripting languages (Perl/Python)
