Scraping text that contains certain characters and names in Python?

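I'm fairly new to Python and am working on a project where I need to collect every quote from certain people across a bunch of articles.

For this question, I'm using this article as an example: https://www.theguardian.com/us-news/2021/oct/17/jeffrey-clark-scrutiny-trump-election-subversion-scheme

Right now, using a lambda, I can scrape the text that contains the name of the person I'm looking for with the following code: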

import requests
from bs4 import BeautifulSoup
url = 'https://www.theguardian.com/us-news/2021/oct/17/jeffrey-clark-scrutiny-trump-election-subversion-scheme'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
tags = soup.find_all('p')
words = ["Michael Bromwich"]
for tag in tags:
    quotes=soup.find("p",{"class":"dcr-s23rjr"}, text=lambda text: text and any(x in text for x in words)).text

print(quotes)

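...which returns the text block containing "Michael Bromwich", and in this case that block actually is a quote from the article. But when scraping 100+ articles this doesn't solve the problem, because other text blocks may also contain the specified name without containing a quotation. I only want the text strings that contain quotes.

Hence my question: is it possible to print all HTML strings that meet the following condition:

The text starts with the character “ (quotation mark) or - (hyphen) AND contains the name "Michael Bromwich" or "John Johnson", etc.

Thanks!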

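Answer from a uj5u.com user:

First, you don't need the for tag in tags loop; you only need to call soup.find_all with your condition.

Next, you can check for the quotation mark or hyphen without any regular expressions: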

quotes = [x.text for x in soup.find_all("p", {"class": "dcr-s23rjr"}, text=lambda t: t and (t.startswith("“") or t.startswith('"') or t.startswith("-")) and any(x in t for x in words))]

The (t.startswith("“") or t.startswith('"') or t.startswith("-")) part checks whether the text starts with “, " or -.
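As a side note, the three startswith calls can be collapsed into one, because str.startswith accepts a tuple of prefixes. The following is only a minimal sketch of that idea: the helper name is_quote_with_name and the constant QUOTE_OPENERS are made up for illustration and are not part of BeautifulSoup.

```python
# Hypothetical helper (the name and the default openers are assumptions).
QUOTE_OPENERS = ("“", '"', "-")


def is_quote_with_name(text, names, openers=QUOTE_OPENERS):
    """Return True if text starts with an opener and mentions any of the names."""
    if not text:
        # BeautifulSoup may hand the filter None for tags without a single string
        return False
    # str.startswith returns True if any prefix in the tuple matches
    return text.startswith(openers) and any(name in text for name in names)
```

It could then be plugged into the same call as above, for example text=lambda t: is_quote_with_name(t, words), which keeps the lambda short and makes the condition easy to test on plain strings.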

Alternatively,

quotes = [x.text for x in soup.find_all("p", {"class": "dcr-s23rjr"}, text=lambda t: t and t.strip()[0] in '“"-' and any(x in t for x in words))]

The t.strip()[0] in '“"-' part checks whether the first character of the stripped text is one of “, " or -.
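One edge case worth noting: if a paragraph's text consists only of whitespace, t is truthy but t.strip() is empty, so t.strip()[0] raises IndexError. Below is a guarded sketch that works on plain strings. The function names starts_with_opener and select_quotes are hypothetical, and the input is assumed to be an iterable of paragraph texts that you have already extracted.

```python
def starts_with_opener(text, openers='“"-'):
    """Return True if the first non-whitespace character is one of the openers."""
    stripped = text.strip() if text else ""
    # Check for an empty string before indexing to avoid IndexError
    return bool(stripped) and stripped[0] in openers


def select_quotes(paragraph_texts, names, openers='“"-'):
    """Keep the stripped paragraphs that open like a quote and mention a name."""
    return [
        text.strip()
        for text in paragraph_texts
        if starts_with_opener(text, openers) and any(name in text for name in names)
    ]
```

The texts could come from something like (p.get_text() for p in soup.find_all("p", {"class": "dcr-s23rjr"})). As far as I know, the text= filter compares against the tag's .string, which is None when a paragraph contains nested tags such as links, so going through get_text() may also catch quotes that the one-liner would skip.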
