I'm using BeautifulSoup to try to locate a P tag in an XML parse tree based on its content:
# Import required modules.
from datetime import date
import requests
from bs4 import BeautifulSoup
# Determine today's date.
today = date.today()
# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"
# Initialize a requests Response object.
page = requests.get(url)
# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")
# Remove tags with irregular text.
for i in soup.find_all("P", text="(See § 125.4 of this subchapter for exemptions.) "):
    print(i)
    i.decompose()
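When I run this code, I get a NoneType object (None is printed to the console), even though I know from looking at the XML file that the element exists (including the trailing nbsp). Does Beautiful Soup have a Unicode problem, or am I missing something else?
Thanks!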
Reply from a uj5u.com user:
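The main issue is that text="(See § 125.4 of this subchapter for exemptions.) " looks for an exact match, and none is found because in your XML the text looks like (<I>See</I> § 125.4 of this subchapter for exemptions.) . Because the P tag contains a nested I tag, it has more than one child, so its .string is None and a text= filter has nothing to compare against.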
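To see why the exact match fails, here is a minimal offline sketch. The markup is a hypothetical stand-in for the eCFR document, and it is parsed with the built-in html.parser (an assumption made so the sketch runs without lxml, which is why the tag names are lowercase here); the same .string behaviour applies with features="xml".

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real XML: the <p> holds a nested <i> tag
# and ends with a non-breaking space (\u00a0).
snippet = "<div><p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p></div>"
soup = BeautifulSoup(snippet, "html.parser")

p = soup.find("p")

# A tag with more than one child has no single string, so .string is None.
print(p.string)

# get_text() joins the text of all descendants, including the trailing nbsp.
print(repr(p.get_text()))

# The string= filter (the current name for text=) compares against .string,
# so it finds nothing even when the text is spelled exactly right.
matches = soup.find_all("p", string="(See § 125.4 of this subchapter for exemptions.)\u00a0")
print(len(matches))
```

So the problem is not Unicode handling: the filter compares against .string, which is missing for this tag.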
You can fix this behavior by using CSS selectors and :-soup-contains():
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
Example
from datetime import date
import requests
from bs4 import BeautifulSoup
# Determine today's date.
today = date.today()
# Define the URL to be scraped.
url = f"https://www.ecfr.gov/api/versioner/v1/full/{today}/title-22.xml?chapter=I&subchapter=M&part=121&section=121.1"
# Initialize a requests Response object.
page = requests.get(url)
# Parse the XML data with BeautifulSoup.
soup = BeautifulSoup(page.content, features="xml")
# Remove tags with irregular text.
for i in soup.select('P:-soup-contains("See § 125.4 of this subchapter for exemptions.")'):
    print(i)
    i.decompose()
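If you want to check the selector without calling the eCFR API, the following is a rough offline sketch. The markup and the "Keep me." paragraph are made up for illustration, and the built-in html.parser is used (an assumption, so the sketch runs without lxml; tag names are lowercase for that reason).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one paragraph to remove, one to keep.
snippet = (
    "<div>"
    "<p>(<i>See</i> § 125.4 of this subchapter for exemptions.)\u00a0</p>"
    "<p>Keep me.</p>"
    "</div>"
)
soup = BeautifulSoup(snippet, "html.parser")

# :-soup-contains() tests the combined text of the element and its
# descendants, so the nested <i> tag no longer blocks the match.
selector = 'p:-soup-contains("See § 125.4 of this subchapter for exemptions.")'

removed = []
for tag in soup.select(selector):  # select() returns a list, safe to mutate the tree
    removed.append(tag.get_text())
    tag.decompose()

# Collect the text of the paragraphs that are still in the tree.
remaining = [p.get_text() for p in soup.find_all("p")]
print(removed)
print(remaining)
```

Note that :-soup-contains() is a substring test, so it would also match any other paragraph that happens to contain the same sentence.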
