用變數搜索漂亮的湯html-有解無憂

對于串列中的每個物種，我正在搜索一個網頁，所有網頁都應包含具有字典樣式資訊<dt> english name </dt> <dd> water shrew </dd> , <dt> status </dt> <dd> endangered </dd>等的相同文本框。如上所述，我想要的這些資訊位于一個文本框中，前面有一個標題：<h2 id="_02"> COSEWIC assessment aumary</h2>。這是它的實際外觀。用變數搜索漂亮的湯html

我最終試圖從這個框中提取“瀕危”字串，特別是稍后我想輸入到字典中，包括物種名稱等。對于我遍歷 URL 的每個物種都會略有不同，盡管頁面應該是以相同的方式構建，但包含有關不同物種的資訊。

由于每個物種的“狀態”和“英文名稱”的答案會有所不同，我無法自己查找這些文本，此外，我無法使用 if-else 陳述句，因為它不是頁面上唯一的位置出現關鍵字“瀕危”或“受威脅”的地方。那么有沒有辦法只選擇該文本框中的元素然后進一步搜索？（也不是頁面上唯一的文本框）。或者通過 dt 搜索并檢索相應的 dd？

供參考：https : //www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/pacific-water-shrew-appraisal-summary-2016 .html

謝謝你的時間！！！

uj5u.com熱心網友回復：

假設我正確理解你，應該這樣做：

from bs4 import BeautifulSoup as bs
import requests

url = """https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/pacific-water-shrew-appraisal-summary-2016.html"""
req = requests.get(url)

soup = bs(req.text, 'html.parser')
sumr = soup.select_one('div:has(> h2:-soup-contains-own("COSEWIC assessment aummary")) div[] .dl-horizontal')
targets = sumr.select('dt:has(strong)')
for target in targets:
    print(target.text.strip(),":", target.find_next('dd').text.strip())

輸出：

Common name : Pacific Water Shrew
Scientific name : Sorex bendirii
Status : Endangered
Reason for designation : This shrew is restricted to British Columbia’s Lower Mainland and adjacent low valleys. It is rare there, associated with freshwater streams and adjacent wet habitats. Urban development, agriculture, and forestry have reduced the amount and quality of habitat. There is an inferred and projected ongoing decline in habitat and subpopulations in much of its range in Canada.
Occurrence : British Columbia
Status history : Designated Threatened in April 1994 and in May 2000. Status re-examined and designated Endangered in April 2006. Status re-examined and confirmed in April 2016.

轉載請註明出處，本文鏈接：https://www.uj5u.com/caozuo/377795.html

標籤：html 网页抓取美汤

上一篇：學習網頁抓取..需要對使用xpath插件的xpath="/html/body/div[3]/div[3]/div[4]/div/table[5]有所了解

下一篇：網頁抓取不均勻的非表格內容-標題有多個值時的問題