How to scrape a Python list of multiple URLs

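I scraped a Wikipedia page for the URLs I need and appended them to an empty list in Python. I now need to scrape each URL in that list for specific information such as dates, coordinates, and so on.

Given how the HTML is structured (nested parent/child elements), much of the information cannot be targeted by tag alone. Or can it? See the fact box (infobox) at the following link: https://en.wikipedia.org/wiki/1987_Maryland_train_collision. My goal is to scrape these infoboxes, since most of the pages contain one.

I know you can use conditional statements to pick out specific data from a set of elements that share the same HTML tag. However, I am not sure how to go about it.

So far I have the following: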

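import requests
from bs4 import BeautifulSoup

list_of_urls = []  # my list of urls to be scraped

for i in list_of_urls:
    # The original post did not show how `text` was obtained;
    # fetching it with requests is an assumption.
    text = requests.get(i).text
    soup = BeautifulSoup(text, features="lxml")

    for item in soup.findAll('td', attrs={'class': 'infobox-label'}):

        if item.find('td', attrs={'class': 'infobox-data'}) == "date":
            print(item.find)

            date_info = item.get("infobox-data")
            print(date_info)

            # do something more..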

Any thoughts on the above?
Thank you for your time.

EDIT: Solved by applying Rusticus's methods.

A uj5u.com user replied:

The structure you are inspecting looks like this:

<tr>
  <th scope="row" class="infobox-label" style="white-space:nowrap;padding-right:0.65em;">Date</th>
  <td class="infobox-data" style="line-height:1.3em;">January 4, 1987 <br>1:30 PM</td>
</tr>

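Note that "infobox-label" is on a TH tag, not a TD tag.
item.find is a method; you probably meant print(item).
Once you have found the TH tag, you need to move to the TD tag to get the value. There are several ways to do this; I think the simplest is to reference item.parent.td.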
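
To illustrate the last point, here is a minimal sketch of two ways to get from the TH to its TD, run against a trimmed copy of the row shown above. The `html` string and the variable names are illustrative only, and `html.parser` is used merely so the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row shown above (style attributes dropped).
html = """
<table class="infobox">
<tr>
  <th scope="row" class="infobox-label">Date</th>
  <td class="infobox-data">January 4, 1987 <br>1:30 PM</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
th = soup.find("th", class_="infobox-label")

# 1) Go up to the shared <tr>, then down to its first <td>.
via_parent = th.parent.td

# 2) Step sideways to the next <td> sibling of the <th>.
via_sibling = th.find_next_sibling("td")

# Both routes land on the same element.
assert via_parent is via_sibling

# A separator keeps the date and the time apart at the <br>.
date_info = via_sibling.get_text(" ", strip=True)
print(date_info)
```

Either route works for a simple label/value row; the sibling lookup avoids picking up an unrelated TD if the row ever contains more than one.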

Perhaps you are looking for something like this:

    for item in soup.findAll('th',attrs={'class':'infobox-label'}):
        
        if item.text  == "Date":
            print(item)

            date_info = item.parent.td.text
            print(date_info)

Or simply:

soup.select_one('.infobox').find('th', text="Date").parent.td.text.strip()
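
The one-liner raises AttributeError when a page has no infobox or no "Date" row, which is likely to happen somewhere in a long list of URLs. A hedged variant that returns None instead might look like the sketch below; `infobox_value` is a hypothetical helper name, and `string=` is the newer spelling of the `text=` argument in Beautiful Soup 4.4.0 and later.

```python
from bs4 import BeautifulSoup


def infobox_value(soup, label):
    """Return the text of the infobox row whose <th> reads `label`, or None."""
    box = soup.select_one(".infobox")
    if box is None:
        return None  # page has no infobox at all
    th = box.find("th", string=label)
    if th is None or th.parent.td is None:
        return None  # no such row, or the row has no value cell
    return th.parent.td.get_text(" ", strip=True)


# Small illustrative document; html.parser is used to avoid extra installs.
sample = BeautifulSoup(
    '<table class="infobox"><tr>'
    '<th class="infobox-label">Date</th>'
    '<td class="infobox-data">January 4, 1987 <br>1:30 PM</td>'
    "</tr></table>",
    "html.parser",
)

print(infobox_value(sample, "Date"))
print(infobox_value(sample, "Coordinates"))  # row is absent, so None
```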

For the coordinates:

soup.select_one('.infobox').find('th', text="Coordinates").parent.td.select_one('.geo-dec').text.strip()
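
Putting the pieces together for a whole list of URLs could look roughly like the following. This is only a sketch under stated assumptions: `fetch`, `parse_infobox`, and `scrape_all` are hypothetical names, the User-Agent string is a placeholder, and the standard library's urllib stands in for whatever HTTP client you already use.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page with the standard library (placeholder User-Agent)."""
    req = Request(url, headers={"User-Agent": "infobox-sketch/0.1 (example)"})
    with urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_infobox(html):
    """Map each infobox label to its value text; empty dict if no infobox."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.select_one(".infobox")
    if box is None:
        return {}
    fields = {}
    for th in box.find_all("th", class_="infobox-label"):
        td = th.find_next_sibling("td")
        if td is not None:
            fields[th.get_text(" ", strip=True)] = td.get_text(" ", strip=True)
    return fields


def scrape_all(urls, fetcher=fetch):
    """Return {url: infobox fields} for every URL in the list."""
    return {url: parse_infobox(fetcher(url)) for url in urls}


# Usage with your own list (performs network requests):
#     results = scrape_all(list_of_urls)
#     for url, fields in results.items():
#         print(url, fields.get("Date"))
```

Collecting every label into a dictionary means missing rows simply show up as absent keys, so `fields.get("Date")` yields None rather than an exception. Note that the "Coordinates" cell holds several formats at once, so the `.geo-dec` selector shown above is still the more precise way to pull the decimal value.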
