BeautifulSoup: how to get a tag by a given string without a CSS selector

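I am currently learning how to scrape web pages.

Problem: I can't use a CSS selector, because on other sites the position (order) of this tag (the one holding the estimated start date) changes.

My goal: how can I retrieve the information: January 2022

HTML snippet: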

<tr>
    <td headers="studyInfoColTitle">  Estimated <span style="display:inline;" class="term" data-term="Study Start Date" title="Show definition">Study Start Date <i class="fa fa-info-circle term" aria-hidden="true" data-term="Study Start Date" style="border-bottom-style:none;"></i></span> : 
    </td>
    <td headers="studyInfoColData" style="padding-left:1em">January 2022</td>
</tr>

What I have tried:

1.) I tried to declare a function to filter for this tag (in combination with find_all):

def searchMethod(tag):
    return "Estimated" in str(tag.string)

# call find_all with the function above as the filter
foundTag_s = soup.find_all(searchMethod)

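This helped me in other, similar cases, but it doesn't work here. I think it has to do with the way the text is split between the nested tags...

2.) I tried a string search: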

starttime_elem = soup.find("td", string="Estimated")

But for some reason it doesn't work.
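A minimal sketch of why both attempts come up empty, assuming the row above is parsed on its own. The wrapping <table>, the trimmed style attributes and the built-in html.parser are assumptions made for the demo, not part of the original page handling. The label cell has several children, so its .string is None, and string="Estimated" is compared against .string as an exact match:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question, wrapped in <table> for the demo.
html = """
<table><tr>
    <td headers="studyInfoColTitle">  Estimated <span class="term" data-term="Study Start Date">Study Start Date <i class="fa fa-info-circle term" data-term="Study Start Date"></i></span> :
    </td>
    <td headers="studyInfoColData">January 2022</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
label_td = soup.find("td")  # first <td>: the label cell

# The cell holds a text node, a <span>, and another text node,
# so .string (defined only when there is a single child) is None.
string_value = label_td.string
print(string_value)

# get_text() concatenates all descendant text, so the word is visible there.
full_text = label_td.get_text(" ", strip=True)
print("Estimated" in full_text)

# string="Estimated" is matched exactly against .string,
# so no <td> qualifies and find() returns None.
exact_match = soup.find("td", string="Estimated")
print(exact_match)
```

Under these assumptions the three prints show None, True and None, which is consistent with the text being split across nested tags.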

After several hours of searching, I decided to ask here.

Reference: https://clinicaltrials.gov/ct2/show/NCT05169372?draw=2&rank=1

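Answer:

So you are actually looking at different pages within the same domain. The HTML is largely consistent in its elements and attributes.

CSS selectors are more versatile than positional matching alone. There are several ways to solve your current problem.

One is to simply use a CSS attribute = value selector to target the start date node, and then move to the next td: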

import requests
from bs4 import BeautifulSoup as bs

links = [
    'https://clinicaltrials.gov/ct2/show/NCT05169372?draw=2&rank=1',
    'https://clinicaltrials.gov/ct2/show/NCT05169359?draw=2&rank=2',
]

# Reuse one session for all requests.
with requests.Session() as s:
    for link in links:
        r = s.get(link, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        # The data-term attribute identifies the label regardless of the row's position.
        start = soup.select_one('[data-term="Study Start Date"]')

        if start is not None:
            print(start.text)
            # The value sits in the next <td> after the label in document order.
            print(start.find_next('td').text)

The data-term attribute is robust and appears consistently across these pages.
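The same lookup can be tried offline against the row from the question. This is only a sketch: the helper name get_start_date, the SAMPLE_HTML constant, the wrapping <table> and the use of html.parser instead of lxml are assumptions made so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Trimmed offline copy of the row from the question (hypothetical constant).
SAMPLE_HTML = """
<table><tr>
    <td headers="studyInfoColTitle">  Estimated <span class="term" data-term="Study Start Date">Study Start Date <i class="fa fa-info-circle term" data-term="Study Start Date"></i></span> :
    </td>
    <td headers="studyInfoColData">January 2022</td>
</tr></table>
"""


def get_start_date(html):
    """Return the text of the cell after the 'Study Start Date' label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Attribute = value selector: matches the label wherever its row appears.
    label = soup.select_one('[data-term="Study Start Date"]')
    if label is None:
        return None
    # find_next walks forward in document order to the value cell.
    value_cell = label.find_next("td")
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


print(get_start_date(SAMPLE_HTML))
print(get_start_date("<p>no such row</p>"))
```

Returning None when the label or the following cell is missing keeps the loop over many pages from raising on a page that lacks the row.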

You can also use the :-soup-contains pseudo-class:

start = soup.select_one('.term:-soup-contains("Study Start Date")')
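A short sketch of how that selector behaves on the row from the question, assuming a Soup Sieve version recent enough to support :-soup-contains (2.1 or later, as far as I know). The wrapping <table> and html.parser are assumptions for the demo.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question, wrapped in <table> for the demo.
html = """
<table><tr>
    <td headers="studyInfoColTitle">  Estimated <span class="term" data-term="Study Start Date">Study Start Date <i class="fa fa-info-circle term" data-term="Study Start Date"></i></span> :
    </td>
    <td headers="studyInfoColData">January 2022</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match an element with class "term" whose text contains the label.
# The empty <i class="... term"> has no text, so the <span> is selected.
start = soup.select_one('.term:-soup-contains("Study Start Date")')

start_date = None
if start is not None:
    # As before, the value is the next <td> in document order.
    start_date = start.find_next("td").get_text(strip=True)

print(start_date)
```

This variant matches on visible text rather than on an attribute, so it depends on the label wording staying the same across pages.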


Tags: Python, HTML, web-scraping, beautifulsoup
