Extracting a string from an <h1> element with conditional logic

I am trying to scrape some sports game data, but I am running into problems with my code. Eventually I will move this data into a data frame, and then into a database.

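In my code, I located the class of the element that holds one of the headings I want to parse. The HTML I am parsing contains several h1 elements.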

 <div class="type-game">
      <div class="type">NHL Regular Season</div>
      <h1>Blackhawks vs. Ducks</h1>
 </div>

Given this HTML structure, how can I get the h1 back as a string that I can use to populate the data frame?

The code I have tried so far is:

 import requests
 from bs4 import BeautifulSoup as bs

 req = requests.get(url)
 soup = bs(req.text, 'html.parser')

 stype = soup.find('h1', class_='type-game')
 print(stype)

This code returns None. I have looked through other posts here, and nothing has worked so far.

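As the next step of my question: is there a way to write a for loop, or something similar, that goes through all the pages and picks out any game whose heading contains a given string? (The site numbers its pages in event order.)

For example, what if I only want to save games where the h1 inside the div with class="type-game" contains Chicago Blackhawks?

The pseudocode would be something like this: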

 For webpages 1 to 10000:
      if class_='type-game' 'h1' contains "Blackhawks"
           then proceed with parsing the code
      if not, skip the code and go to the next webpage

I know this is a bit open-ended, but I have a solid VBA background, and applying those coding ideas to Python is a challenge.

Answer (uj5u.com user):

Select your elements more specifically, for example with CSS selectors:

soup.select('h1:-soup-contains("Blackhawks")')

or

soup.select('div.type-game h1:-soup-contains("Blackhawks")')
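For context, the lookup in the question most likely returns None because the class type-game sits on the parent div, not on the h1 itself. Below is a minimal sketch of that difference, assuming the HTML looks like the snippet in the question. The variable names are illustrative, and html.parser is used only so the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<div class="type-game">
  <div class="type">NHL Regular Season</div>
  <h1>Blackhawks vs. Ducks</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The class is on the <div>, so filtering <h1> tags by that class matches nothing.
missing = soup.find("h1", class_="type-game")

# Find the container first, then the <h1> inside it.
container = soup.find("div", class_="type-game")
heading = container.find("h1") if container is not None else None

# The CSS-selector form of the same two-step lookup.
same_heading = soup.select_one("div.type-game h1")

# get_text() returns a plain str, which is what a data frame cell needs.
title = heading.get_text(strip=True) if heading is not None else None

print(missing)
print(title)
```

Both find() and select_one() return None when nothing matches, so checking for None before reading the text avoids an AttributeError on pages that lack the expected markup.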

To get the text out of a tag, just use .text or get_text():

for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)

Example

from bs4 import BeautifulSoup

html='''
<div class="type-game">
      <div class="type">NHL Regular Season</div>
      <h1>Blackhawks vs. Ducks</h1>
</div>
<div class="type-game">
      <div class="type">NHL Regular Season</div>
      <h1>Hawks vs. Ducks</h1>
</div>
<div class="type-game">
      <div class="type">NHL Regular Season</div>
      <h1>Ducks vs. Blackhawks</h1>
</div>
'''

soup = BeautifulSoup(html,'lxml')

for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)

Output

Blackhawks vs. Ducks
Ducks vs. Blackhawks

Edit

for e in soup.select('div.type-game h1'):
    # compare against the tag's text; "in e" would test the tag's child nodes instead
    if 'Blackhawks' in e.text:
        print(e.text)  # or do whatever else needs doing
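For the follow-up about looping over numbered pages, here is a rough sketch of how the pseudocode in the question might translate to Python. The helper names (extract_game_title, collect_games, fetch_with_requests) and the https://example.com/games URL pattern are assumptions made for illustration, not the real site's layout. The fetch step is passed in as a function so the parsing logic can be tried on sample HTML without any network access.

```python
from bs4 import BeautifulSoup


def extract_game_title(html, team):
    """Return the game title if the <h1> inside div.type-game mentions team, else None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one("div.type-game h1")
    if heading is None:
        return None  # page lacks the expected markup
    title = heading.get_text(strip=True)
    return title if team in title else None


def collect_games(fetch, page_numbers, team):
    """Visit each page number and keep the titles that mention team.

    fetch is any callable that maps a page number to HTML text, or None on failure.
    """
    rows = []
    for number in page_numbers:
        html = fetch(number)
        if html is None:
            continue  # page could not be fetched, go to the next one
        title = extract_game_title(html, team)
        if title is None:
            continue  # not a matching game, skip it
        rows.append({"page": number, "title": title})
    return rows


def fetch_with_requests(page_number, base_url="https://example.com/games"):
    """Hypothetical fetcher: the URL pattern is a placeholder, not the real site's."""
    import requests  # imported here so the parsing helpers run without it installed

    response = requests.get(f"{base_url}/{page_number}", timeout=10)
    return response.text if response.ok else None
```

A call such as collect_games(fetch_with_requests, range(1, 10001), "Blackhawks") would then return a list of dicts, which pandas.DataFrame accepts directly. For a crawl of that size it is probably worth adding a short delay between requests and handling connection errors, neither of which this sketch does.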
