Python: Extracting data from the HTML <main> tag with BeautifulSoup

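I am currently learning to scrape data with the BeautifulSoup package. Right now I am trying to get the list of movie franchises from the Box Office Mojo website ( https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab ).

The main problem is that I can't seem to access or extract the data inside the <main> tag. Below is the code I am using.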

import requests
from bs4 import BeautifulSoup

listOfFranchiseLink = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"

r = requests.get(listOfFranchiseLink)
soup = BeautifulSoup(r.content, 'html.parser')

s0 = soup.find('div', id='a-page')
s1 = s0.find(id='')
s2 = s1.find('div', id='a-section mojo-body aok-relative')

assert s1 is not None
assert s2 is not None

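Although the script does find something for 's1', it is not what I expected (it should contain, at the top, a div with the class "a-section mojo-body aok-relative"). As a result, I get nothing for 's2'.

My questions are:

1. What exactly am I doing wrong? How do I extract the data inside the <main> tag?
2. Creating a soup object for every layer doesn't feel very efficient. What is the more standard way to extract data buried under several layers of HTML tags?

Edit: I meant to write s0.find('main') rather than s0.find(id=''). The former returns the same result as the latter, though, so it makes no difference.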

Answer from a uj5u.com user:

This happens because s2 really is None, since s1 returns:

<script data-a-state='{"key":"a-wlab-states"}' type="a-state">{}</script>

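So searching inside it for id='a-section mojo-body aok-relative' cannot produce any result, which is why the second assertion fails. Note also that, as the question itself says, "a-section mojo-body aok-relative" is a class on the div, not an id, so an id lookup would not match that div in any case.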
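To illustrate the id-versus-class point, here is a minimal sketch. It runs against a small hand-written HTML snippet (SAMPLE_HTML is a hypothetical stand-in, not the real Box Office Mojo markup, which may be structured differently), so it only demonstrates how find() treats the two keyword arguments.

import-free of network access, the snippet can be run as-is:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may differ.
SAMPLE_HTML = """
<div id="a-page">
  <script type="a-state">{}</script>
  <main>
    <div class="a-section mojo-body aok-relative">
      <p>table would go here</p>
    </div>
  </main>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Looking the class string up as an id matches nothing.
by_id = soup.find("div", id="a-section mojo-body aok-relative")

# class_ matches the class attribute; the full string must appear in this exact order.
by_class = soup.find("div", class_="a-section mojo-body aok-relative")

# A single class name is usually the more robust filter.
by_one_class = soup.find("div", class_="mojo-body")

print(by_id)               # None
print(by_class["class"])   # ['a-section', 'mojo-body', 'aok-relative']
```

If the class order on the real page ever changes, the exact-string lookup stops matching, which is why filtering on one distinctive class is often the safer choice.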

If you want to scrape the table, you can use pandas and requests, like this:

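from io import StringIO

import pandas as pd
import requests

url = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"

# read_html returns one DataFrame per <table> it finds; [0] takes the first.
# Wrapping the text in StringIO avoids the literal-HTML deprecation warning in recent pandas.
df = pd.read_html(StringIO(requests.get(url).text), flavor="lxml")[0]
print(df)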

This gives:

                           Franchise  ... Lifetime Gross
0          Marvel Cinematic Universe  ...   $858,373,000
1                          Star Wars  ...   $936,662,225
2    Disney Live Action Reimaginings  ...   $543,638,043
3                         Spider-Man  ...   $804,789,334
4     J.K. Rowling's Wizarding World  ...   $381,011,219
..                               ...  ...            ...
287                 Ip Man Franchise  ...     $2,679,437
288                   Chal Mera Putt  ...       $644,000
289                           Shiloh  ...     $1,007,822
290                       Evangelion  ...       $174,945
291                            V/H/S  ...       $100,345

[292 rows x 5 columns]
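As for the second question in the post (drilling through several layers of tags), one common approach is to let a single CSS selector describe the whole path instead of chaining find() calls. The sketch below assumes a simplified, hypothetical page structure and placeholder cell values; the selector would need adjusting to the real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with placeholder values; not taken from the real page.
SAMPLE_HTML = """
<div id="a-page">
  <main>
    <div class="a-section mojo-body aok-relative">
      <table>
        <tr><th>Franchise</th><th>Lifetime Gross</th></tr>
        <tr><td>Franchise A</td><td>$100</td></tr>
        <tr><td>Franchise B</td><td>$200</td></tr>
      </table>
    </div>
  </main>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One descendant selector walks every layer; no intermediate objects are needed.
rows = soup.select("main div.mojo-body table tr")

# The first row holds the column names, the remaining rows hold the data.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    dict(zip(header, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
```

Each intermediate find() result is a Tag in the same parsed tree rather than a new soup object, so chaining is not wasteful, but a selector is usually shorter and easier to adjust.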

