CSS selector copied from the browser returns different results with BeautifulSoup4 in Python

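Usually, when I want to scrape a specific piece of text from a website, I right-click the text and choose Inspect. Then, in the HTML, I find the text I'm interested in and right-click -> 'Copy' -> 'Copy selector'.

Then I paste the copied string into soup.select('paste copied selector here') and save the result to a variable. After that I can strip the text to get the key text I need.

For the case I'm working on now, I want to get the total number of cars shown in the h1 heading on this page: cars.com/cars/used/.

Here is my code: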

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.cars.com/used"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36'}

res = requests.get(url,headers = headers)
res.raise_for_status()

soup = bs(res.text, 'html.parser')

total_cars_element = soup.select('body > div.listing > div.container.listing-container.has-header-sticky > div.row.flex-nowrap.no-gutters > div:nth-child(1) > div:nth-child(1) > div')

print(total_cars_element)
# the above prints an empty list.

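I really just want to know why this doesn't work. As I mentioned in the code above, I understand there are other workarounds, but I would really like to stick with the soup.select method.

Any insight is much appreciated. Thank you!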

Answer:

The problem stems from the fact that the HTML fetched by Python is not the same as the HTML the browser ends up with. Try printing soup and see for yourself.
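Rather than reading the whole printout, one way to find the mismatch is to test the copied selector one step at a time. The sketch below is only an illustration: first_failing_step is a hypothetical helper, SERVED_HTML is a made-up stand-in for what requests receives (not the real page), and the split on ">" is naive, so it assumes a plain child-combinator chain like the one Chrome copies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests receives (not the real page).
SERVED_HTML = """
<html><body>
<div class="listing">
  <div class="container listing-container">
    <div class="row flex-nowrap no-gutters">
      <div><div><div>1,234 cars</div></div></div>
    </div>
  </div>
</div>
</body></html>
"""


def first_failing_step(soup, selector):
    """Return the first step of a '>'-chained selector that matches nothing, or None."""
    steps = [step.strip() for step in selector.split(">")]
    for i in range(1, len(steps) + 1):
        # Rebuild the selector up to and including step i, then try it.
        prefix = " > ".join(steps[:i])
        if not soup.select(prefix):
            return steps[i - 1]
    return None


soup = BeautifulSoup(SERVED_HTML, "html.parser")
copied = (
    "body > div.listing > div.container.listing-container.has-header-sticky"
    " > div.row.flex-nowrap.no-gutters > div:nth-child(1)"
)
print(first_failing_step(soup, copied))
```

On this sample the helper reports the div.container step, which is the one to compare against the browser's version of the markup.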

One particular tag that is part of your query is the troublemaker. In the browser it looks like this:

<div class="container listing-container has-header-sticky">

But your Python code sees this instead:

<div class="container listing-container">

Change your selector to:

body > div.listing > div.container.listing-container > div.row.flex-nowrap.no-gutters > div:nth-child(1) > div:nth-child(1) > div

and you will get the expected result.
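The reason one extra class empties the whole result is that a compound selector such as div.a.b.c only matches an element that carries every listed class. A minimal sketch, using a one-tag document invented here to stand in for the served HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document standing in for the HTML served to requests.
soup = BeautifulSoup(
    '<div class="container listing-container">x</div>', "html.parser"
)

# The browser-copied step asks for a class the served tag does not have.
with_extra = soup.select("div.container.listing-container.has-header-sticky")
# Dropping that class leaves only classes the served tag really carries.
without_extra = soup.select("div.container.listing-container")

print(len(with_extra))     # 0
print(len(without_extra))  # 1
```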

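This behavior is expected, because the page you are trying to scrape is dynamic. That means JavaScript adds or removes parts of the original HTML after the page has loaded.

If you want to scrape dynamic web pages with Python, you need more than Beautiful Soup. For more on the topic, see https://scrapingant.com/blog/scrape-dynamic-website-with-python.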

Answer:

@Janez Kuhar gave a great answer. You can also use

total_cars_element = soup.select('h1.title')
print(total_cars_element[0].text)
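Indexing with [0] raises IndexError whenever the selector matches nothing, which is exactly the situation in the question. A slightly more defensive sketch uses select_one, which returns None when there is no match. The heading_text helper and the sample strings are hypothetical, chosen only for illustration:

```python
from bs4 import BeautifulSoup


def heading_text(html):
    """Return the stripped text of the first h1.title, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one("h1.title")
    return heading.get_text(strip=True) if heading is not None else None


# Made-up inputs, not the real cars.com markup.
print(heading_text('<h1 class="title"> Used cars for sale </h1>'))
print(heading_text("<p>no heading here</p>"))
```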

More about CSS selectors
