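How do I extract book titles (strings) from scraped HTML?

I'm fairly new to web scraping, and I'm trying to extract the list of books available at https://www.goodreads.com/choiceawards/best-fiction-books-2020. I want every title in the markup, for example title="The Midnight Library by Matt Haig", but I'm not having much success: all I get back is blank output with no results. Can anyone offer a suggestion? Thanks.

What I have so far: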

from bs4 import BeautifulSoup as bs
import requests

url= "https://www.goodreads.com/choiceawards/best-fiction-books-2020" 
page = requests.get(url) 
soup = bs(page.content, 'html.parser') 

soup.find_all('a', class_='pollAnswer__bookLink')

for book in soup.find_all('a',  {'class':'pollAnswer__bookLink'}) :
    print(book.get_text())

A uj5u.com user replied:

First, iterate over the tags you found and access the img tag inside each one, then use get() to read the value of that tag's attribute.

from bs4 import BeautifulSoup as bs
import requests

url = "https://www.goodreads.com/choiceawards/best-fiction-books-2020"
page = requests.get(url)
soup = bs(page.content, "html.parser")

result = soup.find_all("a", class_="pollAnswer__bookLink")


for book in result:
    print(book.img.get("title"))

Output:

The Midnight Library by Matt Haig
Anxious People by Fredrik Backman
American Dirt by Jeanine Cummins
Such a Fun Age by Kiley Reid
My Dark Vanessa by Kate Elizabeth Russell
The Glass Hotel by Emily St. John Mandel
Transcendent Kingdom by Yaa Gyasi
The Girl with the Louding Voice by Abi Daré
Dear Edward by Ann Napolitano
Big Summer by Jennifer Weiner
Writers & Lovers by Lily King
If I Had Your Face by Frances Cha
A Burning by Megha Majumdar
Luster by Raven Leilani
In an Instant by Suzanne Redfearn
Oona Out of Order by Margarita Montimore
The Death of Vivek Oji by Akwaeke Emezi
Homeland Elegies by Ayad Akhtar
Real Life by Brandon  Taylor
Migrations by Charlotte McConaghy
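
Each line above combines the title and the author. If only the book title is wanted, one possible follow-up step is to split on the last " by ". This is a sketch, not part of the original answer: `split_title_author` is a hypothetical helper, and it assumes every label follows the "Title by Author" pattern shown in the output.

```python
def split_title_author(label):
    """Split a 'Title by Author' label on the last ' by '.

    Returns (title, author); author is None when no ' by ' is present.
    """
    title, sep, author = label.rpartition(" by ")
    if not sep:
        # No separator found: treat the whole label as the title.
        return label.strip(), None
    # Collapse repeated whitespace such as the double space in "Brandon  Taylor".
    return title.strip(), " ".join(author.split())


for label in ["The Midnight Library by Matt Haig", "Real Life by Brandon  Taylor"]:
    print(split_title_author(label))
```

Splitting from the right keeps a title that itself contains " by " intact, at the cost of misreading an author name that contains " by ", which seems the less likely case.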

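A uj5u.com user replied:

What happens?

You are very close to a solution. The main issue is that the text you are looking for is not provided as human-readable text inside the link; it is stored as the value of the title attribute of the <img>.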
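
To make this concrete, here is a minimal sketch on a hand-written snippet. The markup is a simplified assumption modelled on the structure described above (the real page has more attributes and nesting), so no network request is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: an anchor that wraps only an <img>.
html = """
<a class="pollAnswer__bookLink" href="/book/1">
  <img title="The Midnight Library by Matt Haig" src="cover.jpg">
</a>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", class_="pollAnswer__bookLink")

# The anchor contains no text nodes apart from whitespace.
text = link.get_text(strip=True)
# The title lives in an attribute of the nested <img>.
title = link.img["title"]

print(repr(text))
print(title)
```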

How to fix?

Select the <img> inside your loop and read its title attribute, changing:

print(book.get_text())

to

print(book.img['title'])

or

print(book.img.get('title'))

Note: the difference between the two approaches is that .get() returns None if the attribute is missing, whereas selecting it directly with [attr] raises a KeyError.
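
A small sketch of that difference, using a made-up <img> tag that has no title attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <img> deliberately lacks a title attribute.
html = '<a class="pollAnswer__bookLink"><img src="cover.jpg"></a>'
img = BeautifulSoup(html, "html.parser").a.img

missing = img.get("title")               # no error, returns None
fallback = img.get("title", "unknown")   # optional default value

try:
    img["title"]                         # direct indexing
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)
```

.get() is the more forgiving choice when some entries may lack the attribute; direct indexing is useful when a missing attribute should fail loudly.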

Example

from bs4 import BeautifulSoup as bs
import requests

url = "https://www.goodreads.com/choiceawards/best-fiction-books-2020"
page = requests.get(url)
soup = bs(page.content, 'html.parser')

# Each nominee link wraps an <img> whose title attribute holds "Title by Author"
for book in soup.find_all('a', {'class': 'pollAnswer__bookLink'}):
    print(book.img['title'])

Output

The Midnight Library by Matt Haig
Anxious People by Fredrik Backman
American Dirt by Jeanine Cummins
Such a Fun Age by Kiley Reid
My Dark Vanessa by Kate Elizabeth Russell
The Glass Hotel by Emily St. John Mandel
Transcendent Kingdom by Yaa Gyasi
The Girl with the Louding Voice by Abi Daré
Dear Edward by Ann Napolitano
Big Summer by Jennifer Weiner
...

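A uj5u.com user replied:

The idea of web scraping is to pull the data you want directly from a website, over a protocol such as HTTP or HTTPS. In Python there is a library called BeautifulSoup that makes it easy to extract data from a page.

So, first iterate over the tags you found and access the img tag inside each one, then use get() to read the value of that tag's attribute.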
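
The code this answer pointed to is not available here, so the following is only a sketch of the steps it describes, not the answerer's code. It uses a CSS selector as one possible variation, `extract_titles` is a hypothetical helper name, and the sample markup is a simplified assumption about the page structure.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the title attribute of every <img> inside a nominee link."""
    soup = BeautifulSoup(html, "html.parser")
    # "img[title]" matches only images that actually carry a title attribute,
    # so anchors without one are skipped instead of raising an error.
    return [img["title"] for img in soup.select("a.pollAnswer__bookLink img[title]")]


# Hypothetical sample markup for illustration.
sample = """
<a class="pollAnswer__bookLink"><img title="Anxious People by Fredrik Backman"></a>
<a class="pollAnswer__bookLink"><img src="no-title.jpg"></a>
<a class="other"><img title="Not a nominee"></a>
"""

titles = extract_titles(sample)
print(titles)
```

For the live page, the same function could be given `requests.get(url).text`.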

