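I wrote some code to scrape the website https://books.toscrape.com/catalogue/page-1.html but I get an error:
AttributeError: 'NoneType' object has no attribute 'text'
I could not find a solution for this, so how do I fix this error?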
import requests
from bs4 import BeautifulSoup
import pandas as pd

all_books=[]
url='https://books.toscrape.com/catalogue/page-1.html'
headers=('https://developers.whatismybrowser.com/useragents/parse/22526098chrome-windows-blink')

def get_page(url):
    page=requests.get(url,headers)
    status=page.status_code
    soup=BeautifulSoup(page.text,'html.parser')
    return [soup,status]

#get all books links
def get_links(soup):
    links=[]
    listings=soup.find_all(class_='product_pod')
    for listing in listings:
        bk_link=listing.find("h3").a.get("href")
        base_url='https://books.toscrape.com/catalogue/page-1.html'
        cmplt_link=base_url+bk_link
        links.append(cmplt_link)
    return links

#extract info from each link
def extract_info(links):
    for link in links:
        r=requests.get(link).text
        book_soup=BeautifulSoup(r,'html.parser')
        name=book_soup.find(class_='col-sm-6 product_main').text.strip()
        price=book_soup.find(class_='col-sm-6 product_main').text.strip()
        desc=book_soup.find(class_='sub-header').text.strip()
        cat=book_soup.find('"../category/books/poetry_23/index.html">Poetry').text.strip()
        book={'name':name,'price':price,'desc':desc,'cat':cat}
        all_books.append(book)

pg=48
while True:
    url=f'https://books.toscrape.com/catalogue/page-{pg}.html'
    soup_status=get_page(url)
    if soup_status[1]==200:
        print(f"scraping page {pg}")
        extract_info(get_links(soup_status[0]))
        pg+=1
    else:
        print("The End")
        break

df=pd.DataFrame(all_books)
print(df)
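uj5u.com user replied:
Note: First of all, always take a look at your soup; that is the ground truth. Its content can differ slightly from what you see in the browser's developer tools.
What happens?
There are several separate issues to keep in mind:
- base_url='https://books.toscrape.com/catalogue/page-1.html' leads to 404 errors and is the first cause of "'NoneType' object has no attribute 'text'".
- You try to find the category like this: cat=book_soup.find('"../category/books/poetry_23/index.html">Poetry').text.strip(). That will not work and leads to the same error.
- Some of the other selections will not give the expected result either. Take a look at my example; I edited them to give you a clue how to reach your goal.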
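The second failure can be reproduced without touching the network. A minimal sketch, assuming a hand-written breadcrumb fragment (the `html` string is illustrative, not copied from the site): `find()` matches its first positional argument against tag names, so a piece of raw markup matches nothing and the call returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the breadcrumb on a book page.
html = """
<ul class="breadcrumb">
  <li><a href="../../index.html">Home</a></li>
  <li><a href="../category/books_1/index.html">Books</a></li>
  <li><a href="../category/books/poetry_23/index.html">Poetry</a></li>
  <li class="active">A Light in the Attic</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A string passed to find() is compared with tag names, not with raw markup,
# so nothing matches and the result is None.
missing = soup.find('"../category/books/poetry_23/index.html">Poetry')
print(missing)  # None

# Accessing .text on None is what raises the error from the question.
try:
    missing.text
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

Printing the result of each `find()` before chaining `.text` onto it is usually the quickest way to see which lookup came back empty.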
How to fix?
- Change base_url='https://books.toscrape.com/catalogue/page-1.html' to base_url='https://books.toscrape.com/catalogue/'.
- Select the category more specifically; it is the last <a> in the breadcrumb: cat=book_soup.select('.breadcrumb a')[-1].text.strip()
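As an alternative to hard-coding `base_url`, relative links can be resolved against the URL of the page they were found on. A small sketch using the standard library's `urljoin`; the `href` value is an assumed example of what a listing on page 1 returns.

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"
# Assumed example of a relative href taken from a listing on that page.
href = "a-light-in-the-attic_1000/index.html"

# Plain concatenation onto the page URL glues the two paths together,
# which is the URL that answers with 404.
broken = page_url + href

# urljoin resolves the href relative to the page, replacing the last
# path segment (page-1.html) instead of appending to it.
fixed = urljoin(page_url, href)

print(broken)
print(fixed)
```

This keeps working if the listing page moves, because the base is derived from the page URL rather than typed in by hand.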
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

all_books=[]
url='https://books.toscrape.com/catalogue/page-1.html'
# NOTE: passed positionally below, so requests treats it as `params`, not as headers
headers=('https://developers.whatismybrowser.com/useragents/parse/22526098chrome-windows-blink')

def get_page(url):
    page=requests.get(url,headers)
    status=page.status_code
    soup=BeautifulSoup(page.text,'html.parser')
    return [soup,status]

#get all books links
def get_links(soup):
    links=[]
    listings=soup.find_all(class_='product_pod')
    for listing in listings:
        bk_link=listing.find("h3").a.get("href")
        # hrefs are relative to the catalogue directory
        base_url='https://books.toscrape.com/catalogue/'
        cmplt_link=base_url+bk_link
        links.append(cmplt_link)
    return links

#extract info from each link
def extract_info(links):
    for link in links:
        r=requests.get(link).text
        book_soup=BeautifulSoup(r,'html.parser')
        # guard each lookup so a missing element yields None instead of raising
        name= name.text.strip() if (name := book_soup.h1) else None
        price= price.text.strip() if (price := book_soup.select_one('h1 + p')) else None
        desc= desc.text.strip() if (desc := book_soup.select_one('#product_description + p')) else None
        # the category is the last link in the breadcrumb
        cat= cat.text.strip() if (cat := book_soup.select('.breadcrumb a')[-1]) else None
        book={'name':name,'price':price,'desc':desc,'cat':cat}
        all_books.append(book)

pg=48
while True:
    url=f'https://books.toscrape.com/catalogue/page-{pg}.html'
    soup_status=get_page(url)
    if soup_status[1]==200:
        print(f"scraping page {pg}")
        extract_info(get_links(soup_status[0]))
        pg+=1
    else:
        print("The End")
        break

all_books
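One detail the example keeps from the question: `headers` is a plain string passed as the second positional argument, and `requests.get` takes that slot as `params`, so no header is actually sent. A hedged sketch of passing a real header instead; the User-Agent value is a placeholder, and the request is only prepared here, not sent.

```python
import requests

url = "https://books.toscrape.com/catalogue/page-1.html"
# Placeholder value; substitute whatever User-Agent string you want to send.
headers = {"User-Agent": "Mozilla/5.0 (example)"}

# Build the request without sending it, to inspect what would go out.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])

# For comparison: a bare string in the params slot ends up in the query string.
as_params = requests.Request("GET", url, params="not-a-header").prepare()
print(as_params.url)
```

For a live request the equivalent call would be `requests.get(url, headers=headers)`, with `headers` given by keyword.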
uj5u.com user replied:
When you need to get the text of an element, use the function below.
It protects you from None elements.
def get_text(book_soup,clazz):
    ele = book_soup.find(class_=clazz)
    return ele.text.strip() if ele is not None else ''
Example. Instead of
name=book_soup.find(class_='col-sm-6 product_main').text.strip()
use
name=get_text(book_soup,'col-sm-6 product_main')
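The same guard also works with CSS selectors, which can be more precise than a class string. A sketch with a hypothetical helper `get_text_css` (not part of BeautifulSoup) and a made-up HTML fragment:

```python
from bs4 import BeautifulSoup

def get_text_css(soup, selector, default=""):
    """Return the stripped text of the first match, or `default` if none."""
    ele = soup.select_one(selector)
    return ele.get_text(strip=True) if ele is not None else default

# Made-up fragment for illustration only.
html = '<div class="product_main"><h1> A Light in the Attic </h1></div>'
soup = BeautifulSoup(html, "html.parser")

title = get_text_css(soup, "h1")
price = get_text_css(soup, "p.price_color")  # no such element in this fragment
print(repr(title), repr(price))
```

Returning a default instead of raising keeps the loop going when one page is missing a field; whether an empty string or `None` is the better default depends on how the DataFrame is used afterwards.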
