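Python beginner here. I am using BeautifulSoup to scrape the details (title, number in stock) of every book on the first page of books.toscrape.com. To do that, I first need the link to each individual book, and I wrote page1_url for this purpose. The problem is that when the function returns the list of extracted links, it returns only the first element. Please help me identify the error, or suggest alternative code that uses only BeautifulSoup. Thanks in advance!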
import requests
from bs4 import BeautifulSoup

def page1_url(page1):
    response= requests.get(page1)
    data= BeautifulSoup(response.text,'html.parser')
    b1= data.find_all('h3')
    for i in b1:
        l=i.find_all('a')
        for j in l:
            l1=j['href']
            books_urls=[]
            books_urls.append(base_url + l1)
            books_urls=list(books_urls)
            return books_urls

allPages = ['http://books.toscrape.com/catalogue/page-1.html',
            'http://books.toscrape.com/catalogue/page-2.html']
base_url= 'http://books.toscrape.com/catalogue/'
bookURLs= page1_url(allPages[0])
print(bookURLs)
A uj5u.com community member replied:
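You are re-creating the books_urls list for every link, and the return statement sits inside the for j in l loop, so the function exits as soon as the first element has been added. Create the list once before the loops and return it only after they have finished.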
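To see the two mistakes in isolation, here is a minimal sketch that needs no network access. The function names collect_buggy and collect_fixed and the sample file names are made up for illustration; they are not part of the original code.

```python
def collect_buggy(items):
    for item in items:
        collected = []           # reset on every iteration
        collected.append(item)
        return collected         # exits during the first iteration


def collect_fixed(items):
    collected = []               # created once, before the loop
    for item in items:
        collected.append(item)
    return collected             # runs only after the loop has finished


sample = ["a.html", "b.html", "c.html"]
print(collect_buggy(sample))   # ['a.html']
print(collect_fixed(sample))   # ['a.html', 'b.html', 'c.html']
```

The same two changes apply to the original function.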
import requests
from bs4 import BeautifulSoup

def page1_url(page1):
    response= requests.get(page1)
    data= BeautifulSoup(response.text,'html.parser')
    b1= data.find_all('h3')
    # you were re-creating this list for each link
    books_urls = []
    for i in b1:
        l=i.find_all('a')
        for j in l:
            l1=j['href']
            books_urls.append(base_url + l1)
    # these lines had too many indents
    books_urls=list(books_urls)
    return books_urls

allPages = ['http://books.toscrape.com/catalogue/page-1.html',
            'http://books.toscrape.com/catalogue/page-2.html']
base_url= 'http://books.toscrape.com/catalogue/'
bookURLs= page1_url(allPages[0])
print(bookURLs)
['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', 'http://books.toscrape.com/catalogue/soumission_998/index.html', 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html', ... 'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html']
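Since the question also asks for an alternative, here is one possible sketch. It swaps the nested find_all loops for a CSS selector and builds absolute URLs with urljoin from the standard library instead of string concatenation. The function name extract_book_urls and the sample HTML are assumptions made for illustration; the snippet is hand-written and is not the site's real markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_book_urls(html, page_url):
    """Return absolute URLs for every link found inside an <h3> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # "h3 a[href]" matches <a> tags that have an href and sit inside an <h3>.
    # urljoin resolves each relative href against the URL of the listing page.
    return [urljoin(page_url, a["href"]) for a in soup.select("h3 a[href]")]


# Offline check with a hand-written snippet (hypothetical markup).
sample_html = """
<h3><a href="a-light-in-the-attic_1000/index.html">A Light in the Attic</a></h3>
<h3><a href="tipping-the-velvet_999/index.html">Tipping the Velvet</a></h3>
"""
page = "http://books.toscrape.com/catalogue/page-1.html"
print(extract_book_urls(sample_html, page))
```

With the setup from the question, this would be called as extract_book_urls(response.text, page1) after the requests.get call. Keeping the parsing separate from the download makes the function easy to check against a small HTML string.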
Tags: python-3.x web-scraping beautifulsoup
