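I'm trying to scrape the website https://lt.brcauto.eu/ and need to collect at least 50 cars from it. From the home page I go to the car search page and start scraping everything from the beginning. However, one page holds only 21 cars, so when the cars on a page run out and the parser should move on to the next page, I get the error list index out of range. This is how I'm trying to scrape: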
import json
import requests
from bs4 import BeautifulSoup
mainURL = 'https://lt.brcauto.eu/'
req1 = requests.get(mainURL)
soup1 = BeautifulSoup(req1.text, 'lxml')
link = soup1.find('div', class_ = 'home-nav flex flex-wrap')
temp = link.findAll("a") # find search link
URL = (temp[1].get('href') + '/')
req2 = requests.get(URL)
soup2 = BeautifulSoup(req2.text, 'lxml')
page = soup2.find_all('li', class_ = 'page-item')[-2] # search pages till max ">"
cars_printed_counter = 0
for number in range(1, int(page.text)): # from 1 until max page
    req2 = requests.get(URL + '?page=' + str(number)) # page url
    soup2 = BeautifulSoup(req2.text, 'lxml')
    if cars_printed_counter == 50:
        break # for faster execution
out = [] # holding all cars
for single_car in soup2.find_all('div', class_ = 'cars-wrapper'):
    if cars_printed_counter == 50:
        break # after 50 cars
    Car_Title = single_car.find('h2', class_ = 'cars__title')
    Car_Specs = single_car.find('p', class_ = 'cars__subtitle')
    #print('\nCar number:', cars_printed_counter + 1)
    #print(Car_Title.text)
    #print(Car_Specs.text)
    car = {}
    spl = Car_Specs.text.split(' | ')
    car["fuel"] = spl[1].split(" ")[1]
    car["Title"] = str(Car_Title.text)
    car["Year"] = int(spl[0])
    car["run"] = int(spl[3].split(" ")[0])
    car["type"] = spl[5]
    car["number"] = cars_printed_counter + 1
    out.append(car)
    cars_printed_counter += 1
print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
I noticed that if I only print the cars like this:
for single_car in soup.find_all('div', class_ = 'cars-wrapper'):
    if cars_printed_counter == 50:
        break
    Car_Title = single_car.find('h2', class_ = 'cars__title')
    Car_Specs = single_car.find('p', class_ = 'cars__subtitle')
    Car_Price = single_car.find('div', class_ = 'w-full lg:w-auto cars-price text-right pt-1')
    print('\nCar number:', cars_printed_counter + 1)
    print(Car_Title.text)
    print(Car_Specs.text)
    print(Car_Price.text)
    cars_printed_counter += 1
everything works. But as soon as I try to write them out in JSON format like this:
    car = {}
    spl = Car_Specs.text.split(' | ')
    car["fuel"] = spl[1].split(" ")[1]
    car["Title"] = str(Car_Title.text)
    car["Year"] = int(spl[0])
    car["run"] = int(spl[3].split(" ")[0])
    car["type"] = spl[5]
    car["number"] = cars_printed_counter + 1
    out.append(car)
    cars_printed_counter += 1
print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
I get the list index out of range error.
P.S. Or should I already be using multithreading here?
A uj5u.com user replied:
This solution worked for me:
    car = {}
    spl = Car_Specs.text.split(' | ')
    if spl[1].split(" ")[0] == 'Elektra': # break on electric cars
        break
    car["fuel"] = spl[1].split(" ")[1]
    car["Title"] = str(Car_Title.text)
    car["Year"] = int(spl[0])
    car["run"] = int(spl[3].split(" ")[0])
    car["type"] = spl[5]
    car["number"] = cars_printed_counter + 1
    out.append(car)
    cars_printed_counter += 1
print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
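So I added:
    if spl[1].split(" ")[0] == 'Elektra':
        break
because the second element of the split is the fuel field, which also contains the engine size in litres. When the scraper reaches an electric car, the dict cannot be filled, because electric cars have no litre value and index [1] does not exist. For those cars, [0] is the fuel type.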
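One thing to watch with this fix: break ends the loop at the first electric car, so any cars after it on the same page are never scraped, whereas continue only skips that one car. Below is a minimal sketch of the difference. The rows list and the fuels helper are made up for illustration, and the two-token fuel strings are an assumption about what the site returns; only the 'Elektra' shape comes from this thread.

```python
# Hypothetical spec lists; only the single-token 'Elektra' entry is from the thread.
rows = [
    ["2015", "2.0 Dyzelinas"],
    ["2013", "Elektra"],
    ["2018", "1.6 Benzinas"],
]

def fuels(rows, skip_with_continue):
    """Collect fuel names, either skipping electric cars or stopping at the first one."""
    found = []
    for spl in rows:
        parts = spl[1].split(" ")
        if parts[0] == "Elektra":
            if skip_with_continue:
                continue  # skip only this car and keep going
            break  # stop the whole loop, later cars are lost
        found.append(parts[1])
    return found

with_break = fuels(rows, skip_with_continue=False)
with_continue = fuels(rows, skip_with_continue=True)
print(with_break)
print(with_continue)
```

With break the third car is dropped; with continue it is kept. Either way the electric car itself is left out of the result.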
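A uj5u.com user replied:
First of all, put the idea of multithreading aside for now. Your code has other problems:
As mentioned before, check the indentation of the code in your question. At the moment it makes no sense, because you iterate over all the pages but only scrape the last one.
What causes the error
IndexError: list index out of range
Print your spl and you will see the problem. This car does not run on a combustion engine:
['2013', 'Elektra', 'Automatinė', '108030 km', '310 kW (422 AG)', 'Mėlyna']
Picking the index the way you do, car["fuel"] = spl[1].split(" ")[1], raises the error. Do this instead (take the last element of the list):
car["fuel"] = spl[1].split(" ")[-1]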
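To see the difference without hitting the site, the printed list can be pasted into a short script. This is only a sketch of the indexing behaviour; the variable names tokens and error_message are made up for the demonstration.

```python
# The list printed above for the electric car.
spl = ['2013', 'Elektra', 'Automatinė', '108030 km', '310 kW (422 AG)', 'Mėlyna']

tokens = spl[1].split(" ")  # a single token, so index 1 does not exist

error_message = None
try:
    fuel = tokens[1]
except IndexError as exc:
    error_message = str(exc)

# Index -1 always refers to the last token, however many there are.
fuel = tokens[-1]

print(error_message)
print(fuel)
```

Because -1 counts from the end, the same line works for a one-token field and for a longer one, as long as the fuel name is the last token.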
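Example
Your indentation should look more like this, so that you iterate over all the pages and store the car information in out, which is defined outside all the loops: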
...
cars_printed_counter = 0
out = [] # holding all cars
for number in range(1, int(page.text)): # from 1 until max page
    req2 = requests.get(URL + '?page=' + str(number)) # page url
    soup2 = BeautifulSoup(req2.text, 'lxml')
    if cars_printed_counter == 50:
        break # for faster execution
    for single_car in soup2.find_all('div', class_ = 'cars-wrapper'):
        if cars_printed_counter == 50:
            break # after 50 cars
        Car_Title = single_car.find('h2', class_ = 'cars__title')
        Car_Specs = single_car.find('p', class_ = 'cars__subtitle')
        car = {}
        spl = Car_Specs.text.split(' | ')
        print(spl)
        car["fuel"] = spl[1].split(" ")[-1]
        car["Title"] = str(Car_Title.text)
        car["Year"] = int(spl[0])
        car["run"] = int(spl[3].split(" ")[0])
        car["type"] = spl[5]
        car["number"] = cars_printed_counter + 1
        out.append(car)
        cars_printed_counter += 1
# print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
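If other listings turn out to have missing or differently shaped fields, the parsing could be moved into a small function that returns None instead of raising. The following is only a sketch under that assumption: parse_car is a hypothetical helper, not part of the original answer, and the six-field layout is taken from the single list printed above.

```python
import json

def parse_car(title, specs_text, number):
    """Return a dict for one car, or None if the specs line has an unexpected shape."""
    spl = [part.strip() for part in specs_text.split("|")]
    if len(spl) < 6:
        return None  # not enough fields to index safely
    try:
        year = int(spl[0])
        run = int(spl[3].split(" ")[0])
    except ValueError:
        return None  # year or mileage is not a plain number
    return {
        "fuel": spl[1].split(" ")[-1],  # last token, as in the answer above
        "Title": title.strip(),
        "Year": year,
        "run": run,
        "type": spl[5],
        "number": number,
    }

# Specs line rebuilt from the list printed earlier in the thread.
sample = "2013 | Elektra | Automatinė | 108030 km | 310 kW (422 AG) | Mėlyna"
car = parse_car("Example title", sample, 1)
broken = parse_car("Example title", "2013 | Elektra", 2)
print(json.dumps(car, ensure_ascii=False))
print(broken)
```

Inside the scraping loop you would call it with Car_Title.text and Car_Specs.text, and only append to out and increase the counter when the result is not None.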
