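I want to scrape https://www.airport-data.com/manuf/Reims.html, iterate over all of its pages, and export the results to AircraftListing.csv.
The code runs without errors, but the results are populated incorrectly, and not all records from the web pages end up in the .csv file.
How can I export all Reims aircraft records to AircraftListing.csv?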
    import requests
    from bs4 import BeautifulSoup
    import csv

    root_url = "https://www.airport-data.com/manuf/Reims.html"
    html = requests.get(root_url)
    soup = BeautifulSoup(html.text, 'html.parser')
    paging = soup.find("table",{"class":"table table-bordered table-condensed"}).find_all("td")
    start_page = paging[1].text
    last_page = paging[len(paging)-2].text

    outfile = open('AircraftListing.csv','w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(["Tail_Number", "Year_Maker_Model", "C_N","Engines", "Seats", "Location"])

    pages = list(range(1,int(last_page)+1))
    for page in pages:
        url = 'https://www.airport-data.com/manuf/Reims:%s.html' %(page)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')
        print('https://www.airport-data.com/manuf/Reims:%s' %(page))
        product_name_list = soup.find("table",{"class":"table table-bordered table-condensed"}).find_all("td")
        # Each row has 6 elements in it.
        # Loop through every sixth element. (The first element of each row)
        # Get all the other elements in the row by adding to index of the first.
        for i in range(int(len(product_name_list)/6)):
            Tail_Number = product_name_list[(i*6)].get_text('td')
            Year_Maker_Model = product_name_list[(i*6)+1].get_text('td')
            C_N = product_name_list[(i*6)+2].get_text('td')
            Engines = product_name_list[(i*6)+3].get_text('td')
            Seats = product_name_list[(i*6)+4].get_text('td')
            Location = product_name_list[(i*6)+5].get_text('td')
            writer.writerow([Tail_Number, Year_Maker_Model, C_N, Engines, Seats, Location])

    outfile.close()
    print('Done')
Answer from a uj5u.com user:
There are better ways to do this, but for lines 32-40 use:
    # Each row has 6 elements in it.
    # Loop through every sixth element. (The first element of each row)
    # Get all the other elements in the row by adding to index of the first.
    for i in range(int(len(product_name_list)/6)):
        Tail_Number = product_name_list[(i*6)].get_text('td')
        Year_Maker_Model = product_name_list[(i*6)+1].get_text('td')
        C_N = product_name_list[(i*6)+2].get_text('td')
        Engines = product_name_list[(i*6)+3].get_text('td')
        Seats = product_name_list[(i*6)+4].get_text('td')
        Location = product_name_list[(i*6)+5].get_text('td')
        writer.writerow([Tail_Number, Year_Maker_Model, C_N, Engines, Seats, Location])
The comments explain what is happening.
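The index arithmetic above can also be written as slicing a flat list into fixed-width chunks. The following is only a minimal sketch of that idea: `chunk_rows` is a hypothetical helper, and the cell values are plain strings standing in for the text of the `<td>` elements.

```python
def chunk_rows(cells, width=6):
    """Group a flat list of cell values into rows of `width` items."""
    # Ignore a trailing partial row so every returned row has `width` items.
    usable = len(cells) - len(cells) % width
    return [cells[i:i + width] for i in range(0, usable, width)]


# Sample cell texts standing in for the <td> contents of two table rows.
cells = [
    "0008", "1987 Reims F406 Caravan II", "F406-0008", "2", "14.0", "France",
    "0010", "1987 Reims F406 Caravan II", "F406-0010", "2", "12.0", "France",
]

rows = chunk_rows(cells)
for row in rows:
    print(row)
```

Each element of `rows` can then be passed directly to `writer.writerow`. This only works as long as every table row really has six cells, which is an assumption about the page.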
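Answer from a uj5u.com user:
To improve your code, especially the part with the for loop, try to be more specific in what you select. Instead of selecting every <td>, select the <tr> rows. That minimizes the work you have to do while iterating and is more generic.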
    for row in soup.select('table tbody tr'):
        writer.writerow([c.text if c.text else '' for c in row.select('td')])
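To see what the row-based selection produces without touching the network, here is a small sketch that runs the same selectors against a made-up HTML fragment. The markup in `SAMPLE_HTML` is an assumption for illustration only; the real page may be structured differently (for example, it may lack an explicit `<tbody>`).

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up fragment for illustration; not copied from the real site.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Tail Number</th><th>C/N</th><th>Location</th></tr></thead>
  <tbody>
    <tr><td>0008</td><td>F406-0008</td><td>France</td></tr>
    <tr><td>13701</td><td>0002</td><td>Portugal</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One list per <tr>, one string per <td>; the header row in <thead> is skipped.
rows = [
    [td.get_text(strip=True) for td in tr.select("td")]
    for tr in soup.select("table tbody tr")
]

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because each `<tr>` becomes one list, the number of columns no longer has to be hard-coded.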
Example

    import csv

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.airport-data.com/manuf/Reims.html'

    # newline='' keeps the csv module from writing blank lines between rows on Windows
    with open('AircraftListing.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["Tail_Number", "Year_Maker_Model", "C_N","Engines", "Seats", "Location"])

        while True:
            html = requests.get(url)
            soup = BeautifulSoup(html.text, 'html.parser')

            # One CSV row per table row, one field per cell
            for row in soup.select('table tbody tr'):
                writer.writerow([c.text if c.text else '' for c in row.select('td')])

            # Follow the pager entry right after the active one; stop on the last page
            next_link = soup.select_one('li.active + li a')
            if next_link:
                url = next_link['href']
            else:
                break
Output
Tail Number,Year Maker Model,C/N,Engines,Seats,Location
0008,1987 Reims F406 Caravan II,F406-0008,2,14.0,France
0010,1987 Reims F406 Caravan II,F406-0010,2,12.0,France
13701,0000 Reims FTB337G,0002,2,4.0,Portugal
13705,0000 Reims FTB337G,0016,2,4.0,Portugal
13710,0000 Reims FTB337G,0011,2,4.0,Portugal
...,...,...,...,...,...
ZS-OHP,0000 Reims FR172J Reims Rocket,0496,1,4.0,South Africa
ZS-OTT,1989 Reims F406 Caravan II,F406-0040,2,12.0,South Africa
ZS-OXS,0000 Reims FR172J Reims Rocket,0418,1,4.0,South Africa
ZS-SSC,1988 Reims BPSW,F406-0032,2,12.0,South Africa
ZS-SSE,1990 Reims F406 Caravan II,F406-0043,2,12.0,South Africa
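The loop in the example stops when no pager entry follows the active one. A rough sketch of that check in isolation is shown below, using the adjacent-sibling selector `li.active + li a[href]`. The pager markup and the `next_page_url` helper are hypothetical, and `urljoin` is added only as a precaution in case the site returns relative links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the page after the active one, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # "+" matches the <li> that immediately follows the active <li>.
    link = soup.select_one("li.active + li a[href]")
    return urljoin(current_url, link["href"]) if link else None


# Made-up pager markup; the real site's markup may differ.
PAGER_MIDDLE = """
<ul class="pagination">
  <li><a href="/manuf/Reims.html">1</a></li>
  <li class="active"><a href="/manuf/Reims:2.html">2</a></li>
  <li><a href="/manuf/Reims:3.html">3</a></li>
</ul>
"""
PAGER_LAST = """
<ul class="pagination">
  <li><a href="/manuf/Reims:2.html">2</a></li>
  <li class="active"><a href="/manuf/Reims:3.html">3</a></li>
</ul>
"""

base = "https://www.airport-data.com/manuf/Reims:2.html"
print(next_page_url(PAGER_MIDDLE, base))
print(next_page_url(PAGER_LAST, base))
```

On the last page the selector matches nothing, so the function returns `None` and a scraping loop built on it would end.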
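Alternative using pandas
Another way to iterate over all 51 pages is to use pandas.read_html to read the table on each page, append the resulting DataFrames to a list, concat() the DataFrames from all pages, and save the result as a CSV file containing all 5085 records.
Example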
    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.airport-data.com/manuf/Reims.html'
    data = []

    while True:
        #print(url)
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')
        # Wrap the markup in StringIO: newer pandas versions deprecate passing literal HTML strings
        data.append(pd.read_html(StringIO(soup.select_one('table').prettify()))[0])

        # Follow the pager entry right after the active one; stop on the last page
        next_link = soup.select_one('li.active + li a[href]')
        if next_link:
            url = next_link['href']
        else:
            break

    df = pd.concat(data)
    df.to_csv('AircraftListing.csv',index=False)
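The final concatenate-and-save step can be tried on its own with hand-made frames. This is only a sketch: `page_1` and `page_2` are invented stand-ins for the per-page tables, and `ignore_index=True` is an optional choice that renumbers the combined rows instead of repeating each page's index.

```python
import io

import pandas as pd

# Invented per-page frames standing in for the tables read from each page.
page_1 = pd.DataFrame({"Tail Number": ["0008", "0010"], "Location": ["France", "France"]})
page_2 = pd.DataFrame({"Tail Number": ["13701"], "Location": ["Portugal"]})

# Stack the frames vertically and renumber the rows 0..n-1.
df = pd.concat([page_1, page_2], ignore_index=True)

# Write to an in-memory buffer instead of AircraftListing.csv for this sketch.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Checking `len(df)` against the expected record count is a quick way to confirm that no page was skipped.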
