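I am scraping links from multiple pages across several searches, and I want to write the scraped results to multiple .csv files. The table below shows the .csv file that lists my source URLs and the desired output file names: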
| url | outputfile |
|---|---|
| https://www.marketresearch.com/search/results.asp?categoryid=230&qtype=2&publisher=IDCs&datepub=0&submit2=Search | outputPS1xIDC.csv |
| https://www.marketresearch.com/search/results.asp?categoryid=90&qtype=2&publisher=IDC&datepub=0&submit2=Search | outputPS2xIDC.csv |
| https://www.marketresearch.com/search/results.asp?categoryid=233&qtype=2&publisher=IDC&datepub=0&submit2=Search | outputPS3xIDC.csv |
| https://www.marketresearch.com/search/results.asp?categoryid=169&qtype=2&publisher=IDC&datepub=0&submit2=Search | outputPS4xIDC.csv |
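With the code below I manage to read the URLs in order, and the rest of the code also works fine (when I specify the output file name directly). However, it only outputs the last of the 4 pages in the list, so the result is overwritten each time. What I actually want is to write the results from the first URL to the first output file, the second to the second, and so on (my real list of source URLs is, of course, much longer than these 4).
Please help, especially with the last line, since just writing [outputs] clearly does not work.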
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

with open('inputs.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    urls = [row["url"] for row in reader]
    outputs = [row["outputfile"] for row in reader]

data = []

for url in urls:

    def scrape_it(url):
        page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
        soup = BeautifulSoup(page.text, 'html.parser')
        nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
        stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
        reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})
        for report in reports:
            data.append({
                'title': report.find('a', class_='linkTitle').text,
                'price': report.find('div', class_='resultPrice').text,
                'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
                'detail_link': report.a['href']
            })
        if 'next' not in stri:
            print("All pages completed")
        else:
            scrape_it(nexturl)

    scrape_it(url)

myOutput = pd.DataFrame(data)
myOutput.to_csv([outputs], header=False) #works (but only for the last url) if instead of [outputs] I have f'filename.csv'
Answer:
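I don't have Pandas, and I'd rather not run your inputs, but looking at your code, a couple of things jump out at me:
- It looks like you are not iterating over `url` and `output` together. You seem to loop over all the URLs and then write once, after all of those loops.
- Likewise, `data` just keeps appending HTML table data; it is never reset for each individual URL.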
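The two points above can be illustrated without touching the network. This is only a minimal sketch: `fake_scrape` is a hypothetical stand-in for the real scraper, and the URLs and file names are made up for illustration.

```python
def fake_scrape(url, data):
    """Hypothetical stand-in for the real scraper: appends one row per call."""
    data.append({"source": url})
    return data


# Made-up inputs, for illustration only
urls = ["https://example.com/a", "https://example.com/b"]
outputs = ["a.csv", "b.csv"]

# Problem: one shared list keeps growing across URLs
shared = []
for url in urls:
    fake_scrape(url, shared)
shared_count = len(shared)  # rows from both URLs end up in the same list

# Fix: walk url and output in lockstep, starting from an empty list each time
per_file = {}
for url, output in zip(urls, outputs):
    per_file[output] = fake_scrape(url, [])
```

With a shared list, the rows from every URL land in one place; pairing each URL with its own output name and a fresh list keeps them separate.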
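Without being able to run it, I would suggest something like this. The scraping is fully encapsulated and separated from the loop, so the flow of inputs and outputs is now easier to see: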
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd


def scrape_it(url, data):
    """Append the rows of one results page to data, then follow the 'next' link if present."""
    page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
    soup = BeautifulSoup(page.text, 'html.parser')
    nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
    stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
    reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})
    for report in reports:
        data.append({
            'title': report.find('a', class_='linkTitle').text,
            'price': report.find('div', class_='resultPrice').text,
            'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
            'detail_link': report.a['href']
        })
    if 'next' in stri:
        data = scrape_it(nexturl, data)
    return data


with open('inputs.csv', newline='') as csvfile:
    # DictReader is a one-pass iterator, so read the rows once and reuse them
    rows = list(csv.DictReader(csvfile))

urls = [row["url"] for row in rows]
outputs = [row["outputfile"] for row in rows]

for (url, output) in zip(urls, outputs):  # work on url and output together
    data = scrape_it(url, [])  # start from an empty list for every URL
    myOutput = pd.DataFrame(data)
    myOutput.to_csv(output, header=False)
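One detail worth checking when pulling two columns out of the same `csv.DictReader`: the reader is a one-pass iterator, so a second list comprehension over it comes back empty. Here is a small sketch of that behaviour, using an in-memory CSV with made-up contents in place of `inputs.csv`:

```python
import csv
import io

# In-memory stand-in for inputs.csv (contents are made up for illustration)
sample = (
    "url,outputfile\n"
    "https://example.com/a,a.csv\n"
    "https://example.com/b,b.csv\n"
)

# Two passes over the same reader: the second pass finds nothing left to read
reader = csv.DictReader(io.StringIO(sample))
urls = [row["url"] for row in reader]
outputs_empty = [row["outputfile"] for row in reader]

# Materialize the rows once, then take both columns from the list
rows = list(csv.DictReader(io.StringIO(sample)))
pairs = [(row["url"], row["outputfile"]) for row in rows]
```

If the output names come back empty, `zip(urls, outputs)` loops zero times without raising an error, so reading the rows into a list first is the safer pattern.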
