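Complete beginner here; please help. I have this code, and it worked when I wasn't trying to output to a .csv and had a print command there instead, so I didn't have the last two lines or anything related to the variable "data". By "worked" I mean it printed the data from all 18 pages.
Now it outputs the data to a .csv, but only from the first page (url).
I can see that at the end I'm not passing nexturl to pandas, because I don't know how to do that. Any help is much appreciated.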
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.marketresearch.com/search/results.asp?qtype=2&datepub=3&publisher=Technavio&categoryid=0&sortby=r'

def scrape_it(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
    stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
    reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})
    data = []
    for report in reports:
        data.append({
            'title': report.find('a', class_='linkTitle').text,
            'price': report.find('div', class_='resultPrice').text,
            'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
            'detail_link': report.a['href']
        })
    if 'next' not in stri:
        print("All pages completed")
    else:
        scrape_it(nexturl)
    return data

myOutput = pd.DataFrame(scrape_it(url))
myOutput.to_csv(f'results-tec6.csv', header=False)
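A uj5u.com user replied:
Make data global so that every call keeps appending to the same list instead of creating a new one. In your version, each call to scrape_it builds its own data list and the result of the recursive call is thrown away, so only the first page's rows are returned. Then call the recursive function on its own, outside the DataFrame() call, and pass data to pandas once it has finished.
Finally, you can pass a cookie to get the maximum possible number of results per request, which reduces the number of requests.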
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.marketresearch.com/search/results.asp?qtype=2&datepub=3&publisher=Technavio&categoryid=0&sortby=r&page=1'

# Module-level list shared by every recursive call
data = []

def scrape_it(url):
    # The cookie asks the site for 100 results per page, so fewer requests are needed
    page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
    soup = BeautifulSoup(page.text, 'html.parser')
    # The last link with this class is the "next" link on every page but the last
    last_link = soup.find_all(class_="standardLinkDkBlue")[-1]
    nexturl = last_link['href']
    stri = last_link.string
    reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})
    for report in reports:
        data.append({
            'title': report.find('a', class_='linkTitle').text,
            'price': report.find('div', class_='resultPrice').text,
            'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
            'detail_link': report.a['href']
        })
    if 'next' not in stri:
        print("All pages completed")
    else:
        scrape_it(nexturl)

# Fill data first, then build the DataFrame from the complete list
scrape_it(url)
myOutput = pd.DataFrame(data)
myOutput.to_csv('results-tec6.csv', header=False)
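If you would rather avoid the global list, one possible alternative is to follow the "next" links with a loop and return the accumulated rows. The sketch below is only an illustration of that idea, not part of the code above: collect_pages, parse_page, max_pages and the FAKE_SITE data are hypothetical names made up for this example, and the fake pages stand in for the real site so the snippet runs without network access.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, str]
# Hypothetical contract: a page parser takes a URL and returns
# (rows found on that page, URL of the next page or None on the last page).
PageParser = Callable[[str], Tuple[List[Row], Optional[str]]]


def collect_pages(start_url: str, parse_page: PageParser, max_pages: int = 100) -> List[Row]:
    """Follow "next" links with a loop and return the rows from every page."""
    rows: List[Row] = []
    seen = set()  # guards against a "next" link that points back to an earlier page
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_rows, url = parse_page(url)
        rows.extend(page_rows)  # one list accumulates the rows of all pages
    return rows


# Stand-in for the real site (assumption): three fake pages linked by "next" URLs.
FAKE_SITE = {
    "page1": ([{"title": "A"}, {"title": "B"}], "page2"),
    "page2": ([{"title": "C"}], "page3"),
    "page3": ([{"title": "D"}], None),
}


def fake_parse_page(url: str) -> Tuple[List[Row], Optional[str]]:
    return FAKE_SITE[url]


all_rows = collect_pages("page1", fake_parse_page)
print([row["title"] for row in all_rows])  # ['A', 'B', 'C', 'D']
```

In the real scraper, parse_page would do the requests.get and BeautifulSoup work from scrape_it, return the list of row dicts for that page, and return None instead of nexturl when 'next' is not in stri. The result of collect_pages(url, parse_page) could then be passed straight to pd.DataFrame().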
