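I want to scrape a table from the URLs below. The scraping works, but the problem is that it only shows the information from the first URL. How can I fix my code so that it also adds the information from the second URL? I hope my question is clear.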
import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['https://www.funda.nl/en/koop/ridderkerk/huis-42649106-natalstraat-15/', 'https://www.funda.nl/en/en/koop/rotterdam/huis-42648673-courzandseweg-67/']
#df = pd.DataFrame()
dl = []  # Storage for data
dt = []  # Storage for column names
for url in urls:
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",}
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    dl_data = soup.find_all("dd")  # Scraping the data
    for dlitem in dl_data:
        dl.append(dlitem.text.strip())
    dt_data = soup.find_all("dt")  # Scraping the column names
    for dtitem in dt_data:
        dt.append(dtitem.text.strip())
df = pd.DataFrame(dl)  # Creating the dataframe
df = df.T  # Transposing it because otherwise it is 1D
df.columns = dt  # Giving the column names to the dataframe
Answer:
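Avoid juggling multiple lists; choose a leaner approach and store your data in a more structured way, for example as a dict. The dict comprehension below selects every <dd> that directly follows a <dt>, builds a dict that maps each <dt> label to its <dd> value, and appends that dict to data. Then simply create a DataFrame from this list of dicts:
data.append({e.find_previous_sibling('dt').text.strip(): e.text.strip() for e in soup.select('dt + dd')})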
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['https://www.funda.nl/en/koop/ridderkerk/huis-42649106-natalstraat-15/', 'https://www.funda.nl/en/en/koop/rotterdam/huis-42648673-courzandseweg-67/']
data = []  # One dict per URL
for url in urls:
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",}
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    # Map each <dt> label to the <dd> value that directly follows it
    data.append({e.find_previous_sibling('dt').text.strip(): e.text.strip() for e in soup.select('dt + dd')})
pd.DataFrame(data)
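To see how the label/value pairing behaves without making a network request, here is a minimal sketch that runs the same kind of dict comprehension on a small inline snippet. The markup in SAMPLE_HTML and the helper name parse_definition_list are made up for illustration and are not taken from funda.nl; the real pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a definition list (not real funda.nl HTML).
SAMPLE_HTML = """
<dl>
  <dt>Asking price</dt>
  <dd>EUR 350,000</dd>
  <dt>Status</dt>
  <dd>Available</dd>
  <dt>Heading without a value</dt>
  <dt>Living area</dt>
  <dd>120 m2</dd>
</dl>
"""


def parse_definition_list(html):
    """Return a dict mapping each <dt> label to the <dd> right after it."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # For every matched <dd>, look back to the nearest preceding <dt>.
        dd.find_previous_sibling("dt").get_text(strip=True): dd.get_text(strip=True)
        # "dt + dd" matches only a <dd> whose previous element sibling is a <dt>.
        for dd in soup.select("dt + dd")
    }


row = parse_definition_list(SAMPLE_HTML)
print(row)
```

In this sketch the <dt> that has no <dd> after it is simply skipped, so labels and values cannot drift out of step. Collecting one such dict per URL in a list and passing that list to pd.DataFrame gives one row per page, and a key that is missing on a given page shows up as NaN in that row.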
Answer:
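It looks like dl and dt do not have the same number of elements (75 and 71, respectively), so you cannot use dt for the column names. You can fix this by adding padding (for example, initializing the dt list with zeros) or by removing the unnecessary elements from the dl list.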
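One possible reading of the padding suggestion, as a minimal sketch: extend the label list with placeholder names until it is as long as the value list. The helper pad_labels, the "unnamed_" prefix, and the sample data are all hypothetical. Note that padding only makes the lengths match; it does not guarantee that each label ends up next to the value it belongs to.

```python
def pad_labels(labels, values, prefix="unnamed_"):
    """Return a copy of labels, extended with placeholders up to len(values)."""
    missing = len(values) - len(labels)
    # No placeholders are added when labels is already long enough.
    placeholders = [f"{prefix}{i}" for i in range(max(missing, 0))]
    return list(labels) + placeholders


# Hypothetical data: three values but only two labels.
dt = ["Asking price", "Status"]
dl = ["EUR 350,000", "Available", "Extra value"]

columns = pad_labels(dt, dl)
print(columns)  # ['Asking price', 'Status', 'unnamed_0']
```

With the lengths equal, assigning the padded list to df.columns no longer raises a length mismatch, but the placeholder columns still need to be checked by hand.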
