Concatenating multiple CSVs with the same column names

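I'm having trouble concatenating these pandas dataframes: I keep getting the error pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects. I'm also trying to make my code less clunky and run more smoothly. I'd also like to know whether there is a way to get multiple pages into one CSV using Python. Any help would be great.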

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd[]=any&city[]=any&prop_type[]=R&prop_type[]=P&prop_type[]=MH&active[]=1&year=2021&sort=G&page_number=1"

t = URL + "&page_number="
URL2 = t + "2"
URL3 = t + "3"

s = requests.Session()

data = []

page = s.get(URL,headers=headers)
page2 = s.get(URL2, headers=headers)
page3 = s.get(URL3, headers=headers)

soup = BeautifulSoup(page.content, "lxml")
soup2 = BeautifulSoup(page2.content, "lxml")
soup3 = BeautifulSoup(page3.content, "lxml")


for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup2.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup3.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])


df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data[2:], columns=data[1])
df3 = pd.DataFrame(data[3:], columns=data[2])

final = pd.concat([df1, df2, df3], axis=0)

final.to_csv('Street.csv', encoding='utf-8')

uj5u.com user replied:

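What happens?

As @Zach Young already mentioned, data already holds all the rows you want to convert into one dataframe. So this is not really a pandas problem; it is more a question of how the information is collected.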
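To make this concrete, here is a small offline sketch with made-up rows (the header names and values are placeholders, not data from the site). It mimics what the three loops in the question do: every tr, header rows included, lands in the same data list, so slicing it several times yields overlapping frames whose column labels are taken from ordinary data rows. The empty cells are an assumption, chosen to show one plausible way the column labels end up non-unique.

```python
import pandas as pd

# Made-up stand-in for the scraped rows: each page contributes its header row again.
header = ["Property ID", "Owner Name", "Notes", "Extra"]
data = [
    header,                        # header row of page 1
    ["1001", "OWNER A", "", ""],   # data row with two empty cells (assumption)
    ["1002", "OWNER B", "x", "y"],
    header,                        # header row of page 2, repeated
    ["1003", "OWNER C", "z", "w"],
]

df1 = pd.DataFrame(data[1:], columns=data[0])  # real header used as columns
df2 = pd.DataFrame(data[2:], columns=data[1])  # a data row used as columns

print(df1.columns.is_unique)  # True
print(df2.columns.is_unique)  # False: the two empty strings repeat

try:
    pd.concat([df1, df2], axis=0)
except Exception as exc:  # the exact exception class may vary by pandas version
    print(type(exc).__name__, exc)
```

Note that df1 alone already contains every row, which is why the fixes below focus on how the rows are collected rather than on the concatenation.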

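How to fix?

One approach based on the code in your question is to select the table data more specifically. Note the tbody in the selector, which excludes the header row: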

for row in soup.select('#propertysearchresults tbody tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])

When creating the dataframe, you can additionally set the column headers:

pd.DataFrame(data, columns=[c.get_text(' ',strip=True) for c in soup.select('#propertysearchresults thead td')])
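The two selectors can be tried offline on a tiny, made-up HTML snippet. The markup below is an assumption that mirrors what the selectors imply (header cells are td elements inside thead), and html.parser is used only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML with the assumed structure: header cells are <td> inside <thead>.
html = """
<table id="propertysearchresults">
  <thead><tr><td>Property ID</td><td>Owner Name</td></tr></thead>
  <tbody>
    <tr><td>1001</td><td>OWNER A</td></tr>
    <tr><td>1002</td><td>OWNER B</td></tr>
  </tbody>
</table>
"""

# html.parser ships with Python, so no extra parser is needed for this sketch.
soup = BeautifulSoup(html, "html.parser")

# Data rows only: 'tbody tr' skips the row inside <thead>.
rows = [
    [c.get_text(" ", strip=True) for c in row.select("td")]
    for row in soup.select("#propertysearchresults tbody tr")
]

# Column names come from the header cells.
columns = [
    c.get_text(" ", strip=True)
    for c in soup.select("#propertysearchresults thead td")
]

df = pd.DataFrame(rows, columns=columns)
print(df)
```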

Example

This shows how to iterate over the different pages of the website that contain your table:

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd[]=any&city[]=any&prop_type[]=R&prop_type[]=P&prop_type[]=MH&active[]=1&year=2021&sort=G&page_number=1"

s = requests.Session()

data = []
while True:

    page = s.get(URL,headers=headers)
    soup = BeautifulSoup(page.content, "lxml")

    for row in soup.select('#propertysearchresults tbody tr'):
        data.append([c.get_text(' ',strip=True) for c in row.select('td')])

    if (a := soup.select_one('#page_selector strong + a')):
        URL = "https://www.collincad.org" + a['href']
    else:
        break


pd.DataFrame(data, columns=[c.get_text(' ',strip=True) for c in soup.select('#propertysearchresults thead td')])

Output

	Property ID ↓ Geographic ID ↓	Owner Name	Property Address	Legal Description	2021 Market Value
1	2709013 R-10644-00H-0010-1	PARTHASARATHY SURESH & ANITHA HARIKRISHNAN	12209 Willowgate Dr Frisco, TX 75035	Ridgeview at Panther Creek Phase 2, Blk H, Lot 1	$513,019
...	...	...	...	...	...
61	2129238 R-4734-00C-0110-1	赫弗·阿倫	990 Willowgate Dr Prosper, TX 75078	Willow Ridge Phase 1, Blk C, Lot 11	$509,795

uj5u.com user replied:

Normally one would loop over the page numbers and concatenate a list of dataframes, but if you only have three pages, your code is fine.
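A minimal sketch of that loop-and-concatenate pattern follows. The fetch_rows function and the column names are hypothetical stand-ins for the real request and parsing step, so the example runs without network access.

```python
import pandas as pd

COLUMNS = ["Property ID", "Owner Name"]  # placeholder column names


def fetch_rows(page_number):
    """Hypothetical stand-in for requesting and parsing one page; returns data rows only."""
    fake_pages = {
        1: [["1001", "OWNER A"], ["1002", "OWNER B"]],
        2: [["1003", "OWNER C"]],
        3: [["1004", "OWNER D"]],
    }
    return fake_pages[page_number]


# One dataframe per page, all sharing the same column names.
frames = [pd.DataFrame(fetch_rows(n), columns=COLUMNS) for n in range(1, 4)]

# ignore_index=True renumbers the rows instead of repeating each page's own index.
final = pd.concat(frames, ignore_index=True)
print(final)
```

Because every frame is built with the same unique column names, the concatenation has nothing to reindex.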

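Because every for row in ... loop writes to the same data list, df1 is already your final dataframe; you only need to drop the rows that repeat the column names.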

final = df1[df1['Property ID ↓ Geographic ID ↓']!='Property ID ↓ Geographic ID ↓']
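The same filter can be checked on a few made-up rows. Only the first column name is taken from the line above; the second column and all values are placeholders.

```python
import pandas as pd

key = "Property ID ↓ Geographic ID ↓"
columns = [key, "Owner Name"]

# Made-up rows; the second and fourth entries are repeated header rows.
df1 = pd.DataFrame(
    [
        ["1001", "OWNER A"],
        columns,
        ["1002", "OWNER B"],
        columns,
        ["1003", "OWNER C"],
    ],
    columns=columns,
)

# Keep only rows whose first cell is not the header text, then renumber the index.
final = df1[df1[key] != key].reset_index(drop=True)
print(final)
```

The reset_index(drop=True) call is optional; without it the surviving rows keep their original index labels, leaving gaps where the header rows were.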

uj5u.com user replied:

Instead of your last few lines of code:

df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data[2:], columns=data[1])
df3 = pd.DataFrame(data[3:], columns=data[2])

final = pd.concat([df1, df2, df3], axis=0)

final.to_csv('Street.csv', encoding='utf-8')

You can use this (it avoids slicing into separate dataframes and concatenating them):

final = pd.DataFrame(data[1:], columns=data[0])   # Sets the first row as the column names
final = final.iloc[:,1:]   # Gets rid of the additional index column
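If the extra column you see in the CSV is the pandas row index that to_csv writes by default, passing index=False is an alternative to slicing it away. The sketch below uses made-up rows and an in-memory buffer in place of the Street.csv file.

```python
import io

import pandas as pd

final = pd.DataFrame(
    [["1001", "OWNER A"], ["1002", "OWNER B"]],
    columns=["Property ID", "Owner Name"],
)

# Write to in-memory buffers here; pass a path such as 'Street.csv' to write a file.
with_index = io.StringIO()
final.to_csv(with_index)

without_index = io.StringIO()
final.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,Property ID,Owner Name
print(without_index.getvalue().splitlines()[0])  # Property ID,Owner Name
```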

