Extra spaces inside values (not trailing, not quoted) when reading a whitespace-delimited CSV

I am trying to use pandas to read the file found here (the URL is in the snippet below). I saved it to a local directory. I am forced to use Python 3.6.

import requests
r = requests.get('https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/TU_Stundenwerte_Beschreibung_Stationen.txt')

with open('DWD_weather_stations.txt','w') as fd:
    fd.write(r.text)

First I tried this:

import pandas as pd

weather_stations = pd.read_csv("DWD_weather_stations.txt",
                                sep=r"\s+",  # any run of whitespace
                                header=0,
                                skiprows=[1],  # dashed rule under the header
                                skipinitialspace=True,
                                engine='python')

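The error I get is:

ParserError: Error tokenizing data. C error: Expected 8 fields in line 10, saw 9

At first I did not understand what the problem was, so I tried a different separator: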

weather_stations = pd.read_csv("DWD_weather_stations.txt", sep=r'\s{2,}', header=[0], skiprows=[1], engine='python')

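Now something strange happens: every column except the last is taken as the index, and the last column is recognized as the only column. Thinking back to the earlier error, I saw that in line 9 (the last one shown in the excerpt) the column named Stationsname (the second-to-last) contains an extra single space inside the value.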

Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge Stationsname Bundesland
----------- --------- --------- ------------- --------- --------- ----------------------------------------- ----------
00003 19500401 20110331            202     50.7827    6.0941 Aachen                                   Nordrhein-Westfalen                                                                               
00044 20070401 20211221             44     52.9336    8.2370 Großenkneten                             Niedersachsen
00052 19760101 19880101             46     53.6623   10.1990 Ahrensburg-Wulfsdorf                     Schleswig-Holstein                                                                                
00071 20091201 20191231            759     48.2156    8.9784 Albstadt-Badkap                          Baden-Württemberg                                                                                 
00073 20070401 20211221            340     48.6159   13.0506 Aldersbach-Kriestorf                     Bayern                                                                                            
00078 20041101 20211221             65     52.4853    7.9126 Alfhausen                                Niedersachsen                                                                                     
00091 20040901 20211221            300     50.7446    9.3450 Alsfeld-Eifa                             Hessen                                                                                            
00096 20190409 20211221             50     52.9437   12.8518 Neuruppin-Alt Ruppin                     Brandenburg                                                                                       
00102 20020101 20211221             32     53.8633    8.1275 Leuchtturm Alte Weser                    Niedersachsen

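Other rows have the same problem. I tried the skiprows argument, but it did not work properly: after changing which rows to skip, the line number in the error changed, so I guess the line endings are not being recognized correctly.

A brute-force solution would be to edit the rows by hand and replace the stray spaces inside values with dashes. But I would like to know whether there is a better solution.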

Answer:

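I don't think there is a simple solution, because the file does not follow any CSV convention. If we treat the space as the delimiter, the first three columns are separated by a single space, but that same character can also appear inside the values of the last two columns. So a space cannot be used as the separator.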
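To make the point concrete, here is a small sketch that splits the header and one data row with the two regular expressions tried in the question. The row is retyped from the excerpt, so its exact padding is an assumption; only the presence of single versus multiple spaces matters here.

```python
import re

# Header and one data row in the layout shown in the question. The station
# name "Neuruppin-Alt Ruppin" contains a single space.
header = "Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge Stationsname Bundesland"
row = ("00096 20190409 20211221             50     52.9437   12.8518 "
       "Neuruppin-Alt Ruppin                     Brandenburg")

# Any run of whitespace as separator: the station name is cut in two,
# so the row has one field more than the header.
fields_any = re.split(r"\s+", row.strip())
print(len(header.split()), len(fields_any))

# Two or more whitespace characters as separator: boundaries made of a
# single space are no longer split, so neighbouring columns merge.
fields_wide = re.split(r"\s{2,}", row.strip())
print(fields_wide)

# The header only contains single spaces, so it stays one field.
print(re.split(r"\s{2,}", header))
```

With `\s{2,}` the header collapses to a single name while each data row still yields several fields, which is consistent with pandas treating the leading fields as an index, as described in the question.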

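The correct separator is the fixed number of characters in each column. `read_csv` has no option for that kind of separator (pandas does provide a separate `read_fwf` function for fixed-width files, which is not used here). My suggestion is to download the file as txt, read it, and split each row into columns by character count; then write a CSV and read that with pandas without trouble.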
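For comparison with the manual conversion described in this answer, pandas also ships `read_fwf`, which reads fixed-width files directly. The following is only a sketch: it assumes the column widths listed in this answer are right for the whole file, and it runs on two synthetic rows built by a hypothetical `make_row` helper rather than on the real download.

```python
import io
import pandas as pd

# Column widths as given in the answer; names taken from the file's header line.
widths = [5, 9, 9, 15, 12, 11, 41, 41]
names = ["Stations_id", "von_datum", "bis_datum", "Stationshoehe",
         "geoBreite", "geoLaenge", "Stationsname", "Bundesland"]

def make_row(sid, von, bis, height, lat, lon, name, land):
    # Hypothetical helper: lays out one record in the assumed widths
    # (the sixth column is 10 characters plus one separating space).
    return f"{sid:<5}{von:>9}{bis:>9}{height:>15}{lat:>12}{lon:>10} {name:<41}{land}"

sample = "\n".join([
    " ".join(names),                          # header line
    " ".join("-" * len(n) for n in names),    # stand-in for the dashed rule
    make_row("00096", "20190409", "20211221", "50", "52.9437", "12.8518",
             "Neuruppin-Alt Ruppin", "Brandenburg"),
    make_row("00102", "20020101", "20211221", "32", "53.8633", "8.1275",
             "Leuchtturm Alte Weser", "Niedersachsen"),
])

stations = pd.read_fwf(
    io.StringIO(sample),         # for the real file, pass "DWD_weather_stations.txt"
    widths=widths,
    skiprows=2,                  # header and rule are not laid out in these widths
    header=None,
    names=names,
    dtype={"Stations_id": str},  # keep the leading zeros
)
print(stations[["Stations_id", "Stationsname", "Bundesland"]])
```

The header is skipped and the names are passed explicitly because the header line is separated by single spaces and does not line up with the data columns.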

First I determined the width of each column:

columns = [5, 9, 9, 15, 12, 11, 41, 41]
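One way to sanity-check such a list is to turn the widths into start and end offsets and slice a single row with them. This is a sketch under the assumption that the widths above match the data rows; the row is assembled with explicit padding so its layout is unambiguous.

```python
from itertools import accumulate

columns = [5, 9, 9, 15, 12, 11, 41, 41]

# Cumulative end offsets (5, 14, 23, ...) and the matching start offsets.
ends = list(accumulate(columns))
starts = [0] + ends[:-1]

# One record laid out in the assumed widths.
row = ("00096" + " 20190409" + " 20211221"
       + "50".rjust(15) + "52.9437".rjust(12) + "12.8518".rjust(10) + " "
       + "Neuruppin-Alt Ruppin".ljust(41) + "Brandenburg")

# Slice each column out of the row and trim the padding.
fields = [row[s:e].strip() for s, e in zip(starts, ends)]
print(fields)
# ['00096', '20190409', '20211221', '50', '52.9437', '12.8518', 'Neuruppin-Alt Ruppin', 'Brandenburg']
```

If a width were off by one, a neighbouring value would leak into the slice, which makes mistakes easy to spot on a row or two before converting the whole file.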

Then I converted the txt to CSV by splitting each row:

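from itertools import islice

csv_file = []
with open('DWD_weather_stations.txt') as file:
    for i, row in enumerate(file.readlines()):
        if i == 0:
            # Header line: names are separated by single spaces, so a plain split works
            headers = row.split()
            csv_file.append(";".join(headers))
            continue
        elif i == 1:
            # Second line is the dashed rule under the header; skip it
            continue

        # Consume the row through one iterator, taking a fixed-width chunk per column
        iter_row = iter(row)
        csv_row = ';'.join(''.join(islice(iter_row, None, x)).strip() for x in columns)
        csv_file.append(csv_row)

# Write the converted rows, one per line
with open('DWD_weather_stations.csv', "w") as file:
    file.write("\n".join(csv_file))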

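A few comments:

The first line is the header; it is parsed separately because whitespace can safely be used as its separator, and its column widths differ from those of the other rows.
For the same reason I skipped the second line.
The delimiters used for the CSV are ; for columns and \n for rows.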
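Reading the converted file back is then a plain `read_csv` call with `;` as the separator. The sketch below uses an in-memory stand-in for `DWD_weather_stations.csv` with two rows in the format the conversion writes; the `dtype` argument is an addition not found in the answer, used here so the station IDs keep their leading zeros.

```python
import io
import pandas as pd

# Stand-in for the contents of DWD_weather_stations.csv.
csv_text = (
    "Stations_id;von_datum;bis_datum;Stationshoehe;geoBreite;geoLaenge;Stationsname;Bundesland\n"
    "00096;20190409;20211221;50;52.9437;12.8518;Neuruppin-Alt Ruppin;Brandenburg\n"
    "00102;20020101;20211221;32;53.8633;8.1275;Leuchtturm Alte Weser;Niedersachsen"
)

weather_stations = pd.read_csv(
    io.StringIO(csv_text),       # for the real file, pass "DWD_weather_stations.csv"
    sep=";",
    dtype={"Stations_id": str},  # assumption: IDs should stay zero-padded strings
)
print(weather_stations[["Stations_id", "Stationsname", "Bundesland"]])
```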
