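I have a CSV file containing values for 1988-2020. The first two columns hold the year and month (1988,1 ... 2020,12). The remaining columns hold the value for each day of that month. However, values are missing for some dates, and some months are missing entirely. Missing values need to be replaced with "NA" or "-99", and leap years also need to be taken into account. I need to convert this data into the following format without skipping any date from 01-01-1988 to 31-12-2020.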
Column1,Column2
"dd-mm-yyyy", "Value"
Sample data
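Answer from a uj5u.com user:
Here I tried to reshape your data into rows from 01-01-1988 to 31-12-2020, adding NaN for the days in each missing month. I may have missed something, so don't hesitate to check it yourself and let me know if anything is wrong: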
import pandas as pd
import numpy as np
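# parse first two columns (year, month) as dates
# note: the nested-list form of parse_dates is deprecated since pandas 2.2;
# on newer versions, read the columns as-is and combine them with pd.to_datetime
df = pd.read_csv('/path_to_file/Data.csv', parse_dates=[[0, 1]], header=None)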
# rename columns
df.columns = range(0, 33)
# inspecting the df has shown that the row with index 2016-12-01 is invalid
# and the column 32 is empty
df = df.drop(columns=32).set_index(0).drop(index='2016-12-01')
# fill missing data in days 1-28 with -99; days 29-31 are left untouched
# because NaN there may simply mean the day does not exist in that month
df.iloc[:, 0:28] = df.iloc[:, 0:28].fillna(value=-99, axis=0)
# reformat index
df.index = df.index.strftime('%m-%Y')
print(df)
Intermediate output:
            1     2     3     4     5     6     7     8     9    10   ...    22  \
0                                                                     ...
01-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...   0.0
02-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...   0.0
03-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...   0.0
04-1988   0.0   0.0   0.0   0.0   0.0   0.0  36.1   0.0   0.0   0.0   ...   0.0
01-1990   9.8  13.0   0.0   0.0   0.0   0.0  16.8  26.0  10.0   0.0   ...   0.0
...       ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
08-2020   0.0   8.0  15.0   9.0   8.0  13.0   0.0  12.0   0.0   0.0   ...   0.0
09-2020  49.0  15.0   7.0   9.0   5.0  22.0   0.0   0.0   5.0   3.0   ...   0.0
10-2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...   0.0
11-2020  32.0   4.0  26.0  20.0  10.0   0.0  10.0   0.0   0.0   0.0   ...   0.0
12-2020   0.0   0.0   3.0   2.0   0.0   8.0   6.0  23.0   3.0   0.0   ...   0.0

           23    24    25    26    27    28    29    30    31
0
01-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
02-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   NaN   NaN
03-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
04-1988   0.0  35.3  48.0   0.0   4.0   0.0   0.0   0.0   NaN
01-1990   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
...       ...   ...   ...   ...   ...   ...   ...   ...   ...
08-2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  36.0
09-2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   NaN
10-2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
11-2020   0.0   0.0   0.0   0.0  32.0   2.0   5.0   0.0   NaN
12-2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

[199 rows x 31 columns]
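The NaN cells left in columns 29 to 31 of the intermediate output line up with calendar days that do not exist (for example 30 and 31 February 1988, a leap year). As a small sketch that is not part of the original answer, the number of such placeholder cells per row can be derived from the index label. The helper name `placeholder_days` is made up for illustration.

```python
import pandas as pd


def placeholder_days(label):
    """Count the day columns (out of 31) that cannot exist in this month.

    `label` uses the same '%m-%Y' format as the reformatted index.
    """
    month_start = pd.to_datetime(label, format='%m-%Y')
    return 31 - month_start.days_in_month


# compare a 31-day month, a leap-year February, a common-year February
# and a 30-day month
for label in ['01-1988', '02-1988', '02-1989', '04-1988']:
    print(label, placeholder_days(label))
```

For 02-1988 this gives 2 and for 04-1988 it gives 1, which matches the NaN cells in those rows of the intermediate output.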
# flatten df into 1-d array
flattened = df.to_numpy().flatten()
# expand months in df into days
valid_days = [pd.date_range(start=m, periods=pd.Period(m).days_in_month, freq='D')
              for m in df.index]
# create datetime index by days
valid_days = pd.to_datetime([item for sublist in valid_days
                             for item in sublist]).strftime('%d-%m-%Y')
# create new DataFrame from flattened values and daily index
# (the NaN cells dropped here are the placeholders for non-existent days 29-31;
# this assumes no real reading is missing on those days)
df_days = pd.DataFrame(flattened[~np.isnan(flattened)],
                       index=valid_days,
                       columns=['value'])
# create the range of all days within the period
total_days = pd.date_range('1988-01-01', '2020-12-31', freq='D').strftime('%d-%m-%Y')
# reindex df_days with all days within the period
df_days = df_days.reindex(total_days)
print(df_days)
Output:
            value
01-01-1988    0.0
02-01-1988    0.0
03-01-1988    0.0
04-01-1988    0.0
05-01-1988    0.0
...           ...
27-12-2020    0.0
28-12-2020    0.0
29-12-2020    0.0
30-12-2020    0.0
31-12-2020    0.0

12054 rows × 1 columns
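One thing worth checking: the flattening step drops every NaN and then pairs the remaining values with the expanded days, so it only works if each row has exactly as many non-NaN values as its month has days. A genuinely missing reading on day 29, 30 or 31 would break that pairing. A minimal sketch of such a check follows; `mismatched_months` and the toy `sample` frame are hypothetical and not from the original answer.

```python
import numpy as np
import pandas as pd


def mismatched_months(df):
    """Return index labels whose non-NaN count differs from the month length.

    Assumes `df` is indexed by '%m-%Y' labels and has one column per day 1..31.
    """
    expected = pd.to_datetime(df.index, format='%m-%Y').days_in_month.to_numpy()
    actual = df.notna().sum(axis=1).to_numpy()
    return list(df.index[actual != expected])


# toy data: February 1988 has 29 days, April 1988 has 30
sample = pd.DataFrame(np.zeros((2, 31)),
                      index=['02-1988', '04-1988'],
                      columns=range(1, 32))
sample.loc['02-1988', [30, 31]] = np.nan  # days that do not exist
sample.loc['04-1988', [30, 31]] = np.nan  # day 30 exists, so data is missing
print(mismatched_months(sample))
```

Any label reported here would need its missing readings filled before flattening.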
I would be glad to learn how to solve this in a more elegant way.
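As one possible alternative (a sketch under assumptions, not the original author's method), the table can be melted into one row per cell, converted to real dates with impossible ones such as 30 February coerced to NaT, and then reindexed over the full daily range. The column names `year` and `month`, the function name `wide_to_daily`, and the toy input are assumptions made for illustration. The sketch also assumes each (year, month) pair appears at most once.

```python
import numpy as np
import pandas as pd


def wide_to_daily(wide, start, end, fill_value=-99):
    """Reshape a year/month/day-1..31 table into one row per calendar day.

    Assumes `wide` has columns 'year', 'month' and 1..31 (illustrative names)
    and that each (year, month) pair occurs at most once.
    """
    # one row per (year, month, day) cell
    long = wide.melt(id_vars=['year', 'month'], var_name='day', value_name='value')
    # build real dates; impossible ones such as 30 February become NaT
    long['date'] = pd.to_datetime(long[['year', 'month', 'day']].astype(int),
                                  errors='coerce')
    daily = long.dropna(subset=['date']).set_index('date')['value'].sort_index()
    # reindex over every day in the period; absent months show up as NaN
    full = pd.date_range(start, end, freq='D')
    daily = daily.reindex(full)
    if fill_value is not None:
        daily = daily.fillna(fill_value)
    return pd.DataFrame({'date': full.strftime('%d-%m-%Y'),
                         'value': daily.to_numpy()})


# toy input: January and February 1988 present, March missing entirely
days = list(range(1, 32))
jan = [1988, 1] + [0.0] * 31
feb = [1988, 2] + [2.0] * 29 + [np.nan, np.nan]
wide = pd.DataFrame([jan, feb], columns=['year', 'month'] + days)
wide.loc[0, 5] = np.nan  # a missing reading on 5 January

out = wide_to_daily(wide, '1988-01-01', '1988-03-31')
print(out.head())
```

Because real dates are built before any filtering, a missing reading on day 29, 30 or 31 is handled the same way as any other gap. Passing `fill_value=None` leaves the gaps as NaN, which `DataFrame.to_csv` can then write as "NA" through its `na_rep` parameter.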
Tags: Python, pandas, numpy
