Some dates in the date index are interpreted incorrectly after importing from CSV

I want to analyze a DataFrame in Python. I loaded a CSV that consists of two columns: one with a date/time and one with a mean value.
I load the data like this:

df = pd.read_csv('L_strom_30974_gerundet.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()

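The problem is that some dates seem to be misinterpreted by Python. The CSV only ranges from 01.01.2009 00:00 to 04.10.2010 23:45 (original format). But when I load the file into Python, it also shows dates in November and December 2010, both in the plot and in df.info: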

PeriodIndex: 61628 entries, 2009-01-01 00:00 to 2010-12-09 23:45
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mean    61628 non-null  float64
dtypes: float64(1)

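I searched the CSV for values at those times but could not find any. Also, the number of entries in df.info matches the number of rows in my CSV, so I think some dates must be interpreted incorrectly.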

After the import, the tail of my DataFrame looks like this:

                  Mean
Timestamp             
2010-12-09 22:45   186
2010-12-09 23:00   206
2010-12-09 23:15   168
2010-12-09 23:30   150
2010-12-09 23:45   132

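I searched for similar questions but could not find an explanation for why most of the data is interpreted correctly while some of it is not. Any ideas?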

Answer from a uj5u.com user:

The apparent need for infer_datetime_format=True gives away that you are not passing the correct format. Have a look at the strftime documentation. You are using:

format='%d.%m.%y %H:%M'
# %y = Year without century as a zero-padded decimal number: 09, 10

But the format you need is:

format='%d.%m.%Y %H:%M'
# %Y = Year with century as a decimal number: 2009, 2010
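
The difference between the two directives can be checked without pandas. The following is a minimal sketch using the standard library's datetime.strptime, which follows the same format codes; the sample string is simply the last timestamp mentioned in the question.

```python
from datetime import datetime

text = "04.10.2010 23:45"

# %Y expects a four-digit year, so the whole string matches.
ok = datetime.strptime(text, "%d.%m.%Y %H:%M")

# %y consumes only two digits ("20"), so the rest of the string
# no longer lines up with the format and parsing fails.
try:
    datetime.strptime(text, "%d.%m.%y %H:%M")
    y_failed = False
except ValueError:
    y_failed = True

print(ok, y_failed)
```

A strict parser therefore rejects the %y format outright instead of silently producing wrong dates.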

Apparently, infer_datetime_format cannot infer every string correctly and takes days for months and vice versa. Indeed, let's reproduce the error:

Create the csv:

import pandas as pd
import numpy as np

data = {'Timestamp': pd.date_range('01-01-2009', '10-04-2010', freq='H'),
        'Mean': np.random.randint(0,10,15385)}

df_orig = pd.DataFrame(data)
df_orig['Timestamp'] = df_orig['Timestamp'].dt.strftime('%d.%m.%Y %H:%M')
df_orig.to_csv('test.csv', sep=';', index=None, header=None)

# csv like:

01.01.2009 00:00;7
01.01.2009 01:00;6
01.01.2009 02:00;0
01.01.2009 03:00;2
01.01.2009 04:00;3

Loading the csv incorrectly:

df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])

df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%y %H:%M', 
                                 infer_datetime_format=True)
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df = df.sort_index()

df.info() # note the incorrect `PeriodIndex`, ending with `2010-12-09 23:00`

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-12-09 23:00
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Mean    15385 non-null  int64
dtypes: int64(1)
memory usage: 240.4 KB
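
The end date 2010-12-09 is what you would expect if 12.09.2010 (12 September) were read month-first. The sketch below illustrates that pattern with a hypothetical helper, month_first_then_day_first, which is not pandas's actual implementation. It assumes a lenient parser that tries month.day.year first and falls back to day.month.year, which would also explain why only some rows are affected.

```python
from datetime import datetime

def month_first_then_day_first(text):
    """Hypothetical lenient parser: try month.day.year, then day.month.year."""
    for fmt in ("%m.%d.%Y %H:%M", "%d.%m.%Y %H:%M"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"cannot parse {text!r}")

# Day 12 is also a valid month number, so day and month are silently swapped.
swapped = month_first_then_day_first("12.09.2010 23:45")

# Day 13 is not a valid month, so the fallback yields the intended date.
correct = month_first_then_day_first("13.09.2010 23:45")

print(swapped)  # a date in December
print(correct)  # a date in September
```

Under this assumption, every timestamp whose day is 12 or less ends up in the wrong month, while days 13 to 31 are parsed as intended.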

Loading the csv correctly:

df = pd.read_csv('test.csv', sep=';', names=['Timestamp', 'Mean'])
df['Timestamp'] = pd.to_datetime(df.Timestamp,format= '%d.%m.%Y %H:%M')
df.set_index('Timestamp', inplace=True)
df.index = pd.DatetimeIndex(df.index).to_period('15T')
df.info()

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 15385 entries, 2009-01-01 00:00 to 2010-10-04 00:00
Freq: 15T
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Mean    15385 non-null  int64
dtypes: int64(1)
memory usage: 240.4 KB
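
To guard against this kind of silent misparse, it can help to verify the parsed column right after loading. The following is a minimal sketch, assuming the same semicolon-separated layout as above; the three sample rows and the names round_trip_ok and in_range are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample rows in the same layout as the CSV in the question.
raw = io.StringIO(
    "11.09.2010 23:45;150\n"
    "12.09.2010 23:45;132\n"
    "04.10.2010 23:45;186\n"
)

df = pd.read_csv(raw, sep=";", names=["Timestamp", "Mean"])

# Strict parse: only an explicit format, so a mismatch raises instead of guessing.
parsed = pd.to_datetime(df["Timestamp"], format="%d.%m.%Y %H:%M")

# Round-trip check: formatting the parsed values must reproduce the input text.
round_trip_ok = (parsed.dt.strftime("%d.%m.%Y %H:%M") == df["Timestamp"]).all()

# Range check: nothing may fall after the last timestamp known to be in the file.
in_range = parsed.max() <= pd.Timestamp("2010-10-04 23:45")

print(round_trip_ok, in_range)
```

If either check fails on real data, the format string is the first thing to inspect.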
