python - How to take datetime into account in rolling outlier detection with pandas

Consider the following DataFrame:

         Temperature     Datetime       
1        24.72           2021-01-01 10:00:00        
2        25.76           2021-01-01 11:00:00         
3        40              2021-01-01 12:00:00   
4        25.31           2021-01-01 13:00:00       
5        26.21           2021-01-01 14:00:00   
6        26.59           2021-01-01 15:00:00
7        26.64           2021-01-01 20:00:00 
8        26.38           2021-01-01 21:00:00 
9        45              2021-01-01 22:00:00
10       26.23           2021-01-01 23:00:00  
...      ...             ...

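What we want to do is remove outliers. For example, the temperature at id 3 is 40, which is clearly an outlier, so we want to delete the entire row for id 3. We have already read this thread: Outlier detection based on the moving mean in Python.

That thread describes how outliers can be removed with the following code: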

# Import Libraries
import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date':[2.3,4.6,7.0,9.3,15.6,17.9]
})

# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1

# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()

# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']

# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)

# Drop flagged rows
df = df[df['drop_flag']!=1]
df = df.drop(['rolling_temp', 'diff', 'drop_flag'], axis=1)

However, we want to extend this so that the median restarts whenever data is missing. The DataFrame shown above contains an example of missing data:

         Temperature     Datetime       
1        24.72           2021-01-01 10:00:00        
2        25.76           2021-01-01 11:00:00         
3        40              2021-01-01 12:00:00   
4        25.31           2021-01-01 13:00:00       
5        26.21           2021-01-01 14:00:00   
6        26.59           2021-01-01 15:00:00
7        26.64           2021-01-01 20:00:00 <-- Reset due to missing data between this point and the one before  
8        26.38           2021-01-01 21:00:00 
9        45              2021-01-01 22:00:00
10       26.23           2021-01-01 23:00:00  
...      ...             ...

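We want the outlier-removal code to take the datetime into account as well. At id 7, the datetime is 5 hours after id 6, so we can conclude that data is missing. We want to reset the median there, because a rolling median/mean should not use data that is irrelevant to the point being checked. Gaps may last hours or even days, and a rolling median that ignores them will clean the data poorly. An ideal threshold would be 1 hour: if a row is not exactly one hour after the previous row, reset the median. Is this possible?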
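One way to sketch the reset described above is to label each run of gap-free rows as its own segment and compute the rolling median inside each segment separately. This is a minimal sketch, not code from the thread: the names `max_gap`, `threshold`, `segment`, and `cleaned` are illustrative, and it assumes that any gap longer than one hour counts as missing data.

```python
import pandas as pd

# Sample data from the question (ids 1-10).
df = pd.DataFrame({
    "Temperature": [24.72, 25.76, 40, 25.31, 26.21,
                    26.59, 26.64, 26.38, 45, 26.23],
    "Datetime": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 11:00", "2021-01-01 12:00",
        "2021-01-01 13:00", "2021-01-01 14:00", "2021-01-01 15:00",
        "2021-01-01 20:00", "2021-01-01 21:00", "2021-01-01 22:00",
        "2021-01-01 23:00",
    ]),
}, index=range(1, 11))

max_gap = pd.Timedelta("1h")  # assumption: a longer gap means missing data
threshold = 1.0               # same +/-1 threshold as the code above

# A new segment starts wherever the gap to the previous row exceeds max_gap.
# (Using .ne(max_gap) instead of .gt(max_gap) would reset on any irregular step.)
segment = df["Datetime"].diff().gt(max_gap).cumsum().rename("segment")

# Trailing 3-row rolling median, computed per segment so that no window
# ever spans a gap. The first two rows of each segment get NaN.
rolling_median = (
    df.groupby(segment)["Temperature"]
      .transform(lambda s: s.rolling(window=3).median())
)

# Comparisons against NaN are False, so rows without a full window are kept.
outlier = df["Temperature"].sub(rolling_median).abs().gt(threshold)
cleaned = df.loc[~outlier]
```

On the sample data this flags ids 3 and 9. Because the window is trailing, the first two rows after every reset cannot be checked, which is the price of never mixing data from both sides of a gap.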

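Answer from a uj5u.com user:

In my opinion, you should use the datetime functionality to compute the moving average: compute the mean temperature over the n hours around a given time, then compare the current temperature against it using a threshold.

Something like: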

import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'])

# Mean temperature in a 5-hour window centered on each timestamp
# (time-based windows with center=True need pandas 1.3 or later)
s = (df
 .rolling('5h', center=True, on='Datetime')
 ['Temperature'].mean()
)

# 10° diff, absolute threshold
mask = df['Temperature'].sub(s).abs().gt(10)
df['outlier'] = mask

# to drop the rows:
# df = df.loc[~mask]

Output:

    Temperature            Datetime  outlier
1         24.72 2021-01-01 10:00:00    False
2         25.76 2021-01-01 11:00:00    False
3         40.00 2021-01-01 12:00:00     True
4         25.31 2021-01-01 13:00:00    False
5         26.21 2021-01-01 14:00:00    False
6         26.59 2021-01-01 15:00:00    False
7         26.64 2021-01-01 20:00:00    False
8         26.38 2021-01-01 21:00:00    False
9         45.00 2021-01-01 22:00:00     True
10        26.23 2021-01-01 23:00:00    False
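Since the question asks about a rolling median rather than a mean, the same time-based window can be combined with `.median()`. The following is a hedged variant of the answer's approach, not something the answer itself proposes. It keeps the answer's `'5h'` window and 10° threshold, and the variable names are illustrative. Because the window is defined in hours rather than rows, readings on the far side of a long gap never enter it.

```python
import pandas as pd

# Sample data from the question (ids 1-10).
df = pd.DataFrame({
    "Temperature": [24.72, 25.76, 40, 25.31, 26.21,
                    26.59, 26.64, 26.38, 45, 26.23],
    "Datetime": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 11:00", "2021-01-01 12:00",
        "2021-01-01 13:00", "2021-01-01 14:00", "2021-01-01 15:00",
        "2021-01-01 20:00", "2021-01-01 21:00", "2021-01-01 22:00",
        "2021-01-01 23:00",
    ]),
}, index=range(1, 11))

# Median of all readings within a 5-hour window centered on each timestamp.
# Time-based windows with center=True need pandas 1.3 or later.
rolling_median = (
    df.rolling("5h", center=True, on="Datetime")["Temperature"].median()
)

# Flag rows that deviate from the local median by more than 10 degrees.
df["outlier"] = df["Temperature"].sub(rolling_median).abs().gt(10)

# Keep only the rows that were not flagged.
cleaned = df.loc[~df["outlier"]]
```

On the sample data this flags the same two rows as the mean-based version, ids 3 and 9. The median is less affected by the outlier itself, so a tighter threshold than 10° may be workable, though that would need checking against real data.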

