Winsorizing a column that contains NaN does not change the maximum value

Note that a similar question was asked a while ago but never answered (see "Winsorizing does not change the max value").

I am trying to winsorize a column of a DataFrame using winsorize from scipy.stats.mstats. If the column contains no NaN values, the procedure works correctly.

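However, NaN values seem to prevent the procedure from working at the top (but not the bottom) of the distribution. Whatever value I pass for nan_policy, the NaN values are treated as the maximum of the distribution. I suspect I am setting the options incorrectly somehow.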

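Below is an example that reproduces both the correct winsorizing when there are no NaN values and the problematic behavior I see when NaN values are present. Any help resolving this would be greatly appreciated.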

#Import
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize

# initialise data of lists.
data = {'Name':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'], 'Age':[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]}
 
# Create 2 DataFrames
df = pd.DataFrame(data)
df2 = pd.DataFrame(data)

# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan

# Winsorize Age in both DataFrames
winsorize(df['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')
winsorize(df2['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')

# Check min and max values of Age in both DataFrames
print('Max/min value of Age from dataframe without NaN values')
print(df['Age'].max())
print(df['Age'].min())

print()

print('Max/min value of Age from dataframe with NaN values')
print(df2['Age'].max())
print(df2['Age'].min())

Answer:

It looks as if nan_policy is being ignored. But winsorization is just clipping, so you can handle this with pandas.

def winsorize_with_pandas(s, limits):
    """
    s : pd.Series
        Series to winsorize
    limits : tuple of float
        Tuple of the percentages to cut on each side of the array, 
        with respect to the number of unmasked data, as floats between 0. and 1
    """
    return s.clip(lower=s.quantile(limits[0], interpolation='lower'), 
                  upper=s.quantile(1-limits[1], interpolation='higher'))


winsorize_with_pandas(df['Age'], limits=(0.1, 0.1))
0      3.0
1      3.0
2      3.0
3      4.0
4      5.0
5      6.0
6      7.0
7      8.0
8      9.0
9     10.0
10    11.0
11    12.0
12    13.0
13    14.0
14    15.0
15    16.0
16    17.0
17    18.0
18    18.0
19    18.0
Name: Age, dtype: float64

winsorize_with_pandas(df2['Age'], limits=(0.1, 0.1))
0      2.0
1      2.0
2      3.0
3      4.0
4      5.0
5      NaN
6      7.0
7      8.0
8      NaN
9     10.0
10    11.0
11    12.0
12    13.0
13    14.0
14    15.0
15    16.0
16    17.0
17    18.0
18    19.0
19    19.0
Name: Age, dtype: float64
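
The quantile-based bounds above need not coincide with a count-based cut, where the limits are taken with respect to the number of non-NaN values. As a minimal sketch, assuming the goal is to replace the lowest and highest `int(limit * n)` non-NaN values (truncation is an assumption here), the helper below sorts the non-NaN values, picks the bounds by position, and clips. The name `winsorize_by_count` is hypothetical and not part of pandas or scipy.

```python
import numpy as np
import pandas as pd


def winsorize_by_count(s, limits=(0.1, 0.1)):
    """Hypothetical helper: winsorize a Series by count, leaving NaN untouched.

    Assumes each limit is below 0.5 so the two bounds do not cross.
    """
    values = np.sort(s.dropna().to_numpy())  # NaN values are excluded here
    n = values.size
    if n == 0:
        return s.copy()  # nothing to winsorize
    k_low = int(limits[0] * n)   # number of values replaced at the bottom
    k_high = int(limits[1] * n)  # number of values replaced at the top
    lower = values[k_low]
    upper = values[n - k_high - 1]
    # clip leaves NaN entries as NaN and returns a new Series
    return s.clip(lower=lower, upper=upper)


ages = pd.Series(np.arange(1.0, 21.0), name='Age')
ages_nan = ages.copy()
ages_nan.loc[[5, 8]] = np.nan

full = winsorize_by_count(ages)
with_nan = winsorize_by_count(ages_nan)
print(full.min(), full.max())          # 3.0 18.0
print(with_nan.min(), with_nan.max())  # 2.0 19.0
print(with_nan.isna().sum())           # 2
```

Because the function returns a new Series rather than modifying its input, it can be called repeatedly on the same column without the bounds drifting.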

Answer:

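You could fill the missing values with the column mean, winsorize, and then keep only the results at the positions that were originally non-NaN.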

df2 = pd.DataFrame(data)

# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan

# mask of non nan
_m = df2['Age'].notna()
df2.loc[_m, 'Age'] = winsorize(df2['Age'].fillna(df2['Age'].mean()), limits=[0.1, 0.1])[_m]
print(df2['Age'].max())
print(df2['Age'].min())
# 18.0
# 3.0

Another option is to drop the NaN values before winsorizing.

df2.loc[_m, 'Age'] = winsorize(df2['Age'].loc[_m], limits=[0.1, 0.1])
print(df2['Age'].max())
print(df2['Age'].min())
# 19.0
# 2.0
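
The drop-NaN variant can be wrapped in a small function that does not modify the original column. This is only a sketch: the name `winsorize_ignoring_nan` is hypothetical, and it assumes the input is a float Series with a unique index.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize


def winsorize_ignoring_nan(s, limits=(0.1, 0.1)):
    """Hypothetical helper: winsorize only the non-NaN entries of a Series."""
    out = s.copy()
    mask = out.notna()
    if not mask.any():
        return out  # all NaN (or empty): nothing to winsorize
    # winsorize returns a masked array; np.asarray keeps just the data
    clipped = winsorize(out[mask].to_numpy(), limits=list(limits))
    out.loc[mask] = np.asarray(clipped)
    return out


ages_with_nan = pd.Series(np.arange(1.0, 21.0), name='Age')
ages_with_nan.loc[[5, 8]] = np.nan

result = winsorize_ignoring_nan(ages_with_nan)
print(result.max())  # 19.0
print(result.min())  # 2.0
```

The NaN positions stay NaN in the result, and the limits apply to the 18 remaining values, so one value is replaced on each side.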
