T-tests on multiple columns of a DataFrame

The DataFrame looks like this:

decade     rain     snow
1910       0.2      0.2
1910       0.3      0.4
2000       0.4      0.5
2010       0.1      0.1

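I would like some help with a Python function that runs a t-test for every pair of decades on a given column. The function works well, except that it does not accept the column to test (such as rain or snow) as an input.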

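import pandas as pd
from itertools import combinations
from scipy import stats as st

# `data` is the DataFrame shown above.
def ttest_run(c1, c2):
    # Assumption: the column is hard-coded (here 'rain'); this is the part
    # that should become an argument.
    cat1 = data[data['decade'] == c1]['rain']
    cat2 = data[data['decade'] == c2]['rain']
    results = st.ttest_ind(cat1, cat2, nan_policy='omit')
    df = pd.DataFrame({'dec1': c1,
                       'dec2': c2,
                       'tstat': results.statistic,
                       'pvalue': results.pvalue},
                      index=[0])
    return df

# One t-test per unordered pair of decades
df_list = [ttest_run(i, j) for i, j in combinations(data['decade'].unique().tolist(), 2)]

final_df = pd.concat(df_list, ignore_index=True)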

Answer from a uj5u.com user:

I think you want something like this:

import pandas as pd
from itertools import combinations
from scipy import stats as st


d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'], 
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4], 
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)


def all_pairwise(df, compare_col='decade'):
    # Every unordered pair of distinct values in the grouping column
    decade_pairs = [(i, j) for i, j in combinations(df[compare_col].unique().tolist(), 2)]
    # Test every column except the grouping column
    # (or add a list of column names to the function signature)
    cols = list(df.columns)
    cols.remove(compare_col)
    list_of_dfs = []
    for pair in decade_pairs:
        for col in cols:
            # Values of `col` for each of the two decades being compared
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            # One result row per (pair, column), indexed by the column name
            tmp = pd.DataFrame({'dec1': pair[0],
                                'dec2': pair[1],
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index=[col])
            list_of_dfs.append(tmp)
    df_stats = pd.concat(list_of_dfs)
    return df_stats

df_stats = all_pairwise(df)
df_stats

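If you run this code now, you will get runtime warnings: decades with too few data points cause a division by zero when the t-statistic is computed, and that is what produces the NaN values in the output.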

>>> df_stats
      dec1  dec2     tstat    pvalue
rain  1910  2000       NaN       NaN
snow  1910  2000       NaN       NaN
rain  1910  2010       NaN       NaN
snow  1910  2010       NaN       NaN
rain  1910  1990  0.000000  1.000000
snow  1910  1990  0.436436  0.685044
rain  2000  2010       NaN       NaN
...
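To see where the NaN values come from, here is a minimal sketch that runs the same test on groups of different sizes. The variable names are illustrative, and the warnings are silenced only to keep the demo output short; the values are taken from the sample data above.

```python
import warnings

import numpy as np
from scipy import stats as st

# One observation per group: the sample variance has no degrees of freedom left.
single_a = [0.3]  # the only 'rain' value for decade 2000
single_b = [0.1]  # the only 'rain' value for decade 2010

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the runtime warnings for this demo only
    too_few = st.ttest_ind(single_a, single_b, nan_policy='omit')

# Two and four observations: enough data for a defined result.
enough = st.ttest_ind([0.2, 0.3], [0.1, 0.2, 0.3, 0.4], nan_policy='omit')

print(too_few.statistic, too_few.pvalue)
print(enough.statistic, enough.pvalue)
```

The first call yields NaN for both the statistic and the p-value, while the second corresponds to the rain row for 1910 versus 1990 in the table above.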

If you do not want all columns but only a specified subset, change the function signature/definition line to:

def all_pairwise(df, cols, compare_col = 'decade'):

where cols should be an iterable of string column names (a list works fine). You also need to remove these two lines:

    cols = list(df.columns)
    cols.remove(compare_col)

from the function body; everything else works as before.
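Putting those two changes together, the variant that takes an explicit column list might look like the following sketch. It reuses the sample data from the answer; the loop variable names `g1`/`g2` and the result name `stats_rain` are illustrative choices, not part of the original answer.

```python
import pandas as pd
from itertools import combinations
from scipy import stats as st

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)


def all_pairwise(df, cols, compare_col='decade'):
    """Run an independent two-sample t-test per column for every pair of groups."""
    list_of_dfs = []
    for g1, g2 in combinations(df[compare_col].unique().tolist(), 2):
        for col in cols:
            # Values of `col` for each of the two groups being compared
            c1 = df.loc[df[compare_col] == g1, col]
            c2 = df.loc[df[compare_col] == g2, col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            # One result row per (pair, column), indexed by the column name
            list_of_dfs.append(pd.DataFrame({'dec1': g1,
                                             'dec2': g2,
                                             'tstat': results.statistic,
                                             'pvalue': results.pvalue},
                                            index=[col]))
    return pd.concat(list_of_dfs)


# Only the 'rain' column is tested; 'snow' is ignored.
stats_rain = all_pairwise(df, cols=['rain'])
print(stats_rain)
```

With the sample data this still emits runtime warnings, because some decades have only one record.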

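Unless you filter out the decades with too few records before passing the DataFrame to the function, you will always get these runtime warnings.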
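One way to do that filtering is sketched below. The threshold `min_records = 2` and the name `df_filtered` are assumptions made for illustration; pick whatever minimum makes sense for your data.

```python
import pandas as pd

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

min_records = 2  # assumed threshold: a sample variance needs at least two values

# Number of rows in each row's decade, aligned with the original index
rows_per_decade = df['decade'].map(df['decade'].value_counts())

# Keep only rows that belong to decades with enough records
df_filtered = df[rows_per_decade >= min_records]

print(sorted(df_filtered['decade'].unique()))  # ['1910', '1990']
```

Passing `df_filtered` instead of `df` to the function leaves only the 1910 versus 1990 comparison for this sample data.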

Here is an example call to the version that accepts a list of columns as an argument, showing the runtime warnings.

>>> all_pairwise(df, cols=['rain'])
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3723: RuntimeWarning: Degrees of freedom <= 0 for slice
  return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
      dec1  dec2  tstat  pvalue
rain  1910  2000    NaN     NaN
rain  1910  2010    NaN     NaN
rain  1910  1990    0.0     1.0
rain  2000  2010    NaN     NaN
rain  2000  1990    NaN     NaN
rain  2010  1990    NaN     NaN
>>>


Tags: Python, function, scipy, scipy.stats
