在R中，如何撰寫在資料幀串列上運行t-test的函式-有解無憂

以iris資料集為例，我想撰寫一個用戶定義的函式

pairwise t-test在所有 4 列上運行Species，每個資料拆分的列除外
將結果匯出為 csv 檔案的 3 個作業表

請參閱下面的我的嘗試：

library(tidyr)
library(reshape) # for melting /stacking the data
library(multcomp) # for pairwise test 
library(xlsx) # export excel file with worksheet

options(scipen = 100)

# dataset
iris

data_stats <- function(data){
    # melt the dataframe
     df <- melt(data, id.vars=c('Species'),var='group')
    
# split the dataframe into three list of dataframe
    dfsplit<-split(df,df$column)
    
    #  pairwise t-test
    results <- pairwise.t.test(dfsplit$value, dfsplit$group,p.adjust.method = "BH")
    
    # export each result as a worksheet of an excel file
    write.xlsx(results, file="Results.xlsx", sheetName="versicolor_stats", row.names=FALSE)
    write.xlsx(results, file="Results.xlsx", sheetName="virginica_stats", append=TRUE, row.names=FALSE)
    write.xlsx(results, file="Results.xlsx", sheetName="setosa_stats", append=TRUE, row.names=FALSE)
    
}
# testing the code on iris data
data_stats(iris)

請評論并分享您的代碼。謝謝

uj5u.com熱心網友回復：

這是一個選項tidyverse- reshape to 'long' format with pivot_longer，然后用于執行group_modify，pairwise.t.test輸出tidy和輸出unnestlist

library(dplyr)
library(tidyr)
library(broom)
ttest_out <- iris %>% 
  pivot_longer(cols = -Species) %>% 
  group_by(Species) %>%
  group_modify(~ .x %>%
     summarise(out = list(pairwise.t.test(value, name) %>% 
       tidy))) %>% 
   ungroup %>% 
   unnest(out)

-輸出

ttest_out
# A tibble: 18 × 4
   Species    group1       group2         p.value
   <fct>      <chr>        <chr>            <dbl>
 1 setosa     Petal.Width  Petal.Length 1.77e- 54
 2 setosa     Sepal.Length Petal.Length 2.77e-132
 3 setosa     Sepal.Length Petal.Width  1.95e-156
 4 setosa     Sepal.Width  Petal.Length 1.61e- 86
 5 setosa     Sepal.Width  Petal.Width  1.13e-123
 6 setosa     Sepal.Width  Sepal.Length 4.88e- 71
 7 versicolor Petal.Width  Petal.Length 5.35e- 90
 8 versicolor Sepal.Length Petal.Length 3.78e- 52
 9 versicolor Sepal.Length Petal.Width  5.02e-125
10 versicolor Sepal.Width  Petal.Length 1.36e- 45
11 versicolor Sepal.Width  Petal.Width  3.46e- 44
12 versicolor Sepal.Width  Sepal.Length 1.25e- 95
13 virginica  Petal.Width  Petal.Length 1.39e- 90
14 virginica  Sepal.Length Petal.Length 6.67e- 22
15 virginica  Sepal.Length Petal.Width  3.47e-110
16 virginica  Sepal.Width  Petal.Length 2.35e- 68
17 virginica  Sepal.Width  Petal.Width  1.87e- 19
18 virginica  Sepal.Width  Sepal.Length 2.47e- 92

uj5u.com熱心網友回復：

更新：此問題的統計部分（適用pairwise.t.test于 iris 資料集）之前已在 SO 上回答。這是另一個解決方案。

公認的解決方案運行一系列成對的 t 檢驗并產生一列 p 值（正如問題所要求的那樣），但它不是一組非常有意義的 t 檢驗。您可能會懷疑，我們看到像 2.77e-132 這樣的 p 值，并且 group1 和 group2 是連續變數，而不是因子的水平。

這些測驗評估的假設是，對于每個物種，萼片是否與花瓣相同，長度是否與寬度相同。成對 t 檢驗程式旨在比較單個連續變數（例如萼片長度）跨所有水平的因子（例如物種）。

首先，讓我們應用pairwise.t.test到 column Sepal.Length，以便我們稍后可以檢查我們是否獲得了正確的 p 值。

library("broom")
library("tidyverse")

pairwise.t.test(iris$Sepal.Width, iris$Species)
#> 
#>  Pairwise comparisons using t tests with pooled SD 
#> 
#> data:  iris$Sepal.Width and iris$Species 
#> 
#>            setosa  versicolor
#> versicolor < 2e-16 -         
#> virginica  9.1e-10 0.0031    
#> 
#> P value adjustment method: holm

如果您曾經看過 iris 資料集，您就會知道這些 p 值“很有意義”：Virginica 和 Versicolor 彼此之間的相似性高于 Setosa。

所以現在讓我們以整潔的方式將測驗應用于四個數字列。

t_pvals <- iris %>%
  pivot_longer(
    -Species,
    names_to = "Variable",
    values_to = "x"
  ) %>%
  # The trick to performing the right tests is to group the tibble by Variable,
  # not by Species because Species is the grouping variable for the t-tests.
  group_by(
    Variable
  ) %>%
  group_modify(
    ~ tidy(pairwise.t.test(.x$x, .x$Species))
  ) %>%
  ungroup()
t_pvals
#> # A tibble: 12 × 4
#>    Variable     group1     group2      p.value
#>    <chr>        <chr>      <chr>         <dbl>
#>  1 Petal.Length versicolor setosa     1.05e-68
#>  2 Petal.Length virginica  setosa     1.23e-90
#>  3 Petal.Length virginica  versicolor 1.81e-31
#>  4 Petal.Width  versicolor setosa     2.51e-57
#>  5 Petal.Width  virginica  setosa     2.39e-85
#>  6 Petal.Width  virginica  versicolor 8.82e-37
#>  7 Sepal.Length versicolor setosa     1.75e-15
#>  8 Sepal.Length virginica  setosa     6.64e-32
#>  9 Sepal.Length virginica  versicolor 2.77e- 9
#> 10 Sepal.Width  versicolor setosa     5.50e-17
#> 11 Sepal.Width  virginica  setosa     9.08e-10
#> 12 Sepal.Width  virginica  versicolor 3.15e- 3

Sepal.Width 比較的 p 值位于底部。我們得到了預期的 p 值！

接下來我們對 p 值進行格式化，以使它們更易于觀察。

t_pvals <- t_pvals %>%
  mutate(
    across(
      p.value, rstatix::p_format,
      accuracy = 0.05
    )
  )
t_pvals
#> # A tibble: 12 × 4
#>    Variable     group1     group2     p.value
#>    <chr>        <chr>      <chr>      <chr>  
#>  1 Petal.Length versicolor setosa     <0.05  
#>  2 Petal.Length virginica  setosa     <0.05  
#>  3 Petal.Length virginica  versicolor <0.05  
#>  4 Petal.Width  versicolor setosa     <0.05  
#>  5 Petal.Width  virginica  setosa     <0.05  
#>  6 Petal.Width  virginica  versicolor <0.05  
#>  7 Sepal.Length versicolor setosa     <0.05  
#>  8 Sepal.Length virginica  setosa     <0.05  
#>  9 Sepal.Length virginica  versicolor <0.05  
#> 10 Sepal.Width  versicolor setosa     <0.05  
#> 11 Sepal.Width  virginica  setosa     <0.05  
#> 12 Sepal.Width  virginica  versicolor <0.05

最后我們將結果保存到檔案中。

t_pvals %>%
  write_csv("pairwse-t-tests-on-iris-data.csv")

轉載請註明出處，本文鏈接：https://www.uj5u.com/qianduan/443976.html

標籤：r 擅长列表数据框分裂

上一篇：從Excel到SQL實作SUMIF()函式

下一篇：Excel，如何動態修改列單元格直到下一個不同的值