Duplicated values when using inner_join()

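I am trying to use t.test() to compare means across multiple columns of a data frame, where the values to be compared sit in each column. Every row also carries several metadata columns (Date, Assay, Timing). My data looks like df below. The data collected are not paired, and meas1 and meas2 are unrelated, different measurements. The comparison I want is between meas1[Timing == "Start"] and meas1[Timing == "End"] for each date, each assay, and each test. My real data has about 10 measurement columns, which affects the syntax I can use for some of the subsetting.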

library(tidyverse)

df <- data.frame(Date=rep(c("2022-01-01","2022-01-02"), each = 18),
    Assay = rep(c("Gly", "Asp", "Con"), each = 3, times = 4),
    Timing = c(rep("Start",9),rep("End",9)),
    meas1=round(rnorm(36,5,3),0),
    meas2=round(rnorm(36,8,9),0))

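I have tried a few different approaches. One was to inner_join() a separate data frame of metadata with a pivot_longer() version of the data to bring everything together, but I do not get the result I expect.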

comp <- list(Assay = c("Gly","Asp","Con"),
    first = "Start",
    last = "End",
    test = names(df %>% select(-Date,-Assay,-Timing))) %>%
    cross_df()

df_pivot <- df %>%
    pivot_longer(c(-Date,-Assay,-Timing), names_to = "test")

t_tests <- comp %>%
    inner_join(df_pivot, by = c("Assay", "test", "first"="Timing")) %>%
    rename(initial = value) %>%
    inner_join(df_pivot, by = c("Date", "Assay", "test", "last"="Timing")) %>%
    rename(final = value)

t_tests

# A tibble: 108 × 7
   Assay first last  test  Date       initial final
   <chr> <chr> <chr> <chr> <chr>        <dbl> <dbl>
 1 Gly   Start End   meas1 2022-01-01       8     8
 2 Gly   Start End   meas1 2022-01-01       8     9
 3 Gly   Start End   meas1 2022-01-01       8     4
 4 Gly   Start End   meas1 2022-01-01       4     8
 5 Gly   Start End   meas1 2022-01-01       4     9
 6 Gly   Start End   meas1 2022-01-01       4     4
 7 Gly   Start End   meas1 2022-01-01      -1     8
 8 Gly   Start End   meas1 2022-01-01      -1     9
 9 Gly   Start End   meas1 2022-01-01      -1     4
10 Gly   Start End   meas1 2022-01-02       6     1
# … with 98 more rows
# ℹ Use `print(n = ...)` to see more rows

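Each initial value is repeated for every distinct final value, which is not what I want because the data are not paired. I am trying to get only 36 rows: 2 dates, 3 assays, 2 tests, and 6 values per test (3 values in each of 2 columns). In other words, rows 1:9 should collapse to 3 rows (rows 1, 5, and 9) that contain only the unique initial and final values. This is where I need help. The 1, 5, 9 pattern should repeat, but I would like to avoid slicing the data after the fact.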
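The duplication comes from the join keys not identifying a single row on either side. A minimal sketch of that effect, using made-up toy tibbles (start, end, and dup are hypothetical names, and the values are copied from the first rows of the output above purely for illustration):

```r
library(dplyr)

# Hypothetical toy data: one Assay/test group with three unpaired values per timing
start <- tibble(Assay = "Gly", test = "meas1", initial = c(8, 4, -1))
end   <- tibble(Assay = "Gly", test = "meas1", final   = c(8, 9, 4))

# Every Start row matches every End row on the keys,
# so the join returns all 3 x 3 = 9 combinations
dup <- inner_join(start, end, by = c("Assay", "test"))
nrow(dup)
```

Across the 12 Date/Assay/test groups in df this gives 12 x 9 = 108 rows, which matches the tibble above. Recent dplyr versions also emit a many-to-many relationship warning for a join like this.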

Assuming that part were done correctly, I would continue as follows, which gives me the summary of t.test() results I want:


t_tests <- t_tests %>%
    mutate(first = NULL, last = NULL) %>%
    group_by(Date,Assay,test) %>%
    group_modify(~broom::tidy(t.test(.x$initial,.x$final))) %>% ungroup()

t_tests
# A tibble: 12 × 13
   Date       Assay test    estimate estimate1 estimate2 statistic   p.value parameter conf.low conf.high method                  alternative
   <chr>      <chr> <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <chr>                   <chr>      
 1 2022-01-01 Asp   values1   -2.33       5.67     8        -1.79  0.0989        11.7   -5.18       0.511 Welch Two Sample t-test two.sided  
 2 2022-01-01 Asp   values2    9.33       6.67    -2.67      2.14  0.0643         8.16  -0.698     19.4   Welch Two Sample t-test two.sided  
 3 2022-01-01 Con   values1    3.67       6.67     3         2.17  0.0552        10.1   -0.0984     7.43  Welch Two Sample t-test two.sided  
 4 2022-01-01 Con   values2   -8.33       2.67    11        -2.93  0.0110        14.1  -14.4       -2.23  Welch Two Sample t-test two.sided  
 5 2022-01-01 Gly   values1    0.333      5        4.67      0.343 0.737         13.1   -1.76       2.43  Welch Two Sample t-test two.sided  
 6 2022-01-01 Gly   values2   -0.333      5.67     6        -0.100 0.922         11.5   -7.63       6.96  Welch Two Sample t-test two.sided  
 7 2022-01-02 Asp   values1    2          6        4         1.36  0.193         16     -1.12       5.12  Welch Two Sample t-test two.sided  
 8 2022-01-02 Asp   values2   11         11.7      0.667     2.02  0.0731         9.27  -1.26      23.3   Welch Two Sample t-test two.sided  
 9 2022-01-02 Con   values1   -2          4.33     6.33     -1.75  0.0999        15.4   -4.43       0.429 Welch Two Sample t-test two.sided  
10 2022-01-02 Con   values2   11         11.3      0.333     5.64  0.0000761     13.2    6.79      15.2   Welch Two Sample t-test two.sided  
11 2022-01-02 Gly   values1   -2.33       3        5.33     -4.43  0.000594      13.8   -3.47      -1.20  Welch Two Sample t-test two.sided  
12 2022-01-02 Gly   values2    1          6        5         0.267 0.793         14.5   -7.00       9.00  Welch Two Sample t-test two.sided

Thanks in advance!

Answer:

You need to add a run_id within each Date/Assay/Timing group so that you can use it as a join key and avoid the duplication.

There is a clue in what you wrote:

"I am trying to get only 36 rows: 2 dates, 3 assays, 2 tests, and 6 values per test (3 values in each of 2 columns)"

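You have a Date column with 2 unique dates, an Assay column with 3 unique assays, and a test column with 2 unique tests. You also need a column with 3 unique values to account for the "3 values in each of 2 columns". I will call that column run_id.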

I will also skip the comp data frame and essentially do a self-join:

pivot2 = df %>%
  group_by(Date, Assay, Timing) %>%
  mutate(run_id = row_number()) %>%
  ungroup() %>%
  pivot_longer(starts_with("meas"), names_to = "test") 

t_tests = 
  full_join(
    filter(pivot2, Timing == "Start") %>% select(-Timing, initial = value),
    filter(pivot2, Timing == "End") %>% select(-Timing, final = value),
    by = c("Date", "Assay", "run_id", "test")
  )
# # A tibble: 36 × 6
#    Date       Assay run_id test  initial final
#    <chr>      <chr>  <int> <chr>   <dbl> <dbl>
#  1 2022-01-01 Gly        1 meas1       1    -1
#  2 2022-01-01 Gly        1 meas2       4     7
#  3 2022-01-01 Gly        2 meas1       0     1
#  4 2022-01-01 Gly        2 meas2      10     8
#  5 2022-01-01 Gly        3 meas1       8     5
#  6 2022-01-01 Gly        3 meas2     -16     4
#  7 2022-01-01 Asp        1 meas1       6     7
#  8 2022-01-01 Asp        1 meas2      28    -5
#  9 2022-01-01 Asp        2 meas1       4     6
# 10 2022-01-01 Asp        2 meas2       9     9
# # … with 26 more rows
# # ℹ Use `print(n = ...)` to see more rows
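To connect this back to the question, here is a minimal sketch that rebuilds the example data, applies the run_id join, and passes the 36-row result to the same group_modify() / broom::tidy() step the question used. The set.seed() call is an assumption added only for reproducibility, and the results name is made up for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Example data from the question
df <- data.frame(Date = rep(c("2022-01-01", "2022-01-02"), each = 18),
                 Assay = rep(c("Gly", "Asp", "Con"), each = 3, times = 4),
                 Timing = c(rep("Start", 9), rep("End", 9)),
                 meas1 = round(rnorm(36, 5, 3), 0),
                 meas2 = round(rnorm(36, 8, 9), 0))

# Number the runs within each Date/Assay/Timing group, then go long
pivot2 <- df %>%
  group_by(Date, Assay, Timing) %>%
  mutate(run_id = row_number()) %>%
  ungroup() %>%
  pivot_longer(starts_with("meas"), names_to = "test")

# Self-join Start against End; run_id keeps each row matched to exactly one row
t_tests <- full_join(
  filter(pivot2, Timing == "Start") %>% select(-Timing, initial = value),
  filter(pivot2, Timing == "End") %>% select(-Timing, final = value),
  by = c("Date", "Assay", "run_id", "test")
)

# One Welch t-test per Date/Assay/test group, as in the question
results <- t_tests %>%
  group_by(Date, Assay, test) %>%
  group_modify(~ broom::tidy(t.test(.x$initial, .x$final))) %>%
  ungroup()
```

Each test now runs on three values per sample rather than nine duplicated ones, so the degrees of freedom should come out smaller than in the output shown in the question.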

I use a full_join so that everything is still included even if a Date/Assay/Timing combination has a different number of runs.
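A small sketch of that difference, using hypothetical toy tibbles with three Start runs but only two End runs (start, end, inner, and full are made-up names):

```r
library(dplyr)

# Hypothetical toy data: unequal number of runs per timing
start <- tibble(run_id = 1:3, initial = c(8, 4, -1))
end   <- tibble(run_id = 1:2, final   = c(8, 9))

inner <- inner_join(start, end, by = "run_id")  # run 3 is dropped
full  <- full_join(start, end, by = "run_id")   # run 3 is kept, final is NA

nrow(inner)  # 2
nrow(full)   # 3
```

For an unpaired test, t.test() drops missing values from each sample before testing, so the NA padding should not get in the way of the later t.test(.x$initial, .x$final) call.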
