How to subset the top 10 by frequency in a 63,000-row dataset in R

I have an analytical dataset in the following format:

| Compound | Concentration |SampleID|

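There are 200 sample IDs and 9,000 unique compounds, which gives a df of 63,000 rows. (Not every compound is present in every sample.)

What I want to do is take the ten most frequently occurring compounds and create a subset, so that I can plot their concentrations with a boxplot or something similar.

I tried the code below, but it results in an error, and it only filters the top ten rows (so they are all the same compound):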

df %>% 
  arrange(desc(df$Concentration)) %>%
  slice(1:10, preserve=T) %>%
  ggplot(., aes(x=df$Compound,y=df$Concentration)) 
  geom_point() 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
  labs(x="Compound", y="Frequency")

My other idea was:

  arrange(desc(as.data.frame(table(df$Compound)))) %>%
  slice(1:50, preserve=T) %>%
  ggplot(., aes(x=df$Compound,y=df$Concentration)) 
  geom_point() 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
  labs(x="Compound", y="Frequency")

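Neither works. I think I need to make a df containing the list of the top 10, filter my df against it to get dftop, and then subset my df to those elements.

Can anyone help simplify this?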

A uj5u.com user replied:

This data.table approach identifies the top 10 and uses a join:

library(data.table)
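# Count rows per compound, keep the 10 most frequent, then join back to df
# (the column is named Compound in the question)
setDT(df)[df[, .N, by = Compound][order(-N)][1:10], on = "Compound"]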

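The tidyverse equivalent is:

library(dplyr)
# head() takes the first 10 rows; on a plain data frame or tibble, `[1:10]` would index columns
inner_join(df, head(count(df, Compound, sort = TRUE), 10), by = "Compound")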

In an informal benchmark on the 63K-row dataset, I found the data.table approach to be roughly 20 times faster.
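An informal comparison like this could be set up roughly as follows. This is only a sketch: the data is synthetic (the real dataset was not posted), the column names come from the question, and the timings will differ from machine to machine, so the 20x figure should not be expected to reproduce exactly. The two approaches can also break ties at the tenth position differently, so they may not select exactly the same compounds.

```r
library(data.table)
library(dplyr)

# Synthetic stand-in: 63,000 rows, up to 9,000 compounds, 200 samples
set.seed(42)
n <- 63000
df <- data.frame(
  Compound      = sample(paste0("cmpd_", 1:9000), n, replace = TRUE),
  Concentration = runif(n),
  SampleID      = sample(1:200, n, replace = TRUE)
)
dt <- as.data.table(df)

# data.table: count per compound, keep the 10 most frequent, join back
dt_time <- system.time(
  res_dt <- dt[dt[, .N, by = Compound][order(-N)][1:10], on = "Compound"]
)

# dplyr: same idea with count() and inner_join()
dp_time <- system.time(
  res_dp <- inner_join(df, head(count(df, Compound, sort = TRUE), 10),
                       by = "Compound")
)

print(dt_time)
print(dp_time)
```

A single `system.time()` call is noisy for operations this fast; repeating each expression many times gives a more stable comparison.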

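A uj5u.com user replied:

Create a vector of the top 10 compounds, then filter the data frame on that vector. An illustration using the most common homeworlds in dplyr::starwars: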

library(dplyr)

top10 <- starwars %>% 
  count(homeworld, sort = TRUE) %>%
  head(10) %>%
  pull(homeworld)

starwars %>%
  filter(homeworld %in% top10)

# A tibble: 48 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
# … with 38 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
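Applied to the data in the question, the same pattern might look like the sketch below. The data frame here is synthetic, since the real data was not posted; the column names Compound, Concentration and SampleID are taken from the question, dftop is the name the asker suggested, and the boxplot is just one way of plotting the result.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in for the real data: 50 compounds with unequal frequencies
set.seed(1)
df <- data.frame(
  Compound      = sample(paste0("cmpd_", 1:50), 2000, replace = TRUE,
                         prob = (50:1) / sum(50:1)),
  Concentration = rlnorm(2000),
  SampleID      = sample(1:200, 2000, replace = TRUE)
)

# Names of the 10 most frequently occurring compounds
top10 <- df %>%
  count(Compound, sort = TRUE) %>%
  head(10) %>%
  pull(Compound)

# Keep only the rows for those compounds
dftop <- df %>%
  filter(Compound %in% top10)

# One box per compound, showing the distribution of its concentrations
p <- ggplot(dftop, aes(x = Compound, y = Concentration)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Compound", y = "Concentration")
print(p)
```

Note that `head(10)` cuts off at exactly ten rows, so compounds tied with the tenth one are dropped. If ties should be kept, `slice_max(n, n = 10)` is one alternative, since it keeps ties by default.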

