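I have code that creates the desired output; however, it is very slow. I have two input datasets (metaClustering_perCell, data_clean). Each row index of data_clean corresponds to the index position in metaClustering_perCell. Here are samples of the two datasets.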

dput(head(data_clean[1:5],10))

structure(
  list(
    `NA` = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    EGFP.A = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
    CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358),
    CD45.PE.Vio770.A = c(399, 385, 379, 438, 384, 331, 402, 392, 354, 430),
    CD235a_41a.APC.A = c(412, 618, 239, 562, 661, 193, 363, 385, 408, 265),
    APC.Vio770.A = c(447, 491, 444, 437, 477, 328, 453, 326, 353, 0)
  ),
  row.names = c(NA, -10L),
  class = "data.frame"
)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A	CD235a_41a.APC.A	APC.Vio770.A
1	326	435	399	412	447
2	314	402	385	618	491
3	341	469	379	239	444
4	0	283	438	562	437
5	198	303	384	661	477
6	295	371	331	193	328
7	325	442	402	363	453
8	309	363	392	385	326
9	400	444	354	408	353
10	328	358	430	265	0

dput(head(metaClustering_perCell,10))

c("1 Population", "1 Population", "1 Population", "1 Population", "1 Population",
"1 Population", "1 Population", "1 Population", "1 Population", "9 Population")

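I would like to eventually make a heatmap using the mean values of the markers (EGFP.A, CD43.PE.A, ...); however, my dataset will contain nearly 2e8 cells, which are classified into a predetermined number of populations. The code I wrote is shown here. It creates 2 empty data frames: df_sum stores the running sum of the markers (EGFP.A, CD43.PE.A, ...), while df_count keeps a running tally of the total events in each population. Finally, the code takes the average by dividing the data frame by the vector. The code is here.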

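# distinct population labels, in order of first appearance
pops_header <- unique(metaClustering_perCell)
# assumed here to equal the number of distinct labels
num_clusters <- length(pops_header)

# create an empty data frame for the running sums
df_sum  <- data.frame(matrix(ncol = length(data_clean), nrow = num_clusters))
rownames(df_sum) <- pops_header
colnames(df_sum) <- colnames(data_clean)

# creates empty table for storing the count values
df_count <- data.frame(matrix(ncol = num_clusters, nrow = 1))
colnames(df_count) <- pops_header

# start every sum and count at zero
df_sum[is.na(df_sum)] <- 0
df_count[is.na(df_count)] <- 0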



for (i in 1:length(metaClustering_perCell)){

  # only takes one row at a time of original data
  volt_vals <- data_clean[i,]
  
  # find the column to place it in (population)
  pop <- metaClustering_perCell[i]
  
  # Tally for each population
  df_count[1,pop] <- df_count[1,pop] + 1
  
  # adds to the previous value in the dataframe
  for (a in colnames(volt_vals)){
    df_sum[pop, a] <- volt_vals[a] + df_sum[pop, a]
  }
    
  # creates another dataframe same size as df to overwrite with the averages
  df_aves <- df_sum
  
  
  # Divide each population's sums by its count
  for (n in pops_header){
    df_aves[n,] <- mapply('/', df_sum[n,], df_count[n])
  }
}

The output I get is this (I trimmed it to make it easier to see):

>head(df_sum[1:3],10)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A
1 Population	26062897	35936578	32784372.
9 Population	1045468	1591084	1576716.
2 Population	4374137	8673145	6555053.
8 Population	818413	44836	1318176.
5 Population	217605	443341	439357.
6 Population	1056157	1558711	43206.
7 人口	747037	883763	1134664.
3 Population	1561994	2376586	2329772.
4 Population	54940	9346	137085.
10 Population	172735	213079	8043.

>head(df_count[1:5])

1 Population	9 Population	2 Population	8 Population	5 Population
78909	4262	12982	4447	1392

> head(df_aves[1:3], 10)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A
1 Population	330.2905	455.41799	415.470631
9 Population	245.2999	373.31863	369.947443
2 Population	336.9386	668.09005	504.933986
8 Population	184.0371	10.08230	296.419159
5 Population	156.3254	318.49210	315.630029
6 Population	235.1195	346.99711	9.618433
7 Population	186.1079	220.17015	282.676632
3 Population	256.1906	389.79597	382.117763
4 Population	160.1749	27.24781	399.664723
10 Population	201.5578	248.63361	9.385064

The data frame of averages of each population and their values for each of the column headers(markers) is exactly what I want..... however, it is brutally slow.... and I mean brutal. This is my first week with R (I come knowing self taught python from the stacks), so please explain thoroughly. Thanks for the help.

Answer:

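It is unclear exactly what you are trying to achieve, and the sample data is too sparse to help disambiguate, but here are my two guesses:

Average of Each Marker within Each Population

This interpretation is most consistent with your sample output, where each population (cluster) appears only once, as if the data were aggregated by population.

In R, it is quite simple to group data and then summarize it with an aggregate function.

Solution 1.1: `dplyr`

Here is a solution with the dplyr package, whose syntax is intuitive: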

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and summarize the average of each marker (column).
  summarize(across(everything(), mean))

Solution 1.2: `data.table`

Here is a solution with data.table, which offers better performance.

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Summarize the average of each marker (column), as grouped by cluster.
  lapply(.SD, mean), by = `NA`
]

Result

Given the values of data_clean and metaClustering_perCell as sampled in your question, the first result (1.1) will be a tibble and the second (1.2) a data.table, both containing the following data:

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
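For comparison with the df_sum / df_count approach in the question, here is a minimal base R sketch of the same idea without a loop. It assumes the sample data from the question, cut down to the first two marker columns for brevity; the names `cluster`, `markers`, `sums`, `counts`, and `means` are illustrative only and do not come from the original code. `rowsum()` adds up each column per group, `table()` counts the rows per group, and dividing one by the other gives the per-population averages.

```r
# Sample data from the question (first two marker columns only).
cluster <- c(rep("1 Population", 9), "9 Population")
markers <- data.frame(
  EGFP.A    = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
  CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358)
)

# Per-population column sums (plays the role of df_sum).
sums <- rowsum(markers, group = cluster)

# Per-population event counts (plays the role of df_count).
counts <- table(cluster)

# Divide each row of sums by the count for that same population.
# Indexing counts by rownames(sums) keeps the two in the same order.
means <- sums / as.vector(counts[rownames(sums)])

print(means)
```

Whether this is fast enough for roughly 2e8 rows depends on available memory, so treat it as a starting point rather than a benchmarked recommendation.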

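Running Average at Each Observation

This interpretation is most consistent with your algorithm, which appears to compute its metrics (average, etc.) on a running basis for each observation (row).

R also facilitates cumulative averages, sums, and so on. It is far more efficient to leverage vectorized operations than to recompute these metrics iteratively (with loops, the *apply() family, etc.) for every row.

Solution 2.1: `dplyr`

Conveniently, dplyr already has its own cummean() function.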

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and overwrite each marker (column) with its running average.
  mutate(across(everything(), cummean)) %>% ungroup()

Solution 2.2: `data.table`

With data.table, we can improvise our own (anonymous) function

function(x) {
  cumsum(x) / seq_along(x)
}

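which divides the running sum by the running count to compute the cumulative average along a vector (column). We could also load dplyr and use cummean() in place of our function.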
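As a quick illustration of that function on its own, the sketch below applies it to the first four EGFP.A values from the sample data. The name `running_mean` is just a label chosen here for the anonymous function.

```r
# Running mean: cumulative sum divided by the number of values seen so far.
running_mean <- function(x) {
  cumsum(x) / seq_along(x)
}

# First four EGFP.A values from the sample data in the question.
egfp <- c(326, 314, 341, 0)

result <- running_mean(egfp)
print(result)
# 326.00 320.00 327.00 245.25
```

These values match the first four EGFP.A entries for "1 Population" in the result table of this section.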

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Overwrite each marker (column) with its running average, as grouped by cluster.
  lapply(.SD, function(x)cumsum(x)/seq_along(x)), by = `NA`
]

Result

Given the values of data_clean and metaClustering_perCell as sampled in your question, the first result (2.1) will be a tibble and the second (2.2) a data.table, both containing the following data:

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 326.0000  435.0000         399.0000         412.0000     447.0000
1 Population 320.0000  418.5000         392.0000         515.0000     469.0000
1 Population 327.0000  435.3333         387.6667         423.0000     460.6667
1 Population 245.2500  397.2500         400.2500         457.7500     454.7500
1 Population 235.8000  378.4000         397.0000         498.4000     459.2000
1 Population 245.6667  377.1667         386.0000         447.5000     437.3333
1 Population 257.0000  386.4286         388.2857         435.4286     439.5714
1 Population 263.5000  383.5000         388.7500         429.1250     425.3750
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
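If neither package is available, the same grouped running average can be sketched in base R with `ave()`, which applies a function within each group and returns the results in the original row order. As before, this assumes the sample data from the question with only the first two marker columns, and the names `cluster`, `markers`, `running_mean`, and `running` are illustrative.

```r
# Sample data from the question (first two marker columns only).
cluster <- c(rep("1 Population", 9), "9 Population")
markers <- data.frame(
  EGFP.A    = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
  CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358)
)

# Running mean along a vector.
running_mean <- function(x) cumsum(x) / seq_along(x)

# Apply the running mean to every marker column, separately per cluster.
running <- as.data.frame(
  lapply(markers, function(col) ave(col, cluster, FUN = running_mean))
)

# Put the cluster labels alongside the running averages.
running <- cbind(cluster = cluster, running)
print(running)
```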
