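I have code that produces the output I need; however, it is very slow. I have two input datasets (metaClustering_perCell, data_clean). Each row index of data_clean corresponds to the same index position in metaClustering_perCell. Here is a sample of the two datasets.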
dput(head(data_clean[1:5],10))
structure(
list(
`NA` = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
EGFP.A = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358),
CD45.PE.Vio770.A = c(399, 385, 379, 438, 384, 331, 402, 392, 354, 430),
CD235a_41a.APC.A = c(412, 618, 239, 562, 661, 193, 363, 385, 408, 265),
APC.Vio770.A = c(447, 491, 444, 437, 477, 328, 453, 326, 353, 0)
),
row.names = c(NA, -10L),
class = "data.frame"
)
| NA | EGFP.A | CD43.PE.A | CD45.PE.Vio770.A | CD235a_41a.APC.A | APC.Vio770.A |
|---|---|---|---|---|---|
| 1 | 326 | 435 | 399 | 412 | 447 |
| 2 | 314 | 402 | 385 | 618 | 491 |
| 3 | 341 | 469 | 379 | 239 | 444 |
| 4 | 0 | 283 | 438 | 562 | 437 |
| 5 | 198 | 303 | 384 | 661 | 477 |
| 6 | 295 | 371 | 331 | 193 | 328 |
| 7 | 325 | 442 | 402 | 363 | 453 |
| 8 | 309 | 363 | 392 | 385 | 326 |
| 9 | 400 | 444 | 354 | 408 | 353 |
| 10 | 328 | 358 | 430 | 265 | 0 |
dput(head(metaClustering_perCell,10))
c("1 Population", "1 Population", "1 Population", "1 Population", "1 Population",
"1 Population", "1 Population", "1 Population", "1 Population", "9 Population")
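I ultimately want to make a heat map from the averages of the markers (EGFP.A, CD43.PE.A, ...); however, my dataset will contain nearly 2e8 cells, which are classified into a predetermined number of populations. The code I wrote is shown here. It creates 2 empty data frames: df_sum stores a running sum of the markers (EGFP.A, CD43.PE.A, ...), while df_count keeps a running tally of the total events in each population. Finally, the code takes the average by dividing the data frame by the vector. The code is here.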
# create an empty matrix
df_sum <- data.frame(matrix(ncol = length(data_clean), nrow = num_clusters))
pops_header <- unique(metaClustering_perCell)
rownames(df_sum) <- pops_header
colnames(df_sum) <- colnames(data_clean)
# creates empty table for storing the count values
df_count <- data.frame(matrix(ncol = num_clusters, nrow = 1))
colnames(df_count) <- pops_header
df_sum[is.na(df_sum)] <- 0
df_count[is.na(df_count)] <- 0
for (i in 1:length(metaClustering_perCell)){
# only takes one row at a time of original data
volt_vals <- data_clean[i,]
# find the column to place it in (population)
pop <- metaClustering_perCell[i]
# Tally for each population
df_count[1,pop] <- df_count[1,pop] + 1
# adds to the previous value in the dataframe
for (a in colnames(volt_vals)){
df_sum[pop, a] <- volt_vals[a] + df_sum[pop, a]
}
# creates another dataframe same size as df to overwrite with the averages
df_aves <- df_sum
# Divide each population's sums by its event count
for (n in pops_header){
df_aves[n,] <- mapply('/', df_sum[n,], df_count[n])
}
}
The output I get is this (I trimmed it to make it easier to read):
>head(df_sum[1:3],10)
| NA | EGFP.A | CD43.PE.A | CD45.PE.Vio770.A |
|---|---|---|---|
| 1 Population | 26062897 | 35936578 | 32784372. |
| 9 Population | 1045468 | 1591084 | 1576716. |
| 2 Population | 4374137 | 8673145 | 6555053. |
| 8 Population | 818413 | 44836 | 1318176. |
| 5 Population | 217605 | 443341 | 439357. |
| 6 Population | 1056157 | 1558711 | 43206. |
| 7 Population | 747037 | 883763 | 1134664. |
| 3 Population | 1561994 | 2376586 | 2329772. |
| 4 Population | 54940 | 9346 | 137085. |
| 10 Population | 172735 | 213079 | 8043. |
>head(df_count[1:5])
| 1 Population | 9 Population | 2 Population | 8 Population | 5 Population |
|---|---|---|---|---|
| 78909 | 4262 | 12982 | 4447 | 1392 |
> head(df_aves[1:3], 10)
| NA | EGFP.A | CD43.PE.A | CD45.PE.Vio770.A |
|---|---|---|---|
| 1 Population | 330.2905 | 455.41799 | 415.470631 |
| 9 Population | 245.2999 | 373.31863 | 369.947443 |
| 2 Population | 336.9386 | 668.09005 | 504.933986 |
| 8 Population | 184.0371 | 10.08230 | 296.419159 |
| 5 Population | 156.3254 | 318.49210 | 315.630029 |
| 6 Population | 235.1195 | 346.99711 | 9.618433 |
| 7 Population | 186.1079 | 220.17015 | 282.676632 |
| 3 Population | 256.1906 | 389.79597 | 382.117763 |
| 4 Population | 160.1749 | 27.24781 | 399.664723 |
| 10 Population | 201.5578 | 248.63361 | 9.385064 |
The data frame of averages of each population and their values for each of the column headers(markers) is exactly what I want..... however, it is brutally slow.... and I mean brutal. This is my first week with R (I come knowing self taught python from the stacks), so please explain thoroughly. Thanks for the help.
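Answer from a uj5u.com user:
It is not clear exactly what you are trying to achieve, and the sample data is too sparse to help disambiguate, but here are my two guesses:
Average of each marker within each population
This interpretation is most consistent with your sample output, where each population (cluster) appears only once, as if the data were aggregated by population.
In R, it is straightforward to group data and then summarize it with an aggregate function.
Solution 1.1: dplyr
Here is a solution with the dplyr package, whose syntax is intuitive: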
library(dplyr)
data_clean %>%
# Overwrite the 'NA' column with the cluster labels.
mutate(`NA` = metaClustering_perCell) %>%
# Group by cluster labels...
group_by(`NA`) %>%
# ...and summarize the average of each marker (column).
summarize(across(everything(), mean))
Solution 1.2: data.table
Here is a solution with data.table, which offers better performance.
library(data.table)
as.data.table(data_clean)[,
# Overwrite the 'NA' column with the cluster labels.
("NA") := metaClustering_perCell
][,
# Summarize the average of each marker (column), as grouped by cluster.
lapply(.SD, mean), by = `NA`
]
Result
Given data_clean and metaClustering_perCell as sampled in your question, the first result (1.1) will be a tibble and the second (1.2) a data.table, each containing the following data:
NA EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 278.6667 390.2222 384.8889 426.7778 417.3333
9 Population 328.0000 358.0000 430.0000 265.0000 0.0000
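For comparison with the df_sum / df_count approach in the question, here is a minimal base R sketch of the same "sum, count, divide" idea done without a row-by-row loop. It assumes only the marker columns are kept (two of them here, for brevity, copied from the question's sample), and the names pop_sums, pop_counts and pop_means are made up for illustration.

```r
# Toy data copied from the question's dput() output (two marker columns only).
data_clean <- data.frame(
  EGFP.A    = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
  CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358)
)
metaClustering_perCell <- c(rep("1 Population", 9), "9 Population")

# Per-population column sums in one vectorized call (the role of df_sum).
# reorder = FALSE keeps the populations in order of first appearance.
pop_sums <- rowsum(as.matrix(data_clean),
                   group = metaClustering_perCell,
                   reorder = FALSE)

# Events per population (the role of df_count), aligned to the rows of pop_sums.
pop_counts <- table(metaClustering_perCell)[rownames(pop_sums)]

# Dividing a matrix by a vector with one entry per row recycles down the
# columns, so each row is divided by its own population's count.
pop_means <- pop_sums / as.vector(pop_counts)
print(pop_means)
```

The result is a plain numeric matrix with one row per population, which is also the shape base R's heatmap() expects once there are at least two populations and two markers.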
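Running (cumulative) average at each observation
This interpretation is most consistent with your algorithm, which appears to compute its metrics (averages, etc.) on a running basis for each observation (row).
R also makes cumulative averages, sums, and so on easy. It is far more efficient to use vectorized operations than to compute these metrics iteratively (with loops, the *apply() family, etc.) for every row.
Solution 2.1: dplyr
Conveniently, dplyr already has its own cummean() function.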
library(dplyr)
data_clean %>%
# Overwrite the 'NA' column with the cluster labels.
mutate(`NA` = metaClustering_perCell) %>%
# Group by cluster labels...
group_by(`NA`) %>%
# ...and overwrite each marker (column) with its running average.
mutate(across(everything(), cummean)) %>% ungroup()
Solution 2.2: data.table
With data.table, we can improvise our own (anonymous) function
function(x) {
cumsum(x) / seq_along(x)
}
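which divides the running sum by the running count to calculate the cumulative average along a vector (column). We could also load dplyr and use cummean() in place of our function.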
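As a quick check of that helper, the sketch below applies it to a single column: the EGFP.A values of the nine "1 Population" cells from the question's sample. The explicit loop is included only to show that both forms give the same numbers; the variable names are illustrative.

```r
# EGFP.A values of the nine "1 Population" cells from the question's sample.
x <- c(326, 314, 341, 0, 198, 295, 325, 309, 400)

# Vectorized running mean: running sum divided by running count.
running_mean <- cumsum(x) / seq_along(x)

# Equivalent explicit loop, shown only for comparison.
loop_mean <- numeric(length(x))
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
  loop_mean[i] <- total / i
}

print(round(running_mean, 4))
print(isTRUE(all.equal(running_mean, loop_mean)))
```

The vectorized form does the whole column in two calls, which is why it scales far better than updating one cell of a data frame per iteration.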
library(data.table)
as.data.table(data_clean)[,
# Overwrite the 'NA' column with the cluster labels.
("NA") := metaClustering_perCell
][,
# Overwrite each marker (column) with its running average, as grouped by cluster.
lapply(.SD, function(x)cumsum(x)/seq_along(x)), by = `NA`
]
Result
Given data_clean and metaClustering_perCell as sampled in your question, the first result (2.1) will be a tibble and the second (2.2) a data.table, each containing the following data:
NA EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 326.0000 435.0000 399.0000 412.0000 447.0000
1 Population 320.0000 418.5000 392.0000 515.0000 469.0000
1 Population 327.0000 435.3333 387.6667 423.0000 460.6667
1 Population 245.2500 397.2500 400.2500 457.7500 454.7500
1 Population 235.8000 378.4000 397.0000 498.4000 459.2000
1 Population 245.6667 377.1667 386.0000 447.5000 437.3333
1 Population 257.0000 386.4286 388.2857 435.4286 439.5714
1 Population 263.5000 383.5000 388.7500 429.1250 425.3750
1 Population 278.6667 390.2222 384.8889 426.7778 417.3333
9 Population 328.0000 358.0000 430.0000 265.0000 0.0000
