I have a data frame in which one column points to the next record, as in the example data frame below.
library(tibble)

OG_Data <- tibble(
  Record = c("aaaa", "NNNN", "rrrr", "tttt", "pppp", "ssss", "bbbb"),
  NextRecord = c("pppp", "tttt", "bbbb", "N/A" , "NNNN", "rrrr", "N/A")
)
Original data:
# Record NextRecord
# aaaa pppp
# NNNN tttt
# rrrr bbbb
# tttt N/A
# pppp NNNN
# ssss rrrr
# bbbb N/A
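I want to sort this data frame by the predefined sequence given by column B (NextRecord), which points to the next record in column A (Record), and obtain both the order within each sequence and the line group each row belongs to.
Expected output: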
# Record NextRecord Sequence Line
# aaaa pppp 1 1
# pppp NNNN 2 1
# NNNN tttt 3 1
# tttt N/A 4 1
# ssss rrrr 1 2
# rrrr bbbb 2 2
# bbbb N/A 3 2
I was thinking of something like this:
OutputData <- OG_Data[1,] %>% add_row(OG_Data, filter(OG_Data, OG_Data$Record == NextRecord))
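But this doesn't work and isn't scalable. Also, I'm not sure where to begin with finding the start of each line group.
A uj5u.com user replied:
I bet there's a simpler way, but at least it's fun to treat this as a graph problem.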
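library(igraph)
# The code below refers to the question's data as d
d = as.data.frame(OG_Data)
# Directed graph with one edge Record -> NextRecord per row
g = graph_from_data_frame(d)
# From every vertex with no incoming edge (a chain start), find the path to "N/A"
g2 = sapply(V(g)[degree(g, mode = 'in') == 0], function(v) all_simple_paths(g, v, "N/A"))
# Drop the trailing "N/A" vertex and use the remaining vertex ids as row indices of d
d2 = d[unlist(lapply(g2, function(v) head(as.vector(v), -1))),]
d2$Line = rep(seq_along(g2), lengths(g2) - 1)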
Record NextRecord Line
1 aaaa pppp 1
5 pppp NNNN 1
2 NNNN tttt 1
4 tttt N/A 1
6 ssss rrrr 2
3 rrrr bbbb 2
7 bbbb N/A 2
Here, the paths stored in g2 are:
g2
# $aaaa
# 5/8 vertices, named, from b21b8d2:
# [1] aaaa pppp NNNN tttt N/A
# $ssss
# 4/8 vertices, named, from b21b8d2:
# [1] ssss rrrr bbbb N/A
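The result above has the Line column but not the Sequence column the question asks for. A minimal sketch of one way to add it with base R follows. So that the snippet runs without igraph, d2 is rebuilt by hand from the printed output, which is an assumption made purely for illustration.

```r
# d2 rebuilt by hand from the printed result above (illustration only).
d2 <- data.frame(
  Record     = c("aaaa", "pppp", "NNNN", "tttt", "ssss", "rrrr", "bbbb"),
  NextRecord = c("pppp", "NNNN", "tttt", "N/A", "rrrr", "bbbb", "N/A"),
  Line       = c(1, 1, 1, 1, 2, 2, 2)
)

# Number the rows within each Line; the rows are already in path order.
d2$Sequence <- ave(seq_along(d2$Line), d2$Line, FUN = seq_along)
d2
```

Because the rows of d2 already follow each path from start to end, a running count within each Line is enough.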
A uj5u.com user replied:
With cumsum and lag:
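library(dplyr)

# Rows are numbered in their existing order; the NextRecord pointers are not followed.
OG_Data %>%
  mutate(NextRecord = na_if(NextRecord, "N/A"),
         # Start a new Line after each row whose NextRecord is missing
         Line = cumsum(lag(is.na(NextRecord), default = TRUE))) %>%
  group_by(Line) %>%
  mutate(Sequence = row_number())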
Output:
Record NextRecord Line Sequence
<chr> <chr> <int> <int>
1 aaaa pppp 1 1
2 NNNN tttt 1 2
3 rrrr bbbb 1 3
4 tttt NA 1 4
5 pppp NNNN 2 1
6 ssss rrrr 2 2
7 bbbb NA 2 3
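The cumsum idea only yields the expected groups when the rows are already in chain order, which is not the case for the sample data. A rough sketch, in base R only, of reordering the rows by following the pointers first and then applying the same cumsum trick is shown here. The helper names starts, nxt, ord and sorted are made up for this illustration, and the sketch assumes that Record values are unique and that the chains contain no cycles.

```r
OG_Data <- data.frame(
  Record     = c("aaaa", "NNNN", "rrrr", "tttt", "pppp", "ssss", "bbbb"),
  NextRecord = c("pppp", "tttt", "bbbb", "N/A", "NNNN", "rrrr", "N/A")
)

# Chain starts: records that no other row points to.
starts <- which(!OG_Data$Record %in% OG_Data$NextRecord)
# Row index of each row's successor; NA where NextRecord is "N/A".
nxt <- match(OG_Data$NextRecord, OG_Data$Record)

# Walk each chain from its start, collecting row indices in visiting order.
# Assumption: no cycles, otherwise this loop would not terminate.
ord <- integer(0)
for (s in starts) {
  i <- s
  while (!is.na(i)) {
    ord <- c(ord, i)
    i <- nxt[i]
  }
}

sorted <- OG_Data[ord, ]
# A new Line begins right after every row whose NextRecord is "N/A".
ends <- sorted$NextRecord == "N/A"
sorted$Line <- cumsum(c(TRUE, head(ends, -1)))
sorted$Sequence <- ave(seq_along(sorted$Line), sorted$Line, FUN = seq_along)
sorted
```

Growing ord with c() is fine for a small table but gets slow on large ones; preallocating the result vector avoids that.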
A uj5u.com user replied:
A fast and scalable approach:
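library(data.table)

# firstSeq:  row indices where each chain starts
# nextMatch: for each row, the row index of its successor (0L if none)
seqGroups <- function(firstSeq, nextMatch) {
  idxOut <- seqOut <- lineOut <- integer(length(nextMatch))
  irow <- 0L
  for (i in seq_along(firstSeq)) {
    # Start a new line at the i-th chain start
    idxOut[irow <- irow + 1L] <- firstSeq[i]
    seqOut[irow] <- 1L
    lineOut[irow] <- i
    # Follow the pointers until a row has no successor (nextMatch is 0)
    while (nextMatch[idxOut[irow]]) {
      idxOut[irow <- irow + 1L] <- nextMatch[idxOut[irow]]
      seqOut[irow] <- seqOut[irow - 1L] + 1L
      lineOut[irow] <- i
    }
  }
  list(idx = idxOut, seqLine = list(seqOut, lineOut))
}

with(
  with(
    OG_Data,
    # Chain starts are records that never appear in NextRecord
    seqGroups(which(!Record %chin% NextRecord), match(NextRecord, Record, 0L))
  ),
  # Reorder the rows, then add both columns by reference
  setDT(OG_Data)[idx][, c("Sequence", "Line") := seqLine]
)[]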
#> Record NextRecord Sequence Line
#> 1: aaaa pppp 1 1
#> 2: pppp NNNN 2 1
#> 3: NNNN tttt 3 1
#> 4: tttt N/A 4 1
#> 5: ssss rrrr 1 2
#> 6: rrrr bbbb 2 2
#> 7: bbbb N/A 3 2
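To see what seqGroups receives, it can help to compute its two arguments on the sample data. This is a small sketch in base R; it swaps data.table's %chin% for %in% so that no packages are needed, and the variable names firstSeq and nextMatch simply mirror the function's parameter names.

```r
Record     <- c("aaaa", "NNNN", "rrrr", "tttt", "pppp", "ssss", "bbbb")
NextRecord <- c("pppp", "tttt", "bbbb", "N/A", "NNNN", "rrrr", "N/A")

# First argument: row indices of the chain starts, i.e. records that
# never appear in NextRecord.
firstSeq <- which(!Record %in% NextRecord)
firstSeq
# [1] 1 6

# Second argument: row index of each row's successor, 0L where there is none.
nextMatch <- match(NextRecord, Record, nomatch = 0L)
nextMatch
# [1] 5 4 7 0 2 3 0
```

Starting at row 1 and repeatedly looking up nextMatch gives rows 1, 5, 2, 4, which is the first line; starting at row 6 gives rows 6, 3, 7, which is the second.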
Timing on a larger table:
OG_Data <- data.table(
  Record = paste0(rep(c("aaaa", "NNNN", "rrrr", "tttt", "pppp", "ssss", "bbbb"), 1e6), rep(1:1e6, each = 7)),
  NextRecord = paste0(rep(c("pppp", "tttt", "bbbb", "N/A" , "NNNN", "rrrr", "N/A"), 1e6), rep(1:1e6, each = 7))
)
OG_Data$NextRecord[c(seq(4, 7e6, 7), seq(7, 7e6, 7))] <- "N/A"
system.time({
  with(
    with(
      OG_Data,
      seqGroups(which(!Record %chin% NextRecord), match(NextRecord, Record, 0L))
    ),
    OG_Data[idx][, c("Sequence", "Line") := seqLine]
  )
})
#> user system elapsed
#> 1.96 0.10 2.06