我正在尋找一個 R 包來將連續的日期分組到句點中。此外,列必須按 FID、PID 和 SETTING 分組:
# Input data
input <- read.csv(text=
"FID,PID,SETTING,DATE
00001, 100001, ST, 2021-01-01
00001, 100001, ST, 2021-01-02
00001, 100001, ST, 2021-01-03
00001, 100002, AB, 2021-01-04
00001, 100001, ST, 2021-01-11
00001, 100001, ST, 2021-01-12
00002, 200001, AB, 2021-01-02
00002, 200001, AB, 2021-01-03
00002, 200001, AB, 2021-01-04
00002, 200002, TK, 2021-01-05"
)
# Expected output
output <- read.csv(text="
FID,PID,SETTING,START,END
00001, 100001, ST, 2021-01-01, 2021-01-03
00001, 100002, AB, 2021-01-04, 2021-01-04
00001, 100001, ST, 2021-01-11, 2021-01-12
00002, 200001, AB, 2021-01-02, 2021-01-04
00002, 200002, TK, 2021-01-05, 2021-01-05"
)
我必須對大約 700'000 行進行分組。因此,該解決方案應盡可能提高性能。
uj5u.com熱心網友回復:
基數R
input <- input[order(input$DATE),]
input$grp <- ave(as.integer(input$DATE), input[-4], FUN = function(z) cumsum(c(TRUE, diff(z) > 1)))
input
# FID PID SETTING DATE grp
# 1 1 100001 ST 2021-01-01 1
# 2 1 100001 ST 2021-01-02 1
# 7 2 200001 AB 2021-01-02 1
# 3 1 100001 ST 2021-01-03 1
# 8 2 200001 AB 2021-01-03 1
# 4 1 100002 AB 2021-01-04 1
# 9 2 200001 AB 2021-01-04 1
# 10 2 200002 TK 2021-01-05 1
# 5 1 100001 ST 2021-01-11 1
# 6 1 100001 ST 2021-01-12 1
out <- aggregate(DATE ~ FID PID SETTING grp, data = input,
FUN = function(z) setNames(range(z), c("START","END")))
out <- do.call(data.frame, out)
out[,5:6] <- lapply(out[,5:6], as.Date, origin = "1970-01-01")
out
# FID PID SETTING grp DATE.START DATE.END
# 1 1 100002 AB 1 2021-01-04 2021-01-04
# 2 2 200001 AB 1 2021-01-02 2021-01-04
# 3 1 100001 ST 1 2021-01-01 2021-01-03
# 4 2 200002 TK 1 2021-01-05 2021-01-05
# 5 1 100001 ST 2 2021-01-11 2021-01-12
演練:
- 的方便
cumsum和diff在完成假定日期總是有序的; 其他分組變數可能被錯誤排序并不重要(這里); ave(..)分配一組非連續(差異超過 1)日期,我們在下一步中使用;aggregate使用您的三個變數加上我們新的grp分組變數計算每個組內的范圍;z匿名函式中的每一個都是一個連續的日期向量,所以range給了我們開始/結束日期;- 不幸的是,聚合將矩陣分配為第五列而不是兩個單獨的列,因此
do.call(data.frame, out)解決了這個問題; - 不幸的是,大多數基礎 R 聚合函式傾向于從向量中剝離
Date(andPOSIXt) 類,因此我們需要使用它as.Date來修復它。
dplyr
library(dplyr)
input %>%
arrange(DATE) %>%
group_by(FID, PID, SETTING) %>%
mutate(grp = cumsum(c(TRUE, diff(DATE) > 1))) %>%
group_by(FID, PID, SETTING, grp) %>%
summarize(START = min(DATE), END = max(DATE)) %>%
ungroup()
# # A tibble: 5 x 6
# FID PID SETTING grp START END
# <int> <int> <chr> <int> <date> <date>
# 1 1 100001 " ST" 1 2021-01-01 2021-01-03
# 2 1 100001 " ST" 2 2021-01-11 2021-01-12
# 3 1 100002 " AB" 1 2021-01-04 2021-01-04
# 4 2 200001 " AB" 1 2021-01-02 2021-01-04
# 5 2 200002 " TK" 1 2021-01-05 2021-01-05
資料表
library(data.table)
inputDT <- as.data.table(input)
setorder(inputDT, DATE)
inputDT[, grp := cumsum(c(TRUE, diff(DATE) > 1)), by = .(FID, PID, SETTING)
][, .(START = min(DATE), END = max(DATE)), by = .(FID, PID, SETTING, grp)
][]
# FID PID SETTING grp START END
# <int> <int> <char> <int> <Date> <Date>
# 1: 1 100001 ST 1 2021-01-01 2021-01-03
# 2: 2 200001 AB 1 2021-01-02 2021-01-04
# 3: 1 100002 AB 1 2021-01-04 2021-01-04
# 4: 2 200002 TK 1 2021-01-05 2021-01-05
# 5: 1 100001 ST 2 2021-01-11 2021-01-12
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/331443.html
