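I have a data frame with topics (categories) and years, with exactly one topic per year. I want to collapse each run of consecutive years that share a category into a single time span: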
cat <- c("Cat1","Cat1","Cat2","Cat2","Cat2","Cat3","Cat2","Cat2","Cat2")
year <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
df <- data.frame(Cat=cat, Year=year)
# Which looks like the following:
# Cat1 2010
# Cat1 2011
# Cat2 2012
# Cat2 2013
# Cat2 2014
# Cat3 2015
# Cat2 2016
# Cat2 2017
# Cat2 2018
The output I want is a data frame like this:
cat <- c("Cat1","Cat2","Cat3","Cat2")
year <- c(2010,2012,2015,2016)
e_year <- c(2011,2014,2015,2018)
df_goal <- data.frame(Cat=cat, Year=year, EYear = e_year)
# Cat Year EYear
# Cat1 2010 2011
# Cat2 2012 2014
# Cat3 2015 2015
# Cat2 2016 2018
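I thought about doing this with a loop, but I don't think that is the right way to do it in R, so I wanted to ask before spending time on that solution.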
A uj5u.com user replied:
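This is closely related to two existing questions, calculating the mean by group (summarizing by group) and getting the first and last values of a group using rle, though it differs slightly from each. The key point is to group rows by runs of identical consecutive Cat values rather than by Cat alone, so that the two separate Cat2 runs stay separate.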
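As a minimal sketch of that grouping idea, the snippet below builds a run id in base R from the question's category vector. The names `cats`, `run_id` and `run_id2` are made up for illustration and do not come from the answers.

```r
# Category column from the question (named `cats` here to avoid masking base::cat)
cats <- c("Cat1","Cat1","Cat2","Cat2","Cat2","Cat3","Cat2","Cat2","Cat2")

# Run-length encoding: one entry per run of identical consecutive values
r <- rle(cats)
r$values   # category of each run
r$lengths  # number of rows in each run

# Expand the runs back to one id per row
run_id <- rep(seq_along(r$lengths), times = r$lengths)

# Equivalent formulation: start a new id whenever the value differs
# from the value on the previous row
run_id2 <- cumsum(c(TRUE, cats[-1] != cats[-length(cats)]))

stopifnot(identical(run_id, run_id2))
run_id
```

Grouping by `run_id` instead of by `cats` is what keeps the Cat2 run of 2012 to 2014 apart from the Cat2 run of 2016 to 2018.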
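base R
# grp is not defined in the snippet as posted; a run id built with rle() is assumed here
df$grp <- with(rle(as.character(df$Cat)), rep(seq_along(lengths), times = lengths))
# range() returns c(min, max), so the Year column of the result is a two-column matrix
out <- aggregate(Year ~ Cat + grp, data = df, FUN = range)
# drop grp and flatten the matrix column into two ordinary columns
out <- do.call(cbind.data.frame, out[,-2])
names(out)[2:3] <- c("Year", "EYear")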
out
# Cat Year EYear
# 1 Cat1 2010 2011
# 2 Cat2 2012 2014
# 3 Cat3 2015 2015
# 4 Cat2 2016 2018
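An alternative base R sketch that skips `aggregate()` and reads the start and end rows of each run directly from `rle()`. It assumes the rows are already sorted by Year and that the years have no gaps, as in the question's data. The names `start_idx`, `end_idx` and `spans` are illustrative only.

```r
# Example data from the question
df <- data.frame(
  Cat  = c("Cat1","Cat1","Cat2","Cat2","Cat2","Cat3","Cat2","Cat2","Cat2"),
  Year = 2010:2018
)

# Runs of identical consecutive categories
r <- rle(as.character(df$Cat))

end_idx   <- cumsum(r$lengths)         # row index where each run ends
start_idx <- end_idx - r$lengths + 1   # row index where each run starts

# One row per run: category, first year, last year
spans <- data.frame(
  Cat   = r$values,
  Year  = df$Year[start_idx],
  EYear = df$Year[end_idx]
)
spans
```

Because the data is sorted by Year, the first and last row of a run hold its minimum and maximum year, so no per-group summary function is needed.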
dplyr
library(dplyr)
df %>%
group_by(grp = cumsum(Cat != lag(Cat, default = ""))) %>%
summarize(Cat = Cat[1], EYear = max(Year), Year = min(Year)) %>%
ungroup() %>%
select(-grp)
# # A tibble: 4 x 3
# Cat EYear Year
# <chr> <dbl> <dbl>
# 1 Cat1 2011 2010
# 2 Cat2 2014 2012
# 3 Cat3 2015 2015
# 4 Cat2 2018 2016
data.table
library(data.table)
as.data.table(df)[, .(Cat = Cat[1], EYear = max(Year), Year = min(Year)), by = .(grp = rleid(Cat))
  ][, grp := NULL][]  # trailing [] prints the result, since := returns invisibly
# Cat EYear Year
# <char> <num> <num>
# 1: Cat1 2011 2010
# 2: Cat2 2014 2012
# 3: Cat3 2015 2015
# 4: Cat2 2018 2016
A uj5u.com user replied:
With dplyr:
library(dplyr)
# N is a run id: it increases each time Cat differs from the previous row
df %>% group_by(Cat, N=cumsum(Cat != lag(Cat, default=""))) %>%
  summarize(SYear=min(Year), EYear=max(Year)) %>%
  arrange(N) %>% select(-N)
Output:
Cat SYear EYear
<chr> <dbl> <dbl>
1 Cat1 2010 2011
2 Cat2 2012 2014
3 Cat3 2015 2015
4 Cat2 2016 2018
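The approaches above break a span only when the category changes, which is enough for the question's data because every year is present. If some years could be missing, a hedged base R sketch might also start a new span whenever the year is not exactly one more than the previous one. The data frame `df_gap` is hypothetical and not from the question, and the other names are illustrative.

```r
# Hypothetical data: 2013 is missing, so Cat2 should yield two separate spans
df_gap <- data.frame(
  Cat  = c("Cat1", "Cat1", "Cat2", "Cat2", "Cat3"),
  Year = c(2010, 2011, 2012, 2014, 2015)
)
df_gap <- df_gap[order(df_gap$Year), ]   # the logic below needs rows sorted by year
n <- nrow(df_gap)

# A new span starts on the first row, when the category changes,
# or when the year is not exactly the previous year + 1
new_span <- c(TRUE,
              df_gap$Cat[-1] != df_gap$Cat[-n] |
              diff(df_gap$Year) != 1)

starts <- which(new_span)         # first row of each span
ends   <- c(starts[-1] - 1, n)    # last row of each span

spans_gap <- data.frame(
  Cat   = df_gap$Cat[starts],
  Year  = df_gap$Year[starts],
  EYear = df_gap$Year[ends]
)
spans_gap
```

On the question's original data the year condition never triggers, so this reduces to the same grouping as the answers above.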
