I am trying to split values by date in R. Here is a dummy version of my data:
df = data.frame(ID    = c('A','B','C'),
                year  = c('2019','2019','2020'),
                start = c('201850','201940','201850'),
                end   = c('201903','202002','202110'),
                value = c(45,14,117))
ID year start end value
A 2019 201850 201903 45
B 2019 201940 202002 14
C 2020 201850 202110 117
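What I need in the output is the value split at the year-week level, only for the years from 2019 onward, assuming the value is spread evenly across the weeks. For example, the first row of the dummy data (A) has only 3 weeks in 2019. Since A's value covers 5 weeks (201850 to 201903), the 3 weeks in 2019 equal 27 (45 × 3 / 5).
The expected output is: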
ID year start end value
A 2019 201901 201903 27
B 2019 201940 201952 12
B 2020 202001 202002 2
C 2019 201901 201952 52
C 2020 202001 202053 53
C 2021 202101 202110 10
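As a quick check of the proportional split described in the question, here is a minimal base R sketch. The helper name `share_for_year` is hypothetical, and the week counts are taken from the question's own convention (201850 to 201903 counted as 5 weeks, 2 of them in 2018 and 3 in 2019), not from any calendar library.

```r
# Hypothetical helper: portion of `value` that falls in one year,
# given the weeks in that year and the weeks in the whole span.
share_for_year <- function(value, weeks_in_year, weeks_total) {
  value * weeks_in_year / weeks_total
}

# Row A: 5 weeks in total, 3 of them in 2019
share_a_2019 <- share_for_year(45, 3, 5)

# Row B: 14 weeks in total, 12 in 2019 and 2 in 2020
share_b <- share_for_year(14, c(12, 2), 14)

print(share_a_2019)
print(share_b)
```

This reproduces the values 27 for A and 12 and 2 for B shown in the expected output.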
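Answer from a uj5u.com user:
We can write a function and use it with dplyr::summarise. It is quite verbose, but it works. You still need to decide how the output should be rounded. One caveat: the function below assumes that every year has 52 weeks, so it does not produce exact values for years that have 53 weeks.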
df = data.frame(ID    = c('A','B','C'),
                year  = c('2019','2019','2020'),
                start = c('201850','201940','201850'),
                end   = c('201903','202002','202110'),
                value = c(45,14,117))
library(tidyverse)
library(lubridate)
# Split one row's value across the calendar years covered by its week range.
split_weeks <- function(start, end, value) {
  # Week codes are "YYYYWW": the first four digits are the year, the last two the week
  start_year <- as.numeric(str_extract(start, "^[0-9]{4}"))
  start_week <- as.numeric(str_extract(start, "[0-9]{2}$"))
  end_year <- as.numeric(str_extract(end, "^[0-9]{4}"))
  end_week <- as.numeric(str_extract(end, "[0-9]{2}$"))
  seq_year <- seq(start_year, end_year)
  ln_out <- length(seq_year)
  out_start <- vector("integer", length = ln_out)
  out_end <- vector("integer", length = ln_out)
  out_weight <- vector("integer", length = ln_out)
  for (i in seq_len(ln_out)) {
    if (i == 1) {
      # First year: the start week itself is not counted in the weight
      out_start[i] <- start_week
      out_end[i] <- if (ln_out > 1) 52L else end_week
      out_weight[i] <- out_end[i] - start_week
    } else if (i == ln_out) {
      # Last year: week 1 through the end week, inclusive
      out_start[i] <- 1L
      out_end[i] <- end_week
      out_weight[i] <- out_end[i] - out_start[i] + 1L
    } else {
      # Full year in between: assumed to have 52 weeks
      out_start[i] <- 1L
      out_end[i] <- 52L
      out_weight[i] <- out_end[i] - out_start[i] + 1L
    }
  }
  out <- tibble(year = seq_year,
                start = out_start,
                end = out_end,
                weight = out_weight,
                value = value)
  # Distribute the value in proportion to each year's weeks and rebuild the "YYYYWW" codes
  out <- mutate(out,
                value = (value * weight / sum(weight)),
                across(c(start, end), ~paste0(year, str_pad(.x, 2, pad = "0")))
  )
  select(out, -weight)
}
df %>%
  rowwise(ID) %>%
  summarise(split_weeks(start, end, value)) %>%
  filter(year != 2018)
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 x 5
#> # Groups: ID [3]
#> ID year start end value
#> <chr> <int> <chr> <chr> <dbl>
#> 1 A 2019 201901 201903 27
#> 2 B 2019 201940 201952 12
#> 3 B 2020 202001 202002 2
#> 4 C 2019 201901 201952 52.4
#> 5 C 2020 202001 202052 52.4
#> 6 C 2021 202101 202110 10.1
Created on 2021-12-27 by the reprex package (v0.3.0)
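The 52-week caveat mentioned in the answer could be relaxed by looking up the number of weeks per year instead of hard-coding it. The sketch below assumes the week codes follow ISO 8601 week numbering, which the question does not state explicitly (the `202053` in the expected output is at least consistent with it). The helper name `weeks_in_year` is hypothetical and is not part of the original answer.

```r
library(lubridate)

# Hypothetical helper: number of ISO 8601 weeks (52 or 53) in each year.
# 28 December always falls in the last ISO week of its year,
# so its ISO week number equals the number of weeks in that year.
weeks_in_year <- function(year) {
  isoweek(as.Date(paste0(year, "-12-28")))
}

weeks_2018_2021 <- weeks_in_year(2018:2021)
print(weeks_2018_2021)
```

Under this assumption 2020 comes out as a 53-week year, so one option would be to replace the hard-coded `52L` in `split_weeks` with `weeks_in_year(seq_year[i])`. Whether that matches the data depends on how the week codes were generated.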
