Conditional cumulative sum in R with dplyr: all dates before the current date

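I am looking for a way to compute a cumulative sum in R under the condition that the current date is excluded.

I have the following data frame (a subset and simplified version of the real one):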

df <- structure(list(date_time = structure(c(1609513200, 1609513200, 1609513200,
  1609516800, 1609516800, 1609516800, 1609599600, 1609599600, 1609599600, 
  1609603200, 1609603200, 1609603200), tzone = "UTC", class = c("POSIXct", 
  "POSIXt")), event = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), 
  person = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"), 
  did_attend = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L), 
  events_attended = c(0, 0, 0, 1, 1, 1, 2, 2, 1, 2, 3, 2), 
  events_attended_desired = c(0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 1L, 2L, 2L, 1L)), 
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -12L), groups = structure(list(person = c("A", "B", "C"),
  .rows = structure(list(c(1L, 4L, 7L, 10L), c(2L, 5L, 8L, 11L), 
  c(3L, 6L, 9L, 12L)), ptype = integer(0), 
  class = c("vctrs_list_of", "vctrs_vctr", "list"))), 
  class = c("tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -3L), .drop = TRUE))
 
 df
 ## date_time           event person did_attend events_attended events_attended_desired
 ## 2021-01-01 15:00:00     1 A               1               0                       0
 ## 2021-01-01 15:00:00     1 B               1               0                       0
 ## 2021-01-01 15:00:00     1 C               1               0                       0
 ## 2021-01-01 16:00:00     2 A               1               1                       0
 ## 2021-01-01 16:00:00     2 B               1               1                       0
 ## 2021-01-01 16:00:00     2 C               0               1                       0
 ## 2021-01-02 15:00:00     1 A               0               2                       2
 ## 2021-01-02 15:00:00     1 B               1               2                       2
 ## 2021-01-02 15:00:00     1 C               1               1                       1
 ## 2021-01-02 16:00:00     2 A               1               2                       2
 ## 2021-01-02 16:00:00     2 B               0               3                       2
 ## 2021-01-02 16:00:00     2 C               1               2                       1

The "did_attend" column is a dummy variable indicating whether a person attended the event. The "events_attended" column is produced by:

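library(dplyr)

# Note: the data frame is called `events` here, while the sample data above is `df`.
events <- events %>% 
  arrange(date_time) %>% 
  group_by(person) %>% 
  # running total of attendances, shifted by one row so the current event is excluded
  mutate(events_attended = lag(cumsum(did_attend), default = 0)) %>% 
  ungroup()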

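Now I am looking for a way to exclude events on the current date, so that the cumulative sum only covers dates before the current date (the desired output is in the events_attended_desired column). There are several events per day, and the number of events differs from day to day, so the lagged version does not work. I tried several ifelse() calls inside cumsum(), but they did not work either, because I do not know how to compare dates in an ifelse clause inside cumsum().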
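To illustrate why the lagged version falls short, here is a minimal sketch using only person A's four rows from the data above (the tibble name `a` and the column name `lagged` are made up for this example). The row-wise lag excludes the current event, but it still counts the earlier event on the same day.

```r
library(dplyr)

# Person A's rows from the sample data: two events on each of two days.
a <- tibble(
  date       = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02")),
  did_attend = c(1L, 1L, 0L, 1L)
)

# Shift the running total by one row, as in the code above.
a <- a %>%
  mutate(lagged = lag(cumsum(did_attend), default = 0L))

a$lagged
# 0 1 2 2   (desired: 0 0 2 2)
```

The second row shows 1 because the 15:00 event on the same day is already included, which is exactly what should be left out.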

A uj5u.com user replied:

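Here is an approach using dplyr and lubridate::floor_date.

First, I add a "date" column to the data frame so that I can summarize and join by date.

Then I join this table to a summarized version of itself. count(date, wt = did_attend) is a shortcut for group_by(date) %>% summarize(n = sum(did_attend)); because df is already grouped by person, the counts are per person and date. Taking the cumulative sum of the lagged counts then gives the desired result.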

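library(dplyr)

# Add a day-level column for summarising and joining.
df2 <- df %>%
  mutate(date = lubridate::floor_date(date_time, "day"))

# df2 is still grouped by person, so count() and the lag work per person.
df2 %>%
  left_join(
    df2 %>% 
      count(date, wt = did_attend) %>%                         # attendances per person and day
      mutate(prior_attended = cumsum(lag(n, default = 0))) %>% # total over earlier days only
      select(-n),
    by = c("person", "date")
  )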
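The same summarise-then-join idea can be written with the grouping spelled out, so that it does not depend on the data frame arriving already grouped by person. The following is only a sketch: it rebuilds the question's values as an ungrouped tibble, and the names `toy`, `prior`, and `result` are made up for illustration. It uses `cumsum(n) - n` in place of `cumsum(lag(n))`; both give the total over earlier days only.

```r
library(dplyr)

# Same values as the question's data, rebuilt without the stored grouping.
toy <- tibble(
  date_time = as.POSIXct(
    rep(c("2021-01-01 15:00:00", "2021-01-01 16:00:00",
          "2021-01-02 15:00:00", "2021-01-02 16:00:00"), each = 3),
    tz = "UTC"
  ),
  person     = rep(c("A", "B", "C"), times = 4),
  did_attend = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L)
) %>%
  mutate(date = as.Date(date_time))

# Attendances per person per day, then the running total of earlier days only.
prior <- toy %>%
  group_by(person, date) %>%
  summarise(n = sum(did_attend), .groups = "drop_last") %>%  # still grouped by person
  arrange(date, .by_group = TRUE) %>%
  mutate(prior_attended = cumsum(n) - n) %>%                 # drop the current day's count
  ungroup() %>%
  select(-n)

# Attach the per-day totals back to every event row.
result <- left_join(toy, prior, by = c("person", "date"))
result$prior_attended
# 0 0 0 0 0 0 2 2 1 2 2 1
```

The output matches the events_attended_desired column from the question.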

A uj5u.com user replied:

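Multiply each did_attend value by 1 if it belongs to an earlier date and by 0 otherwise, then sum. Because df is grouped by person, the sum is taken per person.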

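library(dplyr)

# df is grouped by person, so date_time and did_attend below are per-person vectors.
df %>% 
  mutate(events_attended = sapply(as.Date(date_time), 
     # (date < x) is TRUE/FALSE, i.e. 1/0, so only earlier days contribute to the sum
     function(x) sum((as.Date(date_time) < x) * did_attend))) %>%
  arrange(date_time) %>%
  ungroup()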
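The masking idea can also be pulled out into a small standalone function, which makes it easier to see on a single person's data. This is only a sketch; `sum_before` is a hypothetical helper name, not part of dplyr or base R, and the input vectors are person A's rows from the question.

```r
# Hypothetical helper: for each element, sum `value` over all elements
# whose date is strictly earlier than that element's date.
sum_before <- function(date, value) {
  vapply(
    seq_along(date),
    function(i) sum(value[date < date[i]]),
    numeric(1)
  )
}

# Person A in the question's data: two events on each of two days.
dates  <- as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"))
attend <- c(1, 1, 0, 1)

sum_before(dates, attend)
# 0 0 2 2
```

Inside a pipeline it could be applied per person with something like group_by(person) %>% mutate(events_attended = sum_before(as.Date(date_time), did_attend)). Note that this compares every row with every other row in its group, so the work grows quadratically with group size; for large groups the summarise-and-join approach is likely the better fit.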

Tags: r dplyr
