Summary count by month from start and stop range variables with dplyr?

Suppose I have school enrollment data stored in this format, with start date and end date fields:

unique_name	enrollment_start	enrollment_end
Amy	1, January, 2017	30, September, 2018
Franklin	1, January, 2017	19, February, 2017
Franklin	5, June, 2017	4, February, 2018
Franklin	21, October, 2018	9, March, 2019
Samir	1, June, 2017	4, February, 2017
Samir	5, April, 2017	12, September, 2018
...	...	...

I would like to generate total enrollment counts by month, like this:

month	enrollment_count
January 2017	25
February 2017	31
March 2017	19
April 2017	34
May 2017	29
June 2017	32
...	...

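Is there a simple way to accomplish this with dplyr?

The only way I can think of is to loop over the list of all months from month_min to month_max and count, for each month, the number of rows whose start or stop dates fall within it. I am hoping for simpler code.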
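
For reference, the loop described above might look like the following base R sketch. It assumes the sample rows from the table (with the dates already stored as Date values and the row whose start is later than its end left out), and it treats a row as enrolled in a month when its interval overlaps that month, which is one interpretation of the requirement. The name counts_by_month is only illustrative.

```r
# Sample rows from the question; dates are assumed to be parsed already.
enrollments <- data.frame(
  unique_name = c("Amy", "Franklin", "Franklin", "Franklin", "Samir"),
  enrollment_start = as.Date(c("2017-01-01", "2017-01-01", "2017-06-05",
                               "2018-10-21", "2017-04-05")),
  enrollment_end = as.Date(c("2018-09-30", "2017-02-19", "2018-02-04",
                             "2019-03-09", "2018-09-12"))
)

# First day of the earliest and latest month covered by the data.
month_min <- as.Date(format(min(enrollments$enrollment_start), "%Y-%m-01"))
month_max <- as.Date(format(max(enrollments$enrollment_end), "%Y-%m-01"))
months <- seq(month_min, month_max, by = "month")

# Loop over the months and count rows whose interval overlaps each one.
counts <- integer(length(months))
for (i in seq_along(months)) {
  month_start <- months[i]
  next_month <- seq(month_start, by = "month", length.out = 2)[2]
  counts[i] <- sum(enrollments$enrollment_start < next_month &
                   enrollments$enrollment_end >= month_start)
}

counts_by_month <- data.frame(month = months, enrollment_count = counts)
head(counts_by_month)
```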

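Answer from a uj5u.com user:

Create a list column containing the sequence of months between each pair of dates, then unnest and count.

Notes:

I use lubridate::floor_date() to round enrollment_start down to the first day of the month. Otherwise, seq() may skip months if enrollment_start falls on the 29th of the month or later.
The fifth row of your example data has enrollment_start later than enrollment_end. I assumed this was an error and removed it.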
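
The month-skipping issue can be seen in a small base R sketch. The dates here are made up for illustration, and as.Date(format(x, "%Y-%m-01")) stands in for lubridate::floor_date() only so that the snippet has no package dependencies.

```r
# Hypothetical enrollment that starts on the 31st of a month.
start <- as.Date("2017-01-31")
end   <- as.Date("2017-04-30")

# Stepping by month from the 31st: "February 31" does not exist, so that
# step is counted forward into March and February never appears.
unfloored <- seq(start, end, by = "month")
unique(format(unfloored, "%Y-%m"))

# Rounding the start down to the first of the month gives one date per month.
floored <- seq(as.Date(format(start, "%Y-%m-01")), end, by = "month")
format(floored, "%Y-%m")
```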

library(tidyverse)
library(lubridate)

enrollments %>% 
  mutate(
    across(c(enrollment_start, enrollment_end), dmy),  # convert to date
    month = map2(
      floor_date(enrollment_start, unit = "month"),    # round to 1st day
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>% 
  unnest_longer(month) %>% 
  count(month, name = "enrollment_count")

#> # A tibble: 27 x 2
#>    month      enrollment_count
#>    <date>                <int>
#>  1 2017-01-01                2
#>  2 2017-02-01                2
#>  3 2017-03-01                1
#>  4 2017-04-01                2
#>  5 2017-05-01                2
#>  6 2017-06-01                3
#>  7 2017-07-01                3
#>  8 2017-08-01                3
#>  9 2017-09-01                3
#> 10 2017-10-01                3
#> # ... with 17 more rows

Created on 2022-03-25 by the reprex package (v2.0.1)
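
To try this pipeline end to end, here is a minimal self-contained sketch. It assumes the sample rows from the question with the invalid fifth row dropped, and it builds the date columns as Date values directly, so the dmy() parsing step is not needed. The name monthly_counts is only illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

# Sample rows from the question; dates are assumed to be parsed already.
enrollments <- tibble(
  unique_name = c("Amy", "Franklin", "Franklin", "Franklin", "Samir"),
  enrollment_start = as.Date(c("2017-01-01", "2017-01-01", "2017-06-05",
                               "2018-10-21", "2017-04-05")),
  enrollment_end = as.Date(c("2018-09-30", "2017-02-19", "2018-02-04",
                             "2019-03-09", "2018-09-12"))
)

monthly_counts <- enrollments %>%
  mutate(
    # one sequence of month starts per row, stored in a list column
    month = map2(
      floor_date(enrollment_start, unit = "month"),
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>%
  unnest_longer(month) %>%                 # one row per enrollment-month
  count(month, name = "enrollment_count")  # rows per month

monthly_counts
```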

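Answer from a uj5u.com user:

Here is my take with dplyr and tidyr.

Pivot the data to create multiple rows per student and format your dates.
Group by student and fill in the missing months using complete.
Group by the resulting periods and count.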

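library(dplyr)
library(tidyr)

monthly <- data %>%
  pivot_longer(cols = c('enrollment_start', 'enrollment_end')) %>%
  # the format string assumes dates stored like "1, January, 2017"
  mutate(value = as.Date(value, format = "%d, %B, %Y")) %>%
  mutate(value = lubridate::floor_date(value, 'month')) %>%

#   unique_name name             value     
#   <chr>       <chr>            <date>    
# 1 Amy         enrollment_start 2017-01-01
# 2 Amy         enrollment_end   2018-09-01
# 3 Franklin    enrollment_start 2017-01-01
# 4 Franklin    enrollment_end   2017-02-01
#   ..etc.

  group_by(unique_name) %>%
  # fills every month between each student's earliest and latest date,
  # including any gap between separate enrollment periods
  complete(value = seq.Date(min(value), max(value), by = "month")) %>%
  arrange(unique_name, value)

# count rows per month on the completed data, not on the original input
enrollment_count <- monthly %>%
  group_by(value) %>%
  count()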

Edit: I forgot to floor the dates so that each period is aggregated correctly at the end. Added floor_date from lubridate to do this.
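
One thing to watch with grouping by student alone: when a student has several separate enrollment periods, complete() also fills the months between them. A possible variation, sketched below under the assumption that each input row is one enrollment period, adds a row id before pivoting and groups by it as well. The names spells, spell_id, and result are hypothetical, and the dates are assumed to be Date values already.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Two separate enrollment periods for the same student (from the sample data).
spells <- tibble(
  unique_name = c("Franklin", "Franklin"),
  enrollment_start = as.Date(c("2017-01-01", "2017-06-05")),
  enrollment_end = as.Date(c("2017-02-19", "2018-02-04"))
)

result <- spells %>%
  mutate(spell_id = row_number()) %>%            # one id per enrollment period
  pivot_longer(c(enrollment_start, enrollment_end)) %>%
  mutate(value = floor_date(value, "month")) %>%
  group_by(unique_name, spell_id) %>%            # complete within each period
  complete(value = seq(min(value), max(value), by = "month")) %>%
  ungroup() %>%
  distinct(unique_name, value) %>%               # one row per student-month
  count(value, name = "enrollment_count")

result
```

With this grouping, March to May 2017 do not appear for Franklin, because neither enrollment period covers those months.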


Tags: r date dplyr
