Efficiently summarizing data over many different time periods by additional criteria (person and location) in R

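I have pollution measurements by day and by location. I also have a group of people whose pollution exposure I want to measure. Each person has a location and a time period spent at that location.

For each person in my dataset, I need to sum the pollution values at their location over their time period, and count the number of missing pollution measurements in that period.

The tables are structured as follows (people first, then daily exposure measurements):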

  ids start_dates  end_dates  zips
1   1  2000-10-10 2001-02-18 45108
2   2  2000-11-11 2001-04-07 45190
3   3  2000-03-05 2000-06-27 45117
4   4  2001-02-04 2001-06-09 45142
5   5  2000-03-16 2000-07-13 45197
6   6  1999-12-15 2000-04-27 45060

  exposure_day exposure_zip exposure_value
1   1999-06-26        45108             14
2   1999-06-27        45108             27
3   1999-06-28        45108             22
4   1999-06-29        45108              4
5   1999-06-30        45108             26
6   1999-07-01        45108             20

Desired output:

   ids start_dates  end_dates  zips exposure_sum na_count
1:   1  2000-10-10 2001-02-18 45108         3188        5
2:   2  2000-11-11 2001-04-07 45190         3789        1
3:   3  2000-03-05 2000-06-27 45117         2917        3
4:   4  2001-02-04 2001-06-09 45142         2969        2
5:   5  2000-03-16 2000-07-13 45197         2860        3
6:   6  1999-12-15 2000-04-27 45060         3497        2

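My current solution is slow. I would like a more efficient one so that I can run this calculation for about 1,000,000 people.

The code below simulates my data and shows my current solution.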

set.seed(123)
library(lubridate)
library(data.table)

# Make person dataframe
n = 1000  # sample size
ids = c(1:n)
end_dates = sample(seq(as.Date('2000-01-01'), as.Date('2002-01-01'), by="day"), n, replace = T)
time_intervals = sample(seq(100, 200), n, replace = T)
start_dates = end_dates - time_intervals
zips = sample(seq(45000, 45200), n, replace = T)

person_df = data.frame(ids, start_dates, end_dates,  zips)

# Make exposure dataframe
ziplist = unique(zips)
nzips = length(ziplist)
ndays = as.numeric(as.Date(max(person_df$end_dates)) - as.Date(min(person_df$start_dates)) + 1)
exposure_dates = seq(as.Date(min(person_df$start_dates)), as.Date(max(person_df$end_dates)), by = 'day')
exposure_day = rep(exposure_dates, nzips)
exposure_zip = rep(ziplist, each = ndays)
exposure_value = sample(c(NA, 1:50), length(exposure_day), replace = T)

exposure_df = data.frame(exposure_day, exposure_zip, exposure_value)

# convert to datatable
person_dt = data.table(person_df)
exposure_dt = data.table(exposure_df) 

#summarize
summary_dt = person_dt[, ":="(exposure_sum = .(sum(exposure_dt[exposure_day>=start_dates & exposure_day<=end_dates & exposure_zip == zips, exposure_value], na.rm = T)),
                                na_count =    .(sum(is.na(exposure_dt[exposure_day>=start_dates & exposure_day<=end_dates & exposure_zip == zips, exposure_value])))),
                        by = 'ids'][]

Answer:

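Edit: added a variation on @langtang's clever approach, which brings the n=1M case down to 4 seconds with dplyr.

This dplyr approach is about 40x faster at n=1000, 50x faster at n=10k, and 60x faster at n=100k, with the same output. The main gain comes from converting the non-equi join into a left join by expanding person_df so that it has one row for each exposure_day in each ids' range. The up-front step of expanding all the dates can be done in one pass, which makes the subsequent join significantly faster.

At n=1,000,000 it takes about 2 minutes to run, versus what I estimate would be about 2 hours with the original code. If that is not fast enough, I expect further improvement is possible by porting to data.table or collapse.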

library(dplyr)

person_df %>% 
  group_by(ids, exposure_zip = zips) %>%
  # one row per person per day in their range
  # (dplyr >= 1.1.0 recommends reframe() for multi-row results like this)
  summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop") %>%
  left_join(exposure_df, by = c("exposure_zip", "exposure_day")) %>%
  group_by(ids) %>%
  summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
            na_count = sum(is.na(exposure_value))) %>%
  # optional to add start dates end dates, zips columns back
  left_join(person_df, by = "ids")
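
To make the expansion idea concrete, here is a minimal base R sketch for a single person. The data values and the object names `person`, `expanded`, `exposure`, and `joined` are made up for illustration and are not part of the original data; only the column names follow the question.

```r
# One hypothetical person who stays in zip 45108 for three days.
person <- data.frame(ids = 1L, zips = 45108L,
                     start_dates = as.Date("2000-01-01"),
                     end_dates = as.Date("2000-01-03"))

# Hypothetical daily exposure values for that zip (one day is missing).
exposure <- data.frame(exposure_day = as.Date("2000-01-01") + 0:4,
                       exposure_zip = 45108L,
                       exposure_value = c(10, NA, 30, 40, 50))

# Expand the person's range into one row per day.
expanded <- data.frame(
  ids = person$ids,
  exposure_zip = person$zips,
  exposure_day = seq(person$start_dates, person$end_dates, by = "day")
)

# With one row per day, an ordinary left join on (zip, day) is enough.
joined <- merge(expanded, exposure,
                by = c("exposure_zip", "exposure_day"), all.x = TRUE)

exposure_sum <- sum(joined$exposure_value, na.rm = TRUE)
na_count <- sum(is.na(joined$exposure_value))
```

The cost of this approach is the size of the expanded table: one row per person per day, which is why it slows down at n=1M.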

Update: porting to data.table via dtplyr gave a modest improvement on the 1M-row test, to 100 seconds on my machine.

library(dtplyr)
person_df %>%
  lazy_dt() %>%
  group_by(ids, exposure_zip = zips) %>%
  summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop") %>%
  left_join(exposure_df) %>%
  group_by(ids) %>%
  summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
            na_count = sum(is.na(exposure_value)), .groups = "drop") %>%
  left_join(person_df) %>%
  collect()

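@langtang's approach is clever: it recognizes that the sum within each id's range can be computed much more efficiently by subtracting the cumulative value on the day before the range starts from the cumulative value at the end of the range. This cuts the n=1M time to 4 seconds, even with dplyr's slower aggregation.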

exposure_df_cuml <- exposure_df %>%
  group_by(zips = exposure_zip) %>%
  transmute(value = exposure_day, 
            expo_cuml = cumsum(coalesce(exposure_value,0)),
            expo_na_cuml = cumsum(is.na(exposure_value))) %>%
  ungroup()

person_df %>%
  tidyr::pivot_longer(ends_with("dates")) %>%
  mutate(value = value - if_else(name == "start_dates", 1, 0)) %>%
  left_join(exposure_df_cuml) %>%
  mutate(across(starts_with("expo"), ~if_else(name == "start_dates", -.x, .x))) %>%
  group_by(ids, zips) %>%
  summarize(across(starts_with("expo"), sum), .groups = "drop")
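
The identity behind this can be checked on a toy vector in base R. This is only a sketch with made-up numbers for a single zip; `values`, `first_day`, and `last_day` are hypothetical names, not objects from the code above.

```r
# Hypothetical daily values for one zip, with two missing days.
values <- c(5, NA, 3, 8, 2, NA, 7)

# Running totals: NA counts as 0 for the sum and is tallied separately.
cum_value <- cumsum(ifelse(is.na(values), 0, values))
cum_na <- cumsum(is.na(values))

# Range of interest: positions 3 to 6 of `values`.
first_day <- 3
last_day <- 6

# Cumulative totals on the day before the range starts (0 if it starts on day 1).
before_value <- if (first_day > 1) cum_value[first_day - 1] else 0
before_na <- if (first_day > 1) cum_na[first_day - 1] else 0

range_sum <- cum_value[last_day] - before_value
range_na <- cum_na[last_day] - before_na

# Same answers as summing the slice directly.
stopifnot(range_sum == sum(values[first_day:last_day], na.rm = TRUE))
stopifnot(range_na == sum(is.na(values[first_day:last_day])))
```

Because the cumulative sums are computed once per zip, each person then needs only two lookups instead of a scan over their whole range.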

Answer:

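You should be able to get the run time for a million ids down to a few seconds.

The trick is to take a cumulative sum over the whole range of days, and then subtract the cumulative sum at the id's start date from the cumulative sum at its end date. This is very fast because it needs no row expansion and, apart from a direct merge, no grouping by id:

Step 1: within each zip, create cumulative sums of the exposure values (cval) and of the number of NA values (nas)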

exposure_dt[order(exposure_day), `:=`(
  cval=cumsum(fifelse(is.na(exposure_value),0,exposure_value)),
  nas = cumsum(is.na(exposure_value))
  ),exposure_zip]

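Step 2: melt the person_dt frame and merge it directly onto the exposure_dt frame. If a row is a start day and its exposure value is not NA, make sure to subtract that exposure value from the cumulative sum; similarly, subtract one from nas if the row is a start day and its exposure value is NA.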

library(magrittr)  # provides %>% (also available if dplyr is loaded)

# assumes person_dt still has only ids, start_dates, end_dates, zips
k <- melt(person_dt,id.vars = c("ids","zips")) %>% 
  .[exposure_dt, on=.(zips==exposure_zip, value=exposure_day), nomatch=0] %>% 
  .[variable=="start_dates" & is.na(exposure_value), nas:=nas-1] %>% 
  .[variable=="start_dates" & !is.na(exposure_value),cval:=cval-exposure_value] %>% 
  .[order(ids,value)]

Step 3: subtract the odd rows from the even rows, then cbind the result to person_dt

cbind(
  person_dt,
  k[seq(2,.N,2),.(cval,nas)] - k[seq(1,.N,2),.(cval,nas)]
)

On my machine, all of this takes 0.08 seconds on the original 1000-id dataset. With n=1000000 it takes about 1.1 seconds.

Output:

        ids start_dates  end_dates  zips  cval   nas
      <int>      <Date>     <Date> <int> <num> <int>
   1:     1  2000-10-10 2001-02-18 45108  3188     5
   2:     2  2000-11-11 2001-04-07 45190  3789     1
   3:     3  2000-03-05 2000-06-27 45117  2917     3
   4:     4  2001-02-04 2001-06-09 45142  2969     2
   5:     5  2000-03-16 2000-07-13 45197  2860     3
  ---                                               
 996:   996  2000-02-21 2000-07-29 45139  4250     2
 997:   997  2000-02-02 2000-07-15 45074  4407     4
 998:   998  2001-07-29 2001-11-15 45139  2686     3
 999:   999  2001-09-10 2001-12-20 45127  2581     1
1000:  1000  2000-10-15 2001-05-01 45010  4941     2
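
One way to sanity-check a result like this on a small sample is a direct data.table non-equi join, which computes each person's totals without any cumulative sums. This is not the method from the answer above and is expected to be slower. It is only a cross-check sketch, and the data and the names `exposure`, `people`, and `check` are made up for illustration.

```r
library(data.table)

# Hypothetical data: two zips, ten days each, a few missing values.
days <- seq(as.Date("2000-01-01"), by = "day", length.out = 10)
exposure <- data.table(
  exposure_day = rep(days, 2),
  exposure_zip = rep(c(45001L, 45002L), each = 10),
  exposure_value = c(1, NA, 3, 4, 5, 6, NA, 8, 9, 10,
                     2, 2, NA, 2, 2, 2, 2, NA, NA, 2)
)
people <- data.table(
  ids = 1:3,
  start_dates = as.Date(c("2000-01-01", "2000-01-03", "2000-01-05")),
  end_dates = as.Date(c("2000-01-04", "2000-01-08", "2000-01-10")),
  zips = c(45001L, 45002L, 45001L)
)

# Non-equi join: for each person, take the exposure rows in their zip and
# date range, and aggregate once per person (by = .EACHI).
check <- exposure[
  people,
  on = .(exposure_zip == zips,
         exposure_day >= start_dates,
         exposure_day <= end_dates),
  .(ids = i.ids,
    exposure_sum = sum(exposure_value, na.rm = TRUE),
    na_count = sum(is.na(exposure_value))),
  by = .EACHI
]

# Keep only the per-person summary columns.
check <- check[, .(ids, exposure_sum, na_count)]
```

On the real data, comparing `check` for a random subset of ids against the cval and nas columns would be a reasonable test before scaling up.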
