Conditionally counting occurrences in a data frame: performance improvement

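For each patient, I need to detect (among other things) the first non-"F" code that occurs after the patient's first "F" code. The code below seems to do this correctly, but it is far too slow when run on a server against a dataset of one million observations.

The final dataset should contain a variable with the number of non-F codes (nhosp) and a variable with the first non-F code found after the first F code appears in the DAIGNOSTICO variable (ficd). There should be no duplicate IDs.

How can this be improved in terms of both complexity and speed? Tidyverse pipes are preferred.

The result should look like this: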

# A tibble: 7 × 6
# Groups:   ID [7]
      ID DAIGNOSTICO data_entrada data_saida nhosp ficd
   <dbl> <chr>       <date>       <date>     <dbl> <chr>
1   1555 F180        1930-04-05   2005-03-15     1 T124
2   1234 F100        1980-04-01   2005-03-02     2 O155
3  16666 F120        1990-06-05   2005-03-18     0 <NA>
4 123456 F145        2001-03-07   2005-03-11     2 T123
5 177778 F155        2001-04-13   2005-03-22     2 G123
6 166666 F125        2002-03-12   2005-03-19     2 W345
7  12345 F150        2002-06-03   2005-03-07     4 K709
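
To make the two target columns concrete, here is a minimal base R sketch of the rule for a single patient. The helper name `summarise_codes` is hypothetical (it does not appear in the question), and the sketch assumes the codes are already in the intended order. Patients with no F code at all are simply absent from the expected result, which this helper does not handle.

```r
# Hypothetical helper: summarise one patient's codes, assumed to be in order.
summarise_codes <- function(codes) {
  is_f <- startsWith(codes, "F")
  after_first_f <- cumsum(is_f) > 0            # TRUE from the first F code onward
  later_non_f <- codes[after_first_f & !is_f]  # non-F codes after the first F
  list(
    nhosp = length(later_non_f),
    ficd  = if (length(later_non_f) > 0) later_non_f[1] else NA_character_
  )
}

# Codes of ID 1555 from the sample data, in file order
res_1555 <- summarise_codes(c("R155", "F180", "T124", "F708"))
# Codes of ID 16666: a single F code and nothing afterwards
res_16666 <- summarise_codes("F120")
```

For ID 1555 this gives nhosp = 1 and ficd = "T124", and for ID 16666 it gives nhosp = 0 and a missing ficd, matching rows 1 and 3 of the table above.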

This is what my code currently looks like:

library(readr)
library(dplyr)
library(tidyr)

simulation <- read_csv("SIMULADO.txt",  col_types = cols(
    data_entrada = col_date("%d/%m/%Y"),
    data_saida = col_date("%d/%m/%Y")
  )
)

simulation <- as.data.frame(simulation)

simulation[, "nhosp"] <- 0
oldpos <- 1

for (i in 1:nrow(simulation)) {
  if (grepl("F", simulation[i, "DAIGNOSTICO"])) { # has an F code?
    oldpos <- i
    clin <- 0
    simulation[i, "hasF"] <- TRUE
  } else {
    simulation[i, "hasF"] <- FALSE
  }
  if (simulation[i, "ID"] == simulation[oldpos, "ID"]) { # same person?
    if (simulation[oldpos, "hasF"] == TRUE) { # did this person have an F code?
      simulation[i, "hasF"] <- TRUE
      if (simulation[i, "data_entrada"] > simulation[oldpos, "data_entrada"]) { # is it subsequent?
        if (!grepl("F", simulation[i, "DAIGNOSTICO"])) { # non-F code?
          simulation[i, "hasC"] <- TRUE
          clin <- 1
          simulation[i, "ficd"] <- simulation[i, "DAIGNOSTICO"]
          simulation[i, "nhosp"] <- clin
          first_cc <- simulation[i, "DAIGNOSTICO"]
        }
      }
    }
  }
}

dt1 <- simulation %>%
  arrange(data_entrada) %>%
  group_by(ID) %>%
  select(ficd) %>%
  drop_na() %>%
  slice(1)

dt2 <- simulation %>%
  arrange(data_entrada) %>%
  group_by(ID) %>%
  filter(hasF == T) %>%
  mutate(nhosp = cumsum(nhosp),
         nhosp = max(nhosp)) %>%
  select(-ficd,-hasF, -hasC) %>%
  distinct(ID, .keep_all = TRUE) %>%
  full_join(dt1, by = "ID")

dt2

Here is a sample dataset, which includes some errors to check the robustness of the code:

ID,   DAIGNOSTICO, data_entrada,    data_saida
123490, O100,   01/04/1980, 02/03/2005
123490, O100,   01/04/1981, 02/03/2005
123491, O101,   01/04/1980, 02/03/2005
123491, O101,   01/04/1981, 02/03/2005
1234,   F100,   01/04/1980, 02/03/2005
1234,   O155,   02/04/1980, 03/03/2005
1234,   G123,   05/05/1982, 04/03/2005
12345,  T124,   01/06/2002, 05/03/2005
12345,  Y124,   02/06/2002, 06/03/2005
12345,  F150,   03/06/2002, 07/03/2005
12345,  K709,   04/06/2002, 08/03/2005
12345,  Y709,   05/06/2002, 09/03/2005
12345,  F150,   03/06/2002, 07/03/2005
12345,  K710,   06/06/2002, 08/03/2005
12345,  K711,   07/06/2002, 10/03/2005
12345,  F150,   08/06/2002, 07/03/2005
123456, F145,   07/03/2001, 11/03/2005
123456, T123,   08/03/2001, 12/03/2005
123456, P123,   09/03/2001, 13/03/2005
1555    ,R155,  04/04/1930, 14/03/2005
1555    ,F180,  05/04/1930, 15/03/2005
1555    ,T124,  06/04/1930, 16/03/2005
1555    ,F708,  07/04/1930, 17/03/2005
16666   ,F120,  05/06/1990, 18/03/2005
166666, F125,   12/03/2002, 19/03/2005
166666, W345,   13/03/2002, 20/03/2005
166666, L123,   14/03/2002, 21/03/2005
177778, F155,   13/04/2001, 22/03/2005
177778, G123,   14/04/2001, 23/03/2005
177778, F190,   15/04/2001, 24/03/2005
177778, E124,   16/04/2001, 25/03/2005
177779, G155,   13/04/2001, 22/03/2005
177779, G123,   14/04/2001, 23/03/2005
177779, G190,   15/04/2001, 24/03/2005
177779, E124,   16/04/2001, 25/03/2005

Answer from a uj5u.com user:

You could use

library(dplyr)
library(stringr)

df %>% 
  group_by(ID) %>% 
  filter(cumsum(str_detect(DAIGNOSTICO, "^F")) > 0) %>% 
  mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
         ficd  = lead(DAIGNOSTICO)) %>% 
  filter(str_detect(DAIGNOSTICO, "^F")) %>% 
  slice(1) %>% 
  ungroup()

This returns

# A tibble: 7 x 6
      ID DAIGNOSTICO data_entrada data_saida nhosp ficd 
   <dbl> <chr>       <chr>        <chr>      <int> <chr>
1   1234 F100        01/04/1980   02/03/2005     2 O155 
2   1555 F180        05/04/1930   15/03/2005     1 T124 
3  12345 F150        03/06/2002   07/03/2005     4 K709 
4  16666 F120        05/06/1990   18/03/2005     0 NA   
5 123456 F145        07/03/2001   11/03/2005     2 T123 
6 166666 F125        12/03/2002   19/03/2005     2 W345 
7 177778 F155        13/04/2001   22/03/2005     2 G123
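
The output above shows data_entrada as `<chr>`, so the pipeline works in file order rather than date order, whereas the question sorts by data_entrada first. A minimal sketch of parsing and sorting the dates beforehand follows, assuming the `%d/%m/%Y` format from the question's `read_csv()` call. The small data frame is illustrative only, and `df_sorted` is a name introduced here.

```r
library(dplyr)

# Illustrative rows for one patient, deliberately out of date order
df <- data.frame(
  ID           = c(1555, 1555, 1555, 1555),
  DAIGNOSTICO  = c("T124", "F708", "R155", "F180"),
  data_entrada = c("06/04/1930", "07/04/1930", "04/04/1930", "05/04/1930"),
  stringsAsFactors = FALSE
)

df_sorted <- df %>%
  # turn the day/month/year strings into real Date values
  mutate(data_entrada = as.Date(data_entrada, format = "%d/%m/%Y")) %>%
  # order each patient's rows chronologically
  arrange(ID, data_entrada)
```

The grouped pipeline can then be run on `df_sorted` instead of `df`. Rows that share the same ID and date keep their original relative order, so ties may still need a deliberate rule.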

Edit

I think the code above may have a flaw, and perhaps

library(dplyr)
library(stringr)

df %>% 
  group_by(ID) %>% 
  filter(
    cumsum(str_detect(DAIGNOSTICO, "^F")) == 1 | 
      !str_detect(DAIGNOSTICO, "^F") & cumsum(str_detect(DAIGNOSTICO, "^F")) > 0
    ) %>% 
  mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
         ficd  = lead(DAIGNOSTICO)) %>% 
  filter(str_detect(DAIGNOSTICO, "^F")) %>% 
  slice(1) %>% 
  ungroup()

is a better solution.
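
One case where the two versions differ, as far as I can tell, is a patient whose first F code is immediately followed by another F code. The sketch below wraps both pipelines in functions (`first_version` and `edited_version` are names introduced here for illustration) and runs them on a made-up patient that is not part of the sample data.

```r
library(dplyr)
library(stringr)

# Made-up patient: two consecutive F codes, then two non-F codes
df <- data.frame(
  ID          = c(1, 1, 1, 1),
  DAIGNOSTICO = c("F100", "F101", "G123", "K200"),
  stringsAsFactors = FALSE
)

# First pipeline: keeps every row from the first F code onward
first_version <- function(data) {
  data %>%
    group_by(ID) %>%
    filter(cumsum(str_detect(DAIGNOSTICO, "^F")) > 0) %>%
    mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
           ficd  = lead(DAIGNOSTICO)) %>%
    filter(str_detect(DAIGNOSTICO, "^F")) %>%
    slice(1) %>%
    ungroup()
}

# Edited pipeline: drops every F row after the first one before calling lead()
edited_version <- function(data) {
  data %>%
    group_by(ID) %>%
    filter(
      cumsum(str_detect(DAIGNOSTICO, "^F")) == 1 |
        !str_detect(DAIGNOSTICO, "^F") & cumsum(str_detect(DAIGNOSTICO, "^F")) > 0
    ) %>%
    mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
           ficd  = lead(DAIGNOSTICO)) %>%
    filter(str_detect(DAIGNOSTICO, "^F")) %>%
    slice(1) %>%
    ungroup()
}

res_first  <- first_version(df)
res_edited <- edited_version(df)
```

Here `res_first$ficd` is "F101", because `lead()` picks up whatever row follows the first F code, while `res_edited$ficd` is "G123". Both versions report nhosp = 2.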
