Removing erroneously added rows based on several columns in R

I am trying to clean a data frame by removing rows that were added in error.

Here is the dummy data:

temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002", 
                               "02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
                               "08/07/2002", "15/07/2002", "17/07/2002", 
                               "22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"), 
                      payment = c(200, 1000,-1000, -1000, 1000,
                                  -1000,-1000,-1000, 1200,
                                  -1200, 1200, 1200,
                                  200, 56700, -56700, -200), 
                      Code = c("ABC", "M567", "M567","M567", "XYZ", "M567", "ABX" ,
                               "M567","M567", "M567", 
                               "M567", "M567", "M300", 
                               "M678", "M678", "ABC"), 
                      ID = c(NA, "98","187","187","12ee","M11","M13",
                             NA,"K999", 
                             "K999", "111", "111", "11",
                             "12345", NA, NA)), row.names = c(NA, -16L), class = "data.frame")

The data frame looks like this:

         Date payment Code    ID
1  24/06/2002     200  ABC  <NA>
2  24/06/2002    1000 M567    98
3  25/06/2002   -1000 M567   187
4  25/06/2002   -1000 M567   187
5  26/06/2002    1000  XYZ  12ee
6  02/07/2002   -1000 M567   M11
7  03/07/2002   -1000  ABX   M13
8  24/07/2002   -1000 M567  <NA>
9  08/07/2002    1200 M567  K999
10 08/07/2002   -1200 M567  K999
11 15/07/2002    1200 M567   111
12 17/07/2002    1200 M567   111
13 22/07/2002     200 M300    11
14 22/07/2002   56700 M678 12345
15 28/07/2002  -56700 M678  <NA>
16 29/07/2002    -200  ABC  <NA>

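As you can see, the data contains both positive and negative payments. The negative payments are either erroneously added transactions or refunds.

For example, the payment of 1200 is cancelled by -1200 because Code and ID both match. Rows 14 and 15 are a similar pair, but the ID in row 15 is NA, so I have to fill it with the ID of its positive payment row (and vice versa in the opposite case). Once the IDs match, I can remove both rows.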
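The cancellation rule described above can be sketched in base R. This is only an illustration, not the asker's code: the toy data frame copies rows 9 to 12 of the dummy data, and the names `toy`, `net` and `kept` are made up here. It assumes a group cancels when its payments sum to zero within the same Code, ID and absolute amount.

```r
# Rows 9-12 of the dummy data: a cancelling pair (K999) and two genuine payments (111)
toy <- data.frame(
  payment = c(1200, -1200, 1200, 1200),
  Code    = c("M567", "M567", "M567", "M567"),
  ID      = c("K999", "K999", "111", "111"),
  stringsAsFactors = FALSE
)

# Net payment within each (Code, ID, absolute amount) group
toy$net <- ave(toy$payment, toy$Code, toy$ID, abs(toy$payment), FUN = sum)

# Keep only the rows whose group does not net to zero
kept <- toy[toy$net != 0, c("payment", "Code", "ID")]
print(kept)
```

The K999 pair nets to zero and is dropped, while the two payments for ID 111 stay.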

This is the code I tried, with help from programmers on StackOverflow (from a previous question):

library(dplyr)
library(data.table)  # for rowid()
library(tidyr)       # for fill()
Final_df <- temp %>% 
  group_by(Code) %>%
  # number repeated payment values within each Code
  mutate(ind = rowid(payment)) %>%
  group_by(ind, .add = TRUE) %>% 
  fill(ID, .direction = 'downup') %>% 
  ungroup() %>%
  mutate(absPayment = abs(payment)) %>% 
  arrange(ID, Code, absPayment) %>%
  group_by(Code, ID, absPayment) %>%
  # pair the k-th positive with the k-th negative payment
  mutate(grp = rowid(sign(payment))) %>% 
  group_by(grp, .add = TRUE) %>%
  # keep only rows without a counterpart
  filter(n() == 1) %>% 
  ungroup() %>%   
  select(names(temp))

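The problem is row 8 (24/07/2002, -1000, M567): its ID should be filled from row 2, because the Code matches and row 2 holds the corresponding positive payment, so that the two rows can later cancel. Because row 2 is far away from row 8, .direction = 'downup' does not work.

I also think there must be a better way to fill the NAs than relying on a fill direction, since a direction never reaches matching rows that are far away.

The expected output is: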
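One way to fill the NAs without depending on row distance is to look up a counterpart row explicitly. The sketch below is a hypothetical base-R helper (`fill_na_id` is not from the original post): for each NA row it takes the ID of the first unused row with the same Code and the exact opposite payment. The toy data copies rows 2, 6, 8, 14 and 15 of the dummy data.

```r
# Hypothetical helper: greedy one-to-one lookup of a counterpart row
fill_na_id <- function(df) {
  known <- !is.na(df$ID)          # rows whose ID was present from the start
  used  <- rep(FALSE, nrow(df))   # donors that have already given their ID
  for (i in which(!known)) {
    # candidates: same Code, opposite payment, known ID, not used yet
    cand <- which(df$Code == df$Code[i] &
                  df$payment == -df$payment[i] &
                  known & !used)
    if (length(cand) > 0) {
      df$ID[i] <- df$ID[cand[1]]
      used[cand[1]] <- TRUE
    }
  }
  df
}

toy <- data.frame(
  Date    = c("24/06/2002", "02/07/2002", "24/07/2002", "22/07/2002", "28/07/2002"),
  payment = c(1000, -1000, -1000, 56700, -56700),
  Code    = c("M567", "M567", "M567", "M678", "M678"),
  ID      = c("98", "M11", NA, "12345", NA),
  stringsAsFactors = FALSE
)

filled <- fill_na_id(toy)
print(filled)
```

Here the -1000 row from 24/07/2002 picks up ID 98 from the +1000 row, however many rows lie between them. Rows with no opposite-signed counterpart keep their NA.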

         Date payment Code    ID
1  25/06/2002   -1000 M567   187
2  25/06/2002   -1000 M567   187
3  26/06/2002    1000  XYZ  12ee
4  02/07/2002   -1000 M567   M11
5  03/07/2002   -1000  ABX   M13
6  15/07/2002    1200 M567   111
7  17/07/2002    1200 M567   111
8  22/07/2002     200 M300    11

I have been stuck on this for 5 days. Any solution would be very helpful.

Thanks in advance.

Another possible dummy data set:

temp_2 <-  structure(list(Date = c("22/06/2002", "23/06/2002","24/06/2002", "25/06/2002","25/06/2002", "26/06/2002", 
                               "02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
                               "08/07/2002", "15/07/2002", "17/07/2002", 
                               "22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"), 
                      payment = c(200,-1000, 1000,-1000, -1000, 1000,
                                  -1000,-1000,-1000, 1200,
                                  -1200, 1200, 1200,
                                  200, 56700, -56700, -200), 
                      Code = c("ABC", "M567","M567", "M567","M567", "XYZ", "M567", "ABX" ,
                               "M567","M567", "M567", 
                               "M567", "M567", "M300", 
                               "M678", "M678", "ABC"), 
                      ID = c(NA,"187", "98","187","187","12ee",NA,NA,
                             NA,"K999", 
                             "K999", "111", "111", "11",
                             "12345", NA, NA)), row.names = c(NA, -17L), class = "data.frame")

Expected output for temp_2:

         Date payment Code    ID
1  23/06/2002   -1000 M567   187
2  25/06/2002   -1000 M567   187
3  25/06/2002   -1000 M567   187
4  26/06/2002    1000  XYZ  12ee
5  03/07/2002   -1000  ABX  <NA>
6  24/07/2002   -1000 M567    98
7  15/07/2002    1200 M567   111
8  17/07/2002    1200 M567   111
9  22/07/2002     200 M300    11

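uj5u.com user replied:

Here is my attempt at solving it. The trick is to replace the NAs correctly.

library(dplyr)

# fill NA IDs from rows with the same Code and absolute payment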
temp <- temp %>% 
  mutate(abs_payment = abs(payment)) %>% 
  group_by(abs_payment, ID, Code) %>% 
  # should consider replacement only if ID has only one row or if it is NA
  mutate(is_candidate = (n() == 1) | is.na(ID)) %>%
  group_by(abs_payment, Code) %>% 
  # we do not want to replace IDs for non-na IDs 
  mutate(new_ID = case_when(is_candidate & is.na(ID) ~ na.omit(ID)[1],
                            TRUE ~ ID))


# remove if sum equal to 0 
temp <- temp %>% 
  group_by(Code, new_ID, abs_payment) %>% 
  mutate(total = sum(payment)) %>% 
  filter(total != 0 )
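Note that filtering on a zero sum removes a group only when it holds as many positive as negative rows; with one positive and two negative rows the sum is nonzero and all three rows stay. A stricter, pairwise cancellation can be sketched in base R as follows. `cancel_pairs` is a hypothetical helper, and the toy data assumes that the two NA rows of temp_2 for M567 have already been filled with ID 98.

```r
# Hypothetical helper: drop the k-th positive together with the k-th negative row
cancel_pairs <- function(df) {
  # NA IDs become the text "NA" in the key, so NA rows pair with each other
  key <- paste(df$Code, df$ID, abs(df$payment), sep = "|")
  sgn <- sign(df$payment)
  # occurrence number of each row within its (key, sign) group
  occ <- ave(seq_len(nrow(df)), key, sgn, FUN = seq_along)
  pair_key <- paste(key, occ, sep = "|")
  # a pair exists when both signs share the same key and occurrence number
  n_signs <- ave(sgn, pair_key, FUN = function(s) length(unique(s)))
  df[n_signs == 1, , drop = FALSE]
}

toy <- data.frame(
  Date    = c("24/06/2002", "02/07/2002", "24/07/2002"),
  payment = c(1000, -1000, -1000),
  Code    = "M567",
  ID      = "98",
  stringsAsFactors = FALSE
)

left <- cancel_pairs(toy)
print(left)
```

The +1000 row cancels against the first -1000 row, and the second -1000 row (24/07/2002) is left over.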

uj5u.com user replied:

We can try:

library(dplyr)
library(tidyr)       # for fill()
library(data.table)  # for rowid()
temp %>% 
 group_by(Code) %>%
  filter(sum(payment) != 0) %>%
  arrange(Code, abs(payment), !is.na(ID)) %>% 
  mutate(ind = rowid(payment)) %>%
  group_by(ind, .add = TRUE) %>% 
  fill(ID, .direction = "downup") %>%
  ungroup %>%    
   mutate(absPayment = abs(payment)) %>% 
   arrange(ID, Code, absPayment) %>%
   group_by(Code, ID, absPayment) %>%
   mutate(grp = rowid(sign(payment))) %>% 
   group_by(grp, .add = TRUE) %>%
   filter(n() == 1) %>% 
   ungroup %>%   
   select(names(temp))

Output:

# A tibble: 8 × 4
  Date       payment Code  ID   
  <chr>        <dbl> <chr> <chr>
1 22/07/2002     200 M300  11   
2 15/07/2002    1200 M567  111  
3 17/07/2002    1200 M567  111  
4 26/06/2002    1000 XYZ   12ee 
5 25/06/2002   -1000 M567  187  
6 25/06/2002   -1000 M567  187  
7 02/07/2002   -1000 M567  M11  
8 03/07/2002   -1000 ABX   M13
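The pipeline above relies on rowid() from data.table, which numbers the repeated occurrences of each value. As a rough illustration of what that step computes, the base-R sketch below produces the same kind of running count with ave(); the vector `x` and the name `run_id` are made up for this example.

```r
# Payments as they might appear within one Code group
x <- c(1000, -1000, -1000, 1000, -1000)

# Running count of each distinct value, in order of appearance
run_id <- ave(seq_along(x), x, FUN = seq_along)
print(run_id)
```

The first 1000 and the first -1000 both get number 1, the second of each gets number 2, and so on, which is how positive and negative rows end up in the same `ind` group.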

For the second case:

temp_2 %>% 
 group_by(Code) %>%
 filter(sum(payment) != 0) %>% 
 arrange(Code, abs(payment), !is.na(ID)) %>% 
 mutate(ind = rowid(payment)) %>%
  group_by(ind, .add = TRUE) %>% 
  fill(ID, .direction = "downup") %>% 
  ungroup %>%
  mutate(absPayment = abs(payment)) %>% 
  arrange(ID, Code, absPayment) %>%
  group_by(Code, ID, absPayment) %>%
  mutate(grp = rowid(sign(payment))) %>% 
  group_by(grp, .add = TRUE) %>%
  filter(n() == 1) %>% 
  ungroup %>%   
  select(names(temp_2))

Output:

# A tibble: 9 × 4
  Date       payment Code  ID   
  <chr>        <dbl> <chr> <chr>
1 22/07/2002     200 M300  11   
2 24/07/2002   -1000 M567  111  
3 15/07/2002    1200 M567  111  
4 17/07/2002    1200 M567  111  
5 26/06/2002    1000 XYZ   12ee 
6 23/06/2002   -1000 M567  187  
7 25/06/2002   -1000 M567  187  
8 25/06/2002   -1000 M567  187  
9 03/07/2002   -1000 ABX   <NA>

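uj5u.com user replied:

I am not sure this is correct, but here is my attempt. I do not know how you arrived at the expected output. Are there other filtering criteria? Your original data has more than 8 rows.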

library(tidyverse)

temp |>
  mutate(Date = lubridate::dmy(Date)) |>
  arrange(Code, abs(payment)) |>
  group_by(Code, abs(payment), ID) |>
  mutate(n = n()) |>
  ungroup()|>
  group_by(Code, abs(payment), n) |>
  fill(ID, .direction = "updown") |>
  ungroup()|>
  select(names(temp)) |>
  arrange(Date, abs(payment))
#> # A tibble: 16 x 4
#>    Date       payment Code  ID   
#>    <date>       <dbl> <chr> <chr>
#>  1 2002-06-24     200 ABC   <NA> 
#>  2 2002-06-24    1000 M567  98   
#>  3 2002-06-25   -1000 M567  187  
#>  4 2002-06-25   -1000 M567  187  
#>  5 2002-06-26    1000 XYZ   12ee 
#>  6 2002-07-02   -1000 M567  M11  
#>  7 2002-07-03   -1000 ABX   M13  
#>  8 2002-07-08    1200 M567  K999 
#>  9 2002-07-08   -1200 M567  K999 
#> 10 2002-07-15    1200 M567  111  
#> 11 2002-07-17    1200 M567  111  
#> 12 2002-07-22     200 M300  11   
#> 13 2002-07-22   56700 M678  12345
#> 14 2002-07-24   -1000 M567  M11  
#> 15 2002-07-28  -56700 M678  12345
#> 16 2002-07-29    -200 ABC   <NA>
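The answer above converts Date with lubridate::dmy() before sorting. The reason is that dd/mm/yyyy strings do not sort chronologically as text. A minimal base-R sketch of the same idea, using as.Date() with an explicit format (the vector `d` holds three dates taken from the dummy data):

```r
d <- c("24/07/2002", "08/07/2002", "24/06/2002")

# Sorting the raw strings compares the day first, which is not chronological
as_text <- sort(d)

# Parsing as day/month/year gives real dates that order correctly
parsed <- as.Date(d, format = "%d/%m/%Y")
chronological <- d[order(parsed)]

print(as_text)
print(chronological)
```

Text sorting puts 08/07/2002 before 24/06/2002, while ordering by the parsed dates puts 24/06/2002 first.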

