I am trying to clean up a data frame by deleting rows that were added by mistake.
Here is some dummy data:
temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002",
"02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
"08/07/2002", "15/07/2002", "17/07/2002",
"22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"),
payment = c(200, 1000,-1000, -1000, 1000,
-1000,-1000,-1000, 1200,
-1200, 1200, 1200,
200, 56700, -56700, -200),
Code = c("ABC", "M567", "M567","M567", "XYZ", "M567", "ABX" ,
"M567","M567", "M567",
"M567", "M567", "M300",
"M678", "M678", "ABC"),
ID = c(NA, "98","187","187","12ee","M11","M13",
NA,"K999",
"K999", "111", "111", "11",
"12345", NA, NA)), row.names = c(NA, -16L), class = "data.frame")
The data frame looks like this:
         Date payment Code    ID
1  24/06/2002     200  ABC  <NA>
2  24/06/2002    1000 M567    98
3  25/06/2002   -1000 M567   187
4  25/06/2002   -1000 M567   187
5  26/06/2002    1000  XYZ  12ee
6  02/07/2002   -1000 M567   M11
7  03/07/2002   -1000  ABX   M13
8  24/07/2002   -1000 M567  <NA>
9  08/07/2002    1200 M567  K999
10 08/07/2002   -1200 M567  K999
11 15/07/2002    1200 M567   111
12 17/07/2002    1200 M567   111
13 22/07/2002     200 M300    11
14 22/07/2002   56700 M678 12345
15 28/07/2002  -56700 M678  <NA>
16 29/07/2002    -200  ABC  <NA>
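As you can see, the data contains both positive and negative payments. Negative payments are either transactions added by mistake or refunds.
For example, 1200 is cancelled out by -1200 when Code and ID match. Rows 14 and 15 are a similar pair, but the ID of one of them is NA, so I have to fill it with the ID of the matching positive payment row (or vice versa). After that I can delete both rows.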
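To make the cancellation rule concrete, here is a minimal sketch that assumes the IDs are already complete. The data frame `toy` and the helper columns `abs_payment`, `sgn` and `k` are made up for illustration and are not part of the data above. Rows are numbered per sign inside each (Code, ID, absolute payment) group, and a row is dropped when a row of the opposite sign carries the same number.

```r
library(dplyr)

# Hypothetical toy data: one payment of 1200 that was refunded, plus two others
toy <- data.frame(
  payment = c(1200, -1200, 1200, 500),
  Code    = c("M567", "M567", "M567", "XYZ"),
  ID      = c("K999", "K999", "111", "7")
)

cancelled <- toy %>%
  mutate(abs_payment = abs(payment), sgn = sign(payment)) %>%
  # number the rows separately for positive and negative payments
  group_by(Code, ID, abs_payment, sgn) %>%
  mutate(k = row_number()) %>%
  # the k-th positive and the k-th negative row form a pair
  group_by(Code, ID, abs_payment, k) %>%
  # keep only rows that have no opposite-sign partner
  filter(n() == 1) %>%
  ungroup() %>%
  select(payment, Code, ID)

print(cancelled)
```

In this toy case the K999 pair is removed and the two unmatched payments remain. Pairing by position is one possible choice; it does not look at dates.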
Here is the code I tried, with help from programmers on StackOverflow (I asked about this before):
library(dplyr)
library(data.table)  # provides rowid()
library(tidyr)       # provides fill()
Final_df <- temp %>%
  group_by(Code) %>%
  mutate(ind = rowid(payment)) %>%
  group_by(ind, .add = TRUE) %>%
  # fill a missing ID from a neighbouring row in the same (Code, ind) group
  fill(ID, .direction = 'downup') %>%
  ungroup %>%
  mutate(absPayment = abs(payment)) %>%
  arrange(ID, Code, absPayment) %>%
  group_by(Code, ID, absPayment) %>%
  mutate(grp = rowid(sign(payment))) %>%
  group_by(grp, .add = TRUE) %>%
  # keep only rows without an opposite-sign partner
  filter(n() == 1) %>%
  ungroup %>%
  select(names(temp))
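The problem is row 8 (24/07/2002 -1000 M567), which should be filled from row 2 because the Code matches and row 2 holds the corresponding positive payment. That way I could cancel these two rows later. Because row 2 is far away from row 8, .direction = 'downup' does not work.
I also think there must be a better way to fill the NAs than relying on a direction, since a direction never reaches similar rows that are far apart.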
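One way to fill the NAs without depending on row distance is a lookup join, sketched below under some assumptions. The data frame `toy` and the names `lookup`, `cand_ID`, `abs_payment` and `sgn` are hypothetical. The idea is that a row with a missing ID borrows the ID of a row with the same Code and the same absolute payment but the opposite sign. When several such IDs exist, `distinct()` simply keeps the first one, which is an arbitrary choice.

```r
library(dplyr)

# Hypothetical toy data: the third row should pick up ID "98"
toy <- data.frame(
  payment = c(-1000, 1000, -1000, -1000, 200),
  Code    = c("M567", "M567", "M567", "ABX", "ABC"),
  ID      = c("187", "98", NA, NA, NA)
)

# One candidate ID per (Code, abs_payment, sign of the receiving row).
# The sign is flipped so that a positive row offers its ID to negative rows.
lookup <- toy %>%
  filter(!is.na(ID)) %>%
  mutate(abs_payment = abs(payment), sgn = -sign(payment)) %>%
  select(Code, abs_payment, sgn, cand_ID = ID) %>%
  distinct(Code, abs_payment, sgn, .keep_all = TRUE)

filled <- toy %>%
  mutate(abs_payment = abs(payment), sgn = sign(payment)) %>%
  left_join(lookup, by = c("Code", "abs_payment", "sgn")) %>%
  # existing IDs win; the candidate is used only where ID is NA
  mutate(ID = coalesce(ID, cand_ID)) %>%
  select(payment, Code, ID)

print(filled)
```

Rows with no opposite-sign counterpart (ABX and ABC here) keep their NA. The cancellation step would still have to run after this fill.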
The expected output is:
        Date payment Code   ID
1 25/06/2002   -1000 M567  187
2 25/06/2002   -1000 M567  187
3 26/06/2002    1000  XYZ 12ee
4 02/07/2002   -1000 M567  M11
5 03/07/2002   -1000  ABX  M13
6 15/07/2002    1200 M567  111
7 17/07/2002    1200 M567  111
8 22/07/2002     200 M300   11
I have been stuck on this for 5 days. Any solution would be very helpful.
Thanks in advance.
Another possible set of dummy data:
temp_2 <- structure(list(Date = c("22/06/2002", "23/06/2002","24/06/2002", "25/06/2002","25/06/2002", "26/06/2002",
"02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
"08/07/2002", "15/07/2002", "17/07/2002",
"22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"),
payment = c(200,-1000, 1000,-1000, -1000, 1000,
-1000,-1000,-1000, 1200,
-1200, 1200, 1200,
200, 56700, -56700, -200),
Code = c("ABC", "M567","M567", "M567","M567", "XYZ", "M567", "ABX" ,
"M567","M567", "M567",
"M567", "M567", "M300",
"M678", "M678", "ABC"),
ID = c(NA,"187", "98","187","187","12ee",NA,NA,
NA,"K999",
"K999", "111", "111", "11",
"12345", NA, NA)), row.names = c(NA, -17L), class = "data.frame")
Expected output for temp_2:
        Date payment Code   ID
1 23/06/2002   -1000 M567  187
2 25/06/2002   -1000 M567  187
3 25/06/2002   -1000 M567  187
4 26/06/2002    1000  XYZ 12ee
5 03/07/2002   -1000  ABX <NA>
6 24/07/2002   -1000 M567   98
7 15/07/2002    1200 M567  111
8 17/07/2002    1200 M567  111
9 22/07/2002     200 M300   11
Reply from a uj5u.com user:
Here is my attempt at solving it. The trick is to replace the NAs correctly.
library(dplyr)
# fill NAs according to their values
temp <- temp %>%
mutate(abs_payment = abs(payment)) %>%
group_by(abs_payment, ID, Code) %>%
# should consider replacement only if ID has only one row or if it is NA
mutate(is_candidate = (n() == 1) | is.na(ID)) %>%
group_by(abs_payment, Code) %>%
# we do not want to replace IDs for non-na IDs
mutate(new_ID = case_when(is_candidate & is.na(ID) ~ na.omit(ID)[1],
TRUE ~ ID))
# remove if sum equal to 0
temp <- temp %>%
group_by(Code, new_ID, abs_payment) %>%
mutate(total = sum(payment)) %>%
filter(total != 0 )
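One detail worth checking in the snippet above: `na.omit(ID)[1]` takes the first non-missing ID in each (abs_payment, Code) group, so the filled value depends on row order whenever a group holds several distinct IDs. A minimal base R sketch with hypothetical vectors (the names `ids_a`, `ids_b`, `fill_a` and `fill_b` are made up for illustration):

```r
# Hypothetical IDs of one (abs_payment, Code) group, in two different row orders
ids_a <- c("98", "187", NA)
ids_b <- c("187", "98", NA)

# na.omit() drops the NA; [1] then takes whichever ID happens to come first
fill_a <- na.omit(ids_a)[1]
fill_b <- na.omit(ids_b)[1]

print(fill_a)  # "98"
print(fill_b)  # "187"
```

In temp the first non-missing ID among the 1000 / M567 rows is 98, which is the ID the question expects. In temp_2 the first one is 187, so the result there may differ from the expected output.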
Reply from a uj5u.com user:
We can try
library(dplyr)
library(data.table)  # provides rowid()
library(tidyr)       # provides fill()
temp %>%
group_by(Code) %>%
filter(sum(payment) != 0) %>%
arrange(Code, abs(payment), !is.na(ID)) %>%
mutate(ind = rowid(payment)) %>%
group_by(ind, .add = TRUE) %>%
fill(ID, .direction = "downup") %>%
ungroup %>%
mutate(absPayment = abs(payment)) %>%
arrange(ID, Code, absPayment) %>%
group_by(Code, ID, absPayment) %>%
mutate(grp = rowid(sign(payment))) %>%
group_by(grp, .add = TRUE) %>%
filter(n() == 1) %>%
ungroup %>%
select(names(temp))
-output
# A tibble: 8 × 4
  Date       payment Code  ID
  <chr>        <dbl> <chr> <chr>
1 22/07/2002     200 M300  11
2 15/07/2002    1200 M567  111
3 17/07/2002    1200 M567  111
4 26/06/2002    1000 XYZ   12ee
5 25/06/2002   -1000 M567  187
6 25/06/2002   -1000 M567  187
7 02/07/2002   -1000 M567  M11
8 03/07/2002   -1000 ABX   M13
For the second case:
temp_2 %>%
group_by(Code) %>%
filter(sum(payment) != 0) %>%
arrange(Code, abs(payment), !is.na(ID)) %>%
mutate(ind = rowid(payment)) %>%
group_by(ind, .add = TRUE) %>%
fill(ID, .direction = "downup") %>%
ungroup %>%
mutate(absPayment = abs(payment)) %>%
arrange(ID, Code, absPayment) %>%
group_by(Code, ID, absPayment) %>%
mutate(grp = rowid(sign(payment))) %>%
group_by(grp, .add = TRUE) %>%
filter(n() == 1) %>%
ungroup %>%
select(names(temp_2))
-output
# A tibble: 9 × 4
  Date       payment Code  ID
  <chr>        <dbl> <chr> <chr>
1 22/07/2002     200 M300  11
2 24/07/2002   -1000 M567  111
3 15/07/2002    1200 M567  111
4 17/07/2002    1200 M567  111
5 26/06/2002    1000 XYZ   12ee
6 23/06/2002   -1000 M567  187
7 25/06/2002   -1000 M567  187
8 25/06/2002   -1000 M567  187
9 03/07/2002   -1000 ABX   <NA>
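Reply from a uj5u.com user:
I'm not sure this is correct, but here is my attempt. I don't understand how you arrived at the expected output. Are there other filtering criteria? Your original data has more than 8 rows.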
library(tidyverse)
temp |>
mutate(Date = lubridate::dmy(Date)) |>
arrange(Code, abs(payment)) |>
group_by(Code, abs(payment), ID) |>
mutate(n = n()) |>
ungroup()|>
group_by(Code, abs(payment), n) |>
fill(ID, .direction = "updown") |>
ungroup()|>
select(names(temp)) |>
arrange(Date, abs(payment))
#> # A tibble: 16 x 4
#>    Date       payment Code  ID
#>    <date>       <dbl> <chr> <chr>
#>  1 2002-06-24     200 ABC   <NA>
#>  2 2002-06-24    1000 M567  98
#>  3 2002-06-25   -1000 M567  187
#>  4 2002-06-25   -1000 M567  187
#>  5 2002-06-26    1000 XYZ   12ee
#>  6 2002-07-02   -1000 M567  M11
#>  7 2002-07-03   -1000 ABX   M13
#>  8 2002-07-08    1200 M567  K999
#>  9 2002-07-08   -1200 M567  K999
#> 10 2002-07-15    1200 M567  111
#> 11 2002-07-17    1200 M567  111
#> 12 2002-07-22     200 M300  11
#> 13 2002-07-22   56700 M678  12345
#> 14 2002-07-24   -1000 M567  M11
#> 15 2002-07-28  -56700 M678  12345
#> 16 2002-07-29    -200 ABC   <NA>
