I have the following df:
df <- data.frame(comp_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
year = c("2016", "2016", "2016", "2017","2017", "2017", "2016","2016", "2016", "2017", "2017", "2017"),
indicator = c("total_revenue", "overseas_revenue", "domestic_revenue", "total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue"),
value = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA))
The df looks like this:
| comp_name | year | indicator | value |
|---|---|---|---|
| A | 2016 | total_revenue | 100 |
| A | 2016 | overseas_revenue | NA |
| A | 2016 | domestic_revenue | NA |
| A | 2017 | total_revenue | 100 |
| A | 2017 | overseas_revenue | 20 |
| A | 2017 | domestic_revenue | 80 |
| B | 2016 | total_revenue | 90 |
| B | 2016 | overseas_revenue | NA |
| B | 2016 | domestic_revenue | 60 |
| B | 2017 | total_revenue | 90 |
| B | 2017 | overseas_revenue | NA |
| B | 2017 | domestic_revenue | NA |
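I want to group by comp_name and year and apply the following rule to each group: if the values of both overseas_revenue and domestic_revenue are NA, set the value of domestic_revenue equal to the value of total_revenue; otherwise do nothing.
The resulting df should look like this:
| comp_name | year | indicator | value |
|---|---|---|---|
| A | 2016 | total_revenue | 100 |
| A | 2016 | overseas_revenue | NA |
| A | 2016 | domestic_revenue | 100 |
| A | 2017 | total_revenue | 100 |
| A | 2017 | overseas_revenue | 20 |
| A | 2017 | domestic_revenue | 80 |
| B | 2016 | total_revenue | 90 |
| B | 2016 | overseas_revenue | NA |
| B | 2016 | domestic_revenue | 60 |
| B | 2017 | total_revenue | 90 |
| B | 2017 | overseas_revenue | NA |
| B | 2017 | domestic_revenue | 90 |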
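My actual dataset has 500k rows and 12 different indicators, but I have not been able to find a workable approach. Any help would be greatly appreciated. Thanks!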
Reply from a uj5u.com user:
You can do this with two pivots:
library(dplyr)
library(tidyr)
df <- data.frame(comp_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
year = c("2016", "2016", "2016", "2017","2017", "2017", "2016","2016", "2016", "2017", "2017", "2017"),
indicator = c("total_revenue", "overseas_revenue", "domestic_revenue", "total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue"),
value = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA))
df %>%
pivot_wider(names_from="indicator",
values_from = "value") %>%
mutate(domestic_revenue = case_when(
is.na(overseas_revenue) & is.na(domestic_revenue) ~ total_revenue,
TRUE ~ domestic_revenue)) %>%
pivot_longer(-c(comp_name, year),
names_to = "indicator",
values_to = "value")
#> # A tibble: 12 × 4
#> comp_name year indicator value
#> <chr> <chr> <chr> <dbl>
#> 1 A 2016 total_revenue 100
#> 2 A 2016 overseas_revenue NA
#> 3 A 2016 domestic_revenue 100
#> 4 A 2017 total_revenue 100
#> 5 A 2017 overseas_revenue 20
#> 6 A 2017 domestic_revenue 80
#> 7 B 2016 total_revenue 90
#> 8 B 2016 overseas_revenue NA
#> 9 B 2016 domestic_revenue 60
#> 10 B 2017 total_revenue 90
#> 11 B 2017 overseas_revenue NA
#> 12 B 2017 domestic_revenue 90
Created on 2022-04-28 by the reprex package (v2.0.1)
Reply from a uj5u.com user:
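library(tidyverse)  # loads dplyr and tidyr

# spread()/gather() are superseded by pivot_wider()/pivot_longer() but still work
df %>%
  spread(indicator, value) %>%  # long to wide: one column per indicator
  mutate(domestic_revenue = case_when(
    is.na(domestic_revenue) & is.na(overseas_revenue) ~ total_revenue,
    TRUE ~ domestic_revenue
  )) %>%
  gather(c(-comp_name, -year), key = indicator, value = value) %>%  # wide back to long
  arrange(comp_name, year)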
# A tibble: 12 x 4
comp_name year indicator value
<chr> <chr> <chr> <dbl>
1 A 2016 domestic_revenue 100
2 A 2016 overseas_revenue NA
3 A 2016 total_revenue 100
4 A 2017 domestic_revenue 80
5 A 2017 overseas_revenue 20
6 A 2017 total_revenue 100
7 B 2016 domestic_revenue 60
8 B 2016 overseas_revenue NA
9 B 2016 total_revenue 90
10 B 2017 domestic_revenue 90
11 B 2017 overseas_revenue NA
12 B 2017 total_revenue 90
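Reply from a uj5u.com user:
I think it is better to create a very simple function that handles the adjustment you want, and to apply that function by group. This will be much faster than pivoting.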
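f <- function(i, v) {
  # i: indicator names of one group, v: the matching values
  # "^(o|d)" matches every indicator that starts with "o" or "d"
  # (here: overseas_revenue and domestic_revenue)
  if (all(is.na(v[grepl("^(o|d)", i)]))) v[i == "domestic_revenue"] <- v[i == "total_revenue"]
  return(v)
}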
Using data.table (this will be fast):
library(data.table)
setDT(df)[, value := f(indicator, value), by = .(comp_name, year)]
Using dplyr (slower, but still faster than pivoting):
df %>%
group_by(comp_name,year) %>%
mutate(value=f(indicator,value))
Output:
comp_name year indicator value
<char> <char> <char> <num>
1: A 2016 total_revenue 100
2: A 2016 overseas_revenue NA
3: A 2016 domestic_revenue 100
4: A 2017 total_revenue 100
5: A 2017 overseas_revenue 20
6: A 2017 domestic_revenue 80
7: B 2016 total_revenue 90
8: B 2016 overseas_revenue NA
9: B 2016 domestic_revenue 60
10: B 2017 total_revenue 90
11: B 2017 overseas_revenue NA
12: B 2017 domestic_revenue 90
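One thing to watch with the helper f above: its pattern "^(o|d)" matches any indicator beginning with "o" or "d", and the question mentions 12 different indicators in the real data. The following is a minimal sketch, not part of the original answer, of a variant that compares indicator names exactly. The name fill_domestic and the base R split loop are assumptions made for illustration; the loop is used only so the example runs without extra packages.

```r
# Hypothetical variant of f(): matches indicator names exactly instead of by regex.
fill_domestic <- function(indicator, value) {
  is_overseas <- indicator == "overseas_revenue"
  is_domestic <- indicator == "domestic_revenue"
  is_total    <- indicator == "total_revenue"
  # Act only when the group has exactly one total row and
  # every overseas/domestic value in the group is NA.
  if (sum(is_total) == 1 && all(is.na(value[is_overseas | is_domestic]))) {
    value[is_domestic] <- value[is_total]
  }
  value
}

# Same sample data as in the question.
df <- data.frame(
  comp_name = rep(c("A", "B"), each = 6),
  year      = rep(rep(c("2016", "2017"), each = 3), times = 2),
  indicator = rep(c("total_revenue", "overseas_revenue", "domestic_revenue"), times = 4),
  value     = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA)
)

# Base R grouping: row indices for each comp_name/year combination.
groups <- split(seq_len(nrow(df)), interaction(df$comp_name, df$year, drop = TRUE))
for (rows in groups) {
  df$value[rows] <- fill_domestic(df$indicator[rows], df$value[rows])
}
df
```

The function takes the same two arguments as f, so it should also slot into the data.table and dplyr calls shown above in place of f.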
