Writing a variable case_when function in R, where the number of cases to check varies

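I'm new to R, so please bear with me. I've spent several hours trying to figure this out, so I think it's time to ask for help.

I'm currently cleaning part of a dataset, which requires grouping the answers in one column into a new column.

For example, when "france" is matched in the "country" column, "europe" is added to a new "continent" column.

Basically like this:

Reproducible code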

library(dplyr)  # provides %>%, mutate(), across() and case_when()

# a handful of countries to sort
df = data.frame(country = c('france','england','usa','poland','brazil','kenya','canada','england', 'usa', 'france'))

# simplified vectors for each continent
europe <- c('england','france','poland')
north_america <- c('usa','canada')
south_america <- c('brazil')
africa <- c('kenya')

# the grouping
df_updated <- df %>%
  mutate(across(country, ~ case_when(. %in% europe ~ 'europe',
                                   . %in% north_america ~ 'north america',
                                   . %in% south_america ~ 'south america',
                                   . %in% africa ~ 'africa'),.names = 'region'))

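This works fine. However, I have to do this kind of grouping for dozens of different categories across many datasets. I know that copying and pasting large amounts of code is not good practice, so I tried writing a function to do it.

So I added the following:

Function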

country_list <- list(europe, north_america, south_america, africa) # a list of the 4 region vectors
country_cat <- c('europe', 'north america', 'south america', 'africa') # a vector of corresponding labels for the categories

grouping_func <- function(dataframe, name, data, list, category) {
  dataframe %>%
    mutate(across(!!sym(data), ~ case_when(. %in% list[[1]] ~ category[1],
                                           . %in% list[[2]] ~ category[2],
                                           . %in% list[[3]] ~ category[3],
                                           . %in% list[[4]] ~ category[4]), .names = '{name}'))
}

df_updated2 <- grouping_func(df, 'continent', 'country', country_list, country_cat)

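This took some time, including realizing that I couldn't search a vector of vectors, but it works fine.

Problem

This brings me to my question. Not all of the variables I want to categorize will have the same number of categories.

For example, there are 7 continents, but only 4 US regions, or 12 time zones, or 10 fruit colors, or whatever else I need to categorize.

This means I need a way to iterate over the list and the categories based on the length of the list.

For example, if I passed the following into my function it would break, because the function is currently hard-coded to work through a list of 4 categories: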

morning <- c(0:11)
afternoon <- c(12:18)
evening <- c(19:23)
time_list <- list(morning, afternoon, evening)
time_cat <- c('morning', 'afternoon', 'evening')

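I've tried using for loops in various ways, and I've also tried to work out whether lapply might help, but I hit a wall with both. Honestly, I don't even know whether I'm particularly close. I've read everything I could find on Google and SO using every keyword I could think of, but part of me wonders whether my lack of experience means I don't even know what I should be searching for, because I've really gotten nowhere.

Can someone give me a pointer on what I'm looking for and how best to go about this? I really want to learn, but I've been at this problem for about 4 hours now and I'm no further along than when I started.

Thanks

Edit: I accepted AndS's answer. Although it's more complex than Jon Spring's answer, it lets me work directly with the code when needed, and through it I learned a whole new way of interacting with R. Both answers are great and work well. I do think that in the future I may package up a set of CSVs for each type I'll use regularly, then import and merge them. Thanks again, everyone!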

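Answer:

Instead of using case_when, you can use factor to set the levels and labels. This means you can have any number of categories and never need to hard-code them. Here is an example based on your post: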

library(tidyverse)

morning <- c(0:11)
afternoon <- c(12:18)
evening <- c(19:23)

time_list <- list(morning, afternoon, evening)
time_cat <- c('morning', 'afternoon', 'evening')


grouping_func <- function(dataframe, name, data, list, category){
  defs <- tibble(catgry = category, 
                 lsts = list) |>
    unnest_longer(lsts)
  
  mutate(dataframe, !!sym(name) := as.character(factor(!!sym(data), 
                                               levels = defs$lsts,
                                               labels = defs$catgry)))
}

# hours run 0-23 to match time_list; sample() is random, so your rows will differ
example <- tibble(hour = sample(0:23, 10, replace = TRUE))

grouping_func(example, "TOD", "hour", time_list, time_cat)
#> # A tibble: 10 x 2
#>     hour TOD      
#>    <int> <chr>    
#>  1     3 morning  
#>  2    11 morning  
#>  3    14 afternoon
#>  4     9 morning  
#>  5     3 morning  
#>  6    16 afternoon
#>  7    14 afternoon
#>  8    11 morning  
#>  9    17 afternoon
#> 10    20 evening
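
If you would rather stay with case_when, the conditions can also be generated from the list instead of being typed out. The following is only a sketch and not part of the answer above: the helper name grouping_case_when and its arguments groups and labels are hypothetical, and it assumes dplyr, purrr and rlang are installed. It builds one `column %in% values ~ label` formula per category and splices them all into a single case_when() call, so the number of categories no longer matters.

```r
library(dplyr)
library(purrr)
library(rlang)

time_list <- list(0:11, 12:18, 19:23)
time_cat <- c("morning", "afternoon", "evening")

# Hypothetical helper: `groups` is a list of value vectors and `labels` is the
# matching character vector of category names (same length as `groups`).
grouping_case_when <- function(dataframe, name, data, groups, labels) {
  stopifnot(length(groups) == length(labels))
  # Build one unevaluated formula per category,
  # e.g. .data[["hour"]] %in% (0:11) ~ ("morning")
  cases <- map2(groups, labels, function(values, label) {
    expr(.data[[!!data]] %in% (!!values) ~ (!!label))
  })
  # Splice every formula into one case_when() call
  mutate(dataframe, !!name := case_when(!!!cases))
}

hours <- tibble(hour = c(3, 14, 20, 24))
result <- grouping_case_when(hours, "TOD", "hour", time_list, time_cat)
```

Values that fall in none of the groups (24 in this sketch) come back as NA, which is the default behaviour of case_when when no condition matches.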

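Answer:

A join is a good way to do this. You just need one or more lookup tables that relate the observations in your base table to attributes held in other tables. For example: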

library(dplyr)
df_lookup <- tribble(
  ~country,   ~continent,
  "france",  "europe",
  "england", "europe",
  "usa",    "north_america",
  "kenya",   "africa"
)

df_last_letter <- tribble(
  ~country,   ~last_letter,
  "france",  "e",
  "england", "d",
  "canada",    "a",
  "kenya",   "a"
)

df %>%
  left_join(df_lookup) %>%
  left_join(df_last_letter) %>%
  mutate(across(c(continent, last_letter),
                ~ coalesce(., "Not mapped")))

Result

Joining, by = "country"
Joining, by = "country"
   country     continent last_letter
1   france        europe           e
2  england        europe           d
3      usa north_america  Not mapped
4   poland    Not mapped  Not mapped
5   brazil    Not mapped  Not mapped
6    kenya        africa           a
7   canada    Not mapped           a
8  england        europe           d
9      usa north_america  Not mapped
10  france        europe           e
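
The lookup table does not have to be typed by hand. As a rough sketch, assuming the country_list and country_cat objects from the question, the list of vectors and the label vector can be flattened into a two-column table of any length and then joined. The names lookup_from_lists and countries are made up for this illustration.

```r
library(dplyr)

europe <- c("england", "france", "poland")
north_america <- c("usa", "canada")
south_america <- c("brazil")
africa <- c("kenya")
country_list <- list(europe, north_america, south_america, africa)
country_cat <- c("europe", "north america", "south america", "africa")

# Repeat each label once per member of its group so the two columns line up
lookup_from_lists <- tibble(
  country   = unlist(country_list),
  continent = rep(country_cat, lengths(country_list))
)

countries <- data.frame(country = c("france", "kenya", "canada", "japan"))
joined <- left_join(countries, lookup_from_lists, by = "country")
```

Countries missing from the lookup ("japan" here) get NA in the new column, which can then be replaced with coalesce() as in the example above. Passing by = "country" explicitly also avoids the "Joining, by" message.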
