Merging rows when the same word appears in another column

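I want to merge rows that share the same word in another column. The solution should use base R. The table entries are comma-separated strings (character), not lists. So, as shown below, the color shades that belong to the same color should be combined into a single string in one row rather than spread across several rows. In addition, the color shades column should contain no duplicates.

I have already tried: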

   aggregate(df["Color shades"], df["Color"], paste, collapse=", ")

and:

   aggregate(`Color shades` ~ Color, df, toString)

But neither produced the desired result.

Data frame:

    df <- data.frame(colorshades = c("turquoise, babyblue", "royal blue, true blue", 
                         "navy blue, true blue"), colors = c("blue", "blue", "blue"))

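Current:

Color shades	Color
turquoise, babyblue	blue
royal blue, true blue	blue
navy blue, true blue	blue

Desired output:

Color shades	Color
turquoise, babyblue, royal blue, true blue, navy blue	blue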

Reply from a uj5u.com user:

Convert "Color shades" to a list column:

lapply(strsplit(df[["Color shades"]], ","), trimws)
# [[1]]
# [1] "turquoise" "babyblue" 
# [[2]]
# [1] "royal blue" "true blue" 
# [[3]]
# [1] "navy blue" "true blue"
df[["Color shades"]] <- lapply(strsplit(df[["Color shades"]], ","), trimws)
df
#            Color shades Color
# 1   turquoise, babyblue  blue
# 2 royal blue, true blue  blue
# 3  navy blue, true blue  blue

Aggregate with unique:

aggregate(df["Color shades"], df["Color"], function(z) paste(unique(unlist(z)), collapse=", "))
#   Color                                          Color shades
# 1  blue turquoise, babyblue, royal blue, true blue, navy blue
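If the column should stay a plain character column, the splitting and de-duplication can also happen inside the aggregating function, which is one way to get the requested output in base R without first creating a list column. The following is only a sketch: it uses the column names from the question's data frame (colorshades, colors), the helper name merge_shades is hypothetical, and the "red" rows are made-up data added to show that grouping works across more than one color.

```r
# Illustrative data: the "blue" rows come from the question,
# the "red" rows are invented for this example.
df <- data.frame(
  colorshades = c("turquoise, babyblue", "royal blue, true blue",
                  "navy blue, true blue", "crimson, scarlet", "scarlet, ruby"),
  colors = c("blue", "blue", "blue", "red", "red"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on commas, trim whitespace, drop duplicates,
# and glue the remaining shades back into one string.
merge_shades <- function(z) {
  paste(unique(trimws(unlist(strsplit(z, ",")))), collapse = ", ")
}

# One row per color, with a de-duplicated comma-separated string of shades.
merged <- aggregate(df["colorshades"], df["colors"], merge_shades)
print(merged)
```

Because unique() keeps the first occurrence of each value, the shades stay in the order in which they first appear in the data.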

Or, to stay consistent with the list-column approach,

aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z))))
#   Color                                          Color shades
# 1  blue turquoise, babyblue, royal blue, true blue, navy blue
str(aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z)))))
# 'data.frame': 1 obs. of  2 variables:
#  $ Color       : chr "blue"
#  $ Color shades:List of 1
#   ..$ : chr  "turquoise" "babyblue" "royal blue" "true blue" ...
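Since the question asks for a comma-separated string rather than a list, a list column like the one above can be collapsed back into character values when needed. A minimal sketch, assuming a list of character vectors shaped like that column (the variable names shades and shades_chr are illustrative only):

```r
# A list of character vectors, shaped like the aggregated list column.
shades <- list(
  c("turquoise", "babyblue", "royal blue", "true blue", "navy blue")
)

# toString() joins a vector with ", "; vapply() with character(1)
# guarantees exactly one string per list element.
shades_chr <- vapply(shades, toString, character(1))
print(shades_chr)
```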

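Working with list columns instead of comma-separated values is often (but not always) advantageous. If your use case means you frequently need to look at individual elements within one of these fields, you will otherwise find yourself deep in regular expressions and/or repeatedly calling strsplit on the delimiter. With list columns, you can use tools such as unique and %in% freely (though admittedly you need to be comfortable with lapply/sapply, and many base-R aggregation tools do not always work consistently with list columns).

Data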
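To illustrate the trade-off, here is a small sketch that compares an exact membership test on a list column with a naive pattern match on the raw strings. The variable names are illustrative, and the data are the three strings from the question.

```r
shades_chr <- c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue")
shades_list <- lapply(strsplit(shades_chr, ","), trimws)

# Exact membership test on the list column: is "true blue" one of the shades?
has_true_blue <- vapply(shades_list, function(s) "true blue" %in% s, logical(1))
print(has_true_blue)  # FALSE TRUE TRUE

# A naive pattern match on the raw strings also hits partial words,
# e.g. "blue" inside "babyblue" or "royal blue".
naive_blue <- grepl("blue", shades_chr)
print(naive_blue)     # TRUE TRUE TRUE

# The list column only matches a shade that is exactly "blue".
exact_blue <- vapply(shades_list, function(s) "blue" %in% s, logical(1))
print(exact_blue)     # FALSE FALSE FALSE
```

Getting the same exact match from the strings would need a more careful regular expression that accounts for the separators, which is the kind of extra work a list column avoids.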

df <- structure(list(`Color shades` = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"), Color = c("blue", "blue", "blue")), class = "data.frame", row.names = c(NA, -3L))

Reply from a uj5u.com user:

If you can use the "dplyr" library, you could also do it like this:

library(dplyr)
library(stringr)  # str_split() comes from stringr, not dplyr

df <- data.frame("Colorshade" = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"),
             "Color" = c(rep("blue", 3)),
             stringsAsFactors = FALSE)

# Split each group's strings, sort and de-duplicate the shades,
# then collapse them into one row per Color.
my_df <- df %>%
  group_by(Color) %>%
  summarise(Colorshade = paste(unique(sort(unlist(str_split(Colorshade, pattern = ", ")))), collapse = ", "))

Reply from a uj5u.com user:

data.table solution

library(data.table)
setDT(df)[, .(Color_shades = paste0(unique(unlist(strsplit(colorshades, ", "))), 
                                    collapse = ", ")), 
          by = .(colors)]
#    colors                                          Color_shades
# 1:   blue turquoise, babyblue, royal blue, true blue, navy blue

Reply from a uj5u.com user:

You can also use tidytext with unnest_tokens:

library(dplyr)
library(tidytext)

color_df <- tibble(color= rep("blue", times = 3),
                       color_shades = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"))

color_shades_agg <- color_df %>% 
  unnest_tokens(word, color_shades, token = 'regex', pattern=", ") %>% 
  group_by(color) %>%
  distinct() %>% 
  summarise(color_shades = paste0(sort(word), collapse = ", "))


Tags: r
