I'm trying to use some data to predict music scores. One of the columns is genre, and it looks like this:
Genre column
c("['rock-and-roll', 'space age pop', 'surf music']", "['dance pop', 'pop', 'post-teen pop']",
"['pop', 'post-teen pop']", "['country', 'country dawn', 'nashville sound']",
"['australian country', 'contemporary country', 'country', 'country road']",
"['blues rock', 'garage rock', 'modern blues rock', 'neo-psychedelic', 'nu gaze', 'punk blues']",
"['pop', 'post-teen pop']", "['adult standards', 'brill building pop', 'folk', 'folk rock', 'mellow gold', 'singer-songwriter', 'soft rock', 'yacht rock']",
"['adult standards', 'brill building pop', 'bubblegum pop', 'folk rock', 'lounge', 'mellow gold', 'rock-and-roll', 'rockabilly', 'sunshine pop']",
"['adult standards', 'brill building pop', 'canadian pop', 'easy listening', 'lounge', 'rock-and-roll']",
"[]", "['boston rock', 'dance rock', 'new romantic', 'new wave', 'new wave pop']",
"['classic soul']", "['classic country pop', 'country', 'nashville sound', 'outlaw country', 'singer-songwriter', 'texas country']",
"['adult standards', 'brill building pop', 'bubblegum pop', 'doo-wop', 'rock-and-roll', 'rockabilly']",
"['brill building pop', 'doo-wop', 'rhythm and blues']", "[]",
"['album rock', 'art rock', 'blues rock', 'classic rock', 'hard rock', 'metal', 'psychedelic rock', 'rock', 'soft rock']",
"['blues', 'blues rock', 'classic rock', 'electric blues', 'folk rock', 'funk', 'jazz blues', 'louisiana blues', 'new orleans blues', 'piano blues', 'psychedelic rock', 'roots rock', 'soul']",
"['album rock', 'canadian pop', 'canadian singer-songwriter', 'classic canadian rock', 'heartland rock', 'mellow gold', 'rock', 'soft rock']",
"['art rock', 'dance rock', 'new romantic', 'new wave', 'new wave pop', 'permanent wave', 'rock', 'synthpop']",
"['album rock', 'blues rock', 'classic rock', 'country rock', 'hard rock', 'mellow gold', 'rock', 'soft rock', 'southern rock']",
"['adult standards', 'brill building pop', 'easy listening', 'lounge', 'rock-and-roll', 'rockabilly']",
"['christmas instrumental']", "['adult standards', 'brill building pop', 'bubblegum pop', 'classic country pop', 'country rock', 'folk', 'folk rock', 'mellow gold', 'soft rock']",
"['adult standards', 'brill building pop', 'chicago soul', 'classic soul', 'motown', 'quiet storm', 'rhythm and blues', 'rock-and-roll', 'rockabilly', 'soul']")
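I want to use it as a factor variable (or as dummy variables) for prediction. How can I extract the genre names from each list and convert them into dummy variable columns?
This is what happens when I convert genre to dummy columns:
| 'adult standards', 'brill building pop', 'easy listening', 'mellow gold' | 'dance pop', 'pop', 'post-teen pop' |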
|---|---|
| 1 | 0 |
| 0 | 1 |
What I want:
| adult standards | brill building pop |
|---|---|
| 1 | 1 |
| 0 | 0 |
Answer:
A tidyverse solution
This centers on tidyr::separate_rows() and tidyr::pivot_wider():
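library(tidyr)
library(dplyr)
library(stringr)
gdata <- gdata %>%
  mutate(
    id = row_number(),
    # strip brackets and quotes; rows with no genres ("[]") become NA
    genre = na_if(str_remove_all(genre, "\\[|\\]|'"), ""),
    value = 1
  ) %>%
  # one row per (id, genre) pair
  separate_rows(genre, sep = ", ") %>%
  # one column per genre, filled with 0 where the genre is absent
  pivot_wider(names_from = genre, values_fill = 0) %>%
  # drop the column created by the rows that had no genres
  select(!`NA`)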
gdata
# A tibble: 26 × 71
id rock-and-r…1 space…2 surf …3 dance…4 pop post-…5 country count…6 nashv…7
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 0 0 0 0 0 0
2 2 0 0 0 1 1 1 0 0 0
3 3 0 0 0 0 1 1 0 0 0
4 4 0 0 0 0 0 0 1 1 1
5 5 0 0 0 0 0 0 1 0 0
6 6 0 0 0 0 0 0 0 0 0
7 7 0 0 0 0 1 1 0 0 0
8 8 0 0 0 0 0 0 0 0 0
9 9 1 0 0 0 0 0 0 0 0
10 10 1 0 0 0 0 0 0 0 0
# … with 16 more rows, 61 more variables: `australian country` <dbl>,
# `contemporary country` <dbl>, `country road` <dbl>, `blues rock` <dbl>,
# `garage rock` <dbl>, `modern blues rock` <dbl>, `neo-psychedelic` <dbl>,
# `nu gaze` <dbl>, `punk blues` <dbl>, `adult standards` <dbl>,
# `brill building pop` <dbl>, folk <dbl>, `folk rock` <dbl>,
# `mellow gold` <dbl>, `singer-songwriter` <dbl>, `soft rock` <dbl>,
# `yacht rock` <dbl>, `bubblegum pop` <dbl>, lounge <dbl>, rockabilly <dbl>, …
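To try the pipeline without the full data set, here is a minimal sketch on a made-up three-row sample. The name gdata_small and its contents are assumptions for illustration only, not part of the original data. It also swaps the final step for select(!any_of("NA")), which tolerates the case where no row is empty and therefore no `NA` column exists.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical three-row sample in the same format as the question's column
gdata_small <- tibble(
  genre = c(
    "['pop', 'post-teen pop']",
    "[]",
    "['country', 'pop']"
  )
)

dummies <- gdata_small %>%
  mutate(
    id = row_number(),
    # strip brackets and quotes; rows with no genres become NA
    genre = na_if(str_remove_all(genre, "\\[|\\]|'"), ""),
    value = 1
  ) %>%
  separate_rows(genre, sep = ", ") %>%
  pivot_wider(names_from = genre, values_from = value, values_fill = 0) %>%
  # any_of() does not complain if the `NA` column was never created
  select(!any_of("NA"))

dummies
```

The row with no genres is kept and simply has 0 in every genre column.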
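A base R solution
- For each value, remove the extraneous characters and split on commas. This gives you a list with one character vector per row.
- Use unique(unlist()) to get a vector of all unique genres.
- Loop over the unique genres; for each one, add a column to your data frame that tests whether that genre appears in each row. If you prefer 0s and 1s, you can wrap the test in as.integer() here.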
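# Strip brackets and quotes, then split each row into a character vector of genres
genre_list <- sapply(
  gdata$genre,
  \(x) strsplit(gsub("\\[|\\]|'", "", x), ", ")
)
all_genres <- unique(unlist(genre_list))
# One logical column per genre: TRUE where the row lists that genre
for (g in all_genres) {
  gdata[[g]] <- sapply(genre_list, \(x) g %in% x)
}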
gdata[1:10, 2:8]
rock-and-roll space age pop surf music dance pop pop post-teen pop country
1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE TRUE TRUE TRUE FALSE
3 FALSE FALSE FALSE FALSE TRUE TRUE FALSE
4 FALSE FALSE FALSE FALSE FALSE FALSE TRUE
5 FALSE FALSE FALSE FALSE FALSE FALSE TRUE
6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
7 FALSE FALSE FALSE FALSE TRUE TRUE FALSE
8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
9 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
10 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
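If 0/1 columns are preferred over TRUE/FALSE, the as.integer() variant mentioned above might look like the sketch below. The sample data frame gdata_small is made up for illustration. This version also calls strsplit() on the whole column at once (it is vectorised) and uses vapply() instead of sapply(), which is a stylistic substitution rather than part of the original answer.

```r
# Hypothetical three-row sample in the same format as the question's column
gdata_small <- data.frame(
  genre = c("['pop', 'post-teen pop']", "[]", "['country', 'pop']")
)

# Strip brackets and quotes, then split every row in one vectorised call
genre_list <- strsplit(gsub("\\[|\\]|'", "", gdata_small$genre), ", ")
all_genres <- unique(unlist(genre_list))

for (g in all_genres) {
  # vapply() returns exactly one logical per row;
  # as.integer() converts TRUE/FALSE to 1/0
  gdata_small[[g]] <- as.integer(vapply(genre_list, \(x) g %in% x, logical(1)))
}

gdata_small
```

Rows whose genre is "[]" split to an empty vector, so they get 0 in every genre column.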
