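I'm trying to do some analysis on the Online News Popularity dataset from the UCI open data repository: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
The dataset has a set of 7 boolean attributes indicating the day of the week on which an article was published. For example, if the article was published on a Monday, the column weekday_is_monday has the value 1, and so on. For my analysis, I'm trying to merge these fields into a single field containing the publishing day as a string literal.
So I load the dataset and then replace each true value with a string literal: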
news <- read.csv("path_to_my_dataset",
                 header = TRUE,
                 sep = ",",
                 fill = FALSE,
                 strip.white = TRUE,
                 stringsAsFactors = FALSE)
news$weekday_is_monday <- gsub('^1', 'Monday', news$weekday_is_monday)
news$weekday_is_tuesday <- gsub('^1', 'Tuesday', news$weekday_is_tuesday)
news$weekday_is_wednesday <- gsub('^1', 'Wednesday', news$weekday_is_wednesday)
news$weekday_is_thursday <- gsub('^1', 'Thusday', news$weekday_is_thursday)
news$weekday_is_friday <- gsub('^1', 'Friday', news$weekday_is_friday)
news$weekday_is_saturday <- gsub('^1', 'Saturday', news$weekday_is_saturday)
news$weekday_is_sunday <- gsub('^1', 'Sunday', news$weekday_is_sunday)
Next, I found a solution in this thread that uses the dplyr::coalesce function to merge all the fields. I adapted it to my dataset as follows:
news <- news %>% mutate_at(vars(starts_with("weekday_is")), funs(na_if(.,"0"))) %>%
mutate(news, publishing_day = coalesce(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday,
weekday_is_friday, weekday_is_saturday, weekday_is_sunday))
news$publishing_day <- as.factor(news$publishing_day)
summary(news$publishing_day)
However, this only picks up the values from the first column (i.e. Monday):
0 Monday
32983 6661
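Where am I going wrong?
Answer:
In a pipe, don't pass the left-hand data again as an argument to mutate on the right-hand side; the pipe already supplies it as the first argument. That is the cause of your problem. You just need to remove news from mutate(news, publishing_day = ...):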
news <- news %>% mutate_at(vars(starts_with("weekday_is")), funs(na_if(.,"0"))) %>%
mutate(publishing_day = coalesce(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday,
weekday_is_friday, weekday_is_saturday, weekday_is_sunday))
news$publishing_day <- as.factor(news$publishing_day)
summary(news$publishing_day)
# Friday Monday Saturday Sunday Thusday Tuesday Wednesday
# 5701 6661 2453 2737 7267 7390 7435
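To see the fix in isolation, here is a minimal sketch on a made-up three-row data frame. The name toy and its values are assumptions for illustration, not rows from the dataset. The sketch also swaps mutate_at() and funs() for across(), which newer dplyr releases recommend instead; this assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical toy data: each row carries one weekday label, the rest are "0"
toy <- data.frame(
  weekday_is_monday  = c("Monday", "0", "0"),
  weekday_is_tuesday = c("0", "Tuesday", "0"),
  weekday_is_sunday  = c("0", "0", "Sunday"),
  stringsAsFactors = FALSE
)

toy_merged <- toy %>%
  # Turn every "0" into NA so that coalesce() can skip over it
  mutate(across(starts_with("weekday_is"), ~ na_if(.x, "0"))) %>%
  # No data argument here: the pipe already supplies it
  mutate(publishing_day = coalesce(weekday_is_monday,
                                   weekday_is_tuesday,
                                   weekday_is_sunday))

toy_merged$publishing_day
```

coalesce() returns, row by row, the first value that is not NA, which is why the "0" entries have to become NA before the columns are merged.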
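Answer:
Here is a technique that reshapes the data with pivot_longer, then uses gsub to strip the weekday_is_ prefix from the column names, and then joins the result back onto the original data.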
quux <- read.csv("OnlineNewsPopularity.csv") # from your link
library(dplyr)
library(tidyr) # pivot_longer
quux2 <- quux %>%
  select(url, starts_with("weekday_is")) %>%
  pivot_longer(-url) %>%
  dplyr::filter(value > 0) %>%
  mutate(weekday = gsub("weekday_is_", "", name)) %>%
  left_join(quux, by = "url") %>%
  select(-name, -starts_with("weekday_is_"))
quux2
# # A tibble: 39,644 x 56
# url value weekday timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_uniq~ num_hrefs
# <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 http://mash~ 1 monday 731 12 219 0.664 1.00 0.815 4
# 2 http://mash~ 1 monday 731 9 255 0.605 1.00 0.792 3
# 3 http://mash~ 1 monday 731 9 211 0.575 1.00 0.664 3
# 4 http://mash~ 1 monday 731 9 531 0.504 1.00 0.666 9
# 5 http://mash~ 1 monday 731 13 1072 0.416 1.00 0.541 19
# 6 http://mash~ 1 monday 731 10 370 0.560 1.00 0.698 2
# 7 http://mash~ 1 monday 731 8 960 0.418 1.00 0.550 21
# 8 http://mash~ 1 monday 731 12 989 0.434 1.00 0.572 20
# 9 http://mash~ 1 monday 731 11 97 0.670 1.00 0.837 2
# 10 http://mash~ 1 monday 731 10 231 0.636 1.00 0.797 4
# # ... with 39,634 more rows, and 46 more variables: num_self_hrefs <dbl>, num_imgs <dbl>, num_videos <dbl>,
# # average_token_length <dbl>, num_keywords <dbl>, data_channel_is_lifestyle <dbl>, data_channel_is_entertainment <dbl>,
# # data_channel_is_bus <dbl>, data_channel_is_socmed <dbl>, data_channel_is_tech <dbl>, data_channel_is_world <dbl>,
# # kw_min_min <dbl>, kw_max_min <dbl>, kw_avg_min <dbl>, kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>, kw_min_avg <dbl>,
# # kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>, self_reference_max_shares <dbl>,
# # self_reference_avg_sharess <dbl>, is_weekend <dbl>, LDA_00 <dbl>, LDA_01 <dbl>, LDA_02 <dbl>, LDA_03 <dbl>, LDA_04 <dbl>,
# # global_subjectivity <dbl>, global_sentiment_polarity <dbl>, global_rate_positive_words <dbl>, ...
Verifying the contents:
table(quux2$weekday)
# friday monday saturday sunday thursday tuesday wednesday
# 5701 6661 2453 2737 7267 7390 7435
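The same reshape can be tried on a small made-up data frame before running it on the full dataset. This is only a sketch: toy, toy_days, toy_out and the shares values are hypothetical, and the join direction is a variation on the code above (the day labels are joined back onto the data, so the helper value column is not carried along).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one flagged weekday per row
toy <- data.frame(
  url                = c("a", "b", "c"),
  shares             = c(10, 20, 30),
  weekday_is_monday  = c(1, 0, 0),
  weekday_is_tuesday = c(0, 1, 0),
  weekday_is_sunday  = c(0, 0, 1),
  stringsAsFactors = FALSE
)

toy_days <- toy %>%
  select(url, starts_with("weekday_is")) %>%
  pivot_longer(-url) %>%                            # one row per (url, weekday column)
  dplyr::filter(value > 0) %>%                      # keep only the flagged day
  mutate(weekday = gsub("weekday_is_", "", name)) %>%
  select(url, weekday)

# Drop the indicator columns and attach the single weekday label per url
toy_out <- toy %>%
  select(-starts_with("weekday_is_")) %>%
  left_join(toy_days, by = "url")

toy_out
```

This relies on url being unique per row; if it were not, the join would duplicate rows.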
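If you intend to arrange by weekday, consider converting it to a factor with explicit levels; otherwise the days are sorted lexicographically (as in the table() output above).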
... %>%
mutate(weekday = factor(weekday, levels = c("monday", "tuesday", "wednesday", ..., "sunday")))
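As a small sketch of that ordering step, the snippet below spells out one possible level vector on a made-up input. The names weekday, day_levels and weekday_f are hypothetical, and the Monday-first order is an assumption; use whichever calendar order suits your analysis.

```r
# Hypothetical vector of lower-case day names, as produced by the gsub() step
weekday <- c("sunday", "monday", "friday", "monday")

# Assumed calendar order, Monday first
day_levels <- c("monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday")

weekday_f <- factor(weekday, levels = day_levels)

# table() and sort() now follow the level order rather than alphabetical order
table(weekday_f)
sort(weekday_f)
```

Days that never occur in the data still appear in table() with a count of zero, because they are declared as levels.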
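FYI, the only reason I assigned the pipeline on quux to a new variable quux2 is so that you don't inadvertently and irreversibly overwrite your main dataset while you are testing and evaluating this. Feel free to assign the result back onto quux itself.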
