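I have a data frame of search terms and their corresponding categories, and I want to use it (via partial string matching) to assign a new string variable in a large data frame (about 130 variables and 30,000 rows).
Unfortunately I can't post the actual data, but I've created some sample code below. The large data frame is MainDF and the categorisation data frame is CatDF. The aim is to create the column Category in MainDF based on the data in CatDF.
So far I've come up with this solution, but it is not efficient: a single search term takes 32 seconds on my real data, which is too long because I need to run more than 300 search terms: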
library(dplyr)
MainDF <- data.frame(col1 = c("Tables", "Chairs", "Computer monitors", "Lounge suite", "Computer Monitors", "Deck chairs", "Office chairs", "TV monitors", "Side tables"),
                     col2 = c("Wooden table", "Plastic chair", "LG monitor", "Couch", "Samsung screen", "Plastic chair", "Ergonomic chair", "LG monitor G234", "Wooden table"))
CatDF <- data.frame(SearchTerm = c("Chair", "Monitor", "Screen", "Table", "TV"),
                    NewCategory = c("Tables/Chairs", "Screens/Monitor", "Screens/Monitor", "Tables/Chairs", "Screens/Monitor"))
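MainDF$Category <- NA
for (i in 1:nrow(CatDF)) {
  # Locate every (row, col) cell in columns 1-2 that matches the current search term
  a <- transform(
    as.data.frame(
      which(matrix(grepl(CatDF$SearchTerm[i], as.matrix(MainDF[, c(1:2)]), ignore.case = TRUE), nrow = nrow(MainDF)),
            arr.ind = TRUE
      )))
  # Keep each matching row once, then assign this term's category to those rows
  a <- a %>% distinct(row)
  MainDF[a$row, "Category"] <- CatDF$NewCategory[i]
}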
Is there a more efficient solution? I know loops are generally inefficient, but I can't think of another way to do this.
Thanks!
Answer from a uj5u.com user:
Here is an example using the fuzzyjoin package and its regex_left_join function:
library(fuzzyjoin)
MainDF <- data.frame(col1 = c("Tables", "Chairs", "Computer monitors",
                              "Lounge suite", "Computer Monitors",
                              "Deck chairs", "Office chairs",
                              "TV monitors", "Side tables"),
                     col2 = c("Wooden table", "Plastic chair", "LG monitor",
                              "Couch", "Samsung screen", "Plastic chair",
                              "Ergonomic chair", "LG monitor G234",
                              "Wooden table"))
CatDF <- data.frame(SearchTerm = c("Chair", "Monitor", "Screen", "Table", "TV"),
                    NewCategory = c("Tables/Chairs", "Screens/Monitor",
                                    "Screens/Monitor", "Tables/Chairs",
                                    "Screens/Monitor"))
regex_left_join(MainDF, CatDF, by = c(col1 = "SearchTerm"), ignore_case=TRUE)
This works for almost all of the examples; for the one exception, it should be enough to add an entry to CatDF.
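The join above matches on col1 only, whereas the loop in the question searches both col1 and col2. As a minimal base R sketch (not part of the original answer), each search term can be tested against each column directly with grepl, which avoids rebuilding a matrix on every iteration. The variable `cols` and the `stringsAsFactors = FALSE` arguments are assumptions added for illustration. Like the loop in the question, a later search term overwrites an earlier one when both match the same row.

```r
# Sample data from the question (stringsAsFactors = FALSE is an added assumption)
MainDF <- data.frame(
  col1 = c("Tables", "Chairs", "Computer monitors", "Lounge suite",
           "Computer Monitors", "Deck chairs", "Office chairs",
           "TV monitors", "Side tables"),
  col2 = c("Wooden table", "Plastic chair", "LG monitor", "Couch",
           "Samsung screen", "Plastic chair", "Ergonomic chair",
           "LG monitor G234", "Wooden table"),
  stringsAsFactors = FALSE
)
CatDF <- data.frame(
  SearchTerm  = c("Chair", "Monitor", "Screen", "Table", "TV"),
  NewCategory = c("Tables/Chairs", "Screens/Monitor", "Screens/Monitor",
                  "Tables/Chairs", "Screens/Monitor"),
  stringsAsFactors = FALSE
)

# Columns to search (hypothetical helper variable)
cols <- c("col1", "col2")

MainDF$Category <- NA_character_
for (i in seq_len(nrow(CatDF))) {
  # TRUE for rows where the term is found in at least one of the searched columns
  hit <- Reduce(`|`, lapply(MainDF[cols], function(x) {
    grepl(CatDF$SearchTerm[i], x, ignore.case = TRUE)
  }))
  MainDF$Category[hit] <- CatDF$NewCategory[i]
}
```

On the sample data, Category stays NA only for the "Lounge suite" / "Couch" row, because no search term matches either column.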
Answer from a uj5u.com user:
Here is another solution based on fuzzyjoin:
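library(dplyr)      # provides %>%, mutate() and across()
library(fuzzyjoin)
library(stringr)
# Lower-case both data frames so that str_detect matches case-insensitively
fuzzy_join(
  MainDF %>%
    mutate(across(everything(), ~tolower(.))),
  CatDF %>%
    mutate(across(everything(), ~tolower(.))),
  by = c("col1" = "SearchTerm"),
  match_fun = str_detect,
  mode = "left"
)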
                col1            col2 SearchTerm     NewCategory
1             tables    wooden table      table   tables/chairs
2             chairs   plastic chair      chair   tables/chairs
3  computer monitors      lg monitor    monitor screens/monitor
4       lounge suite           couch       <NA>            <NA>
5  computer monitors  samsung screen    monitor screens/monitor
6        deck chairs   plastic chair      chair   tables/chairs
7      office chairs ergonomic chair      chair   tables/chairs
8        tv monitors lg monitor g234    monitor screens/monitor
9        tv monitors lg monitor g234         tv screens/monitor
10       side tables    wooden table      table   tables/chairs
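In the output above, the "tv monitors" row appears twice because it matches both monitor and tv. If one category per MainDF row is wanted, one possible base R sketch (an illustration, not the answerer's code) builds a logical match matrix with one column per search term and keeps the first matching term for each row. The names `hits` and `first_hit` are hypothetical, and "first match wins" is an assumed tie-breaking rule. Like the join, it matches on col1 only.

```r
# Sample data from the question (stringsAsFactors = FALSE is an added assumption)
MainDF <- data.frame(
  col1 = c("Tables", "Chairs", "Computer monitors", "Lounge suite",
           "Computer Monitors", "Deck chairs", "Office chairs",
           "TV monitors", "Side tables"),
  col2 = c("Wooden table", "Plastic chair", "LG monitor", "Couch",
           "Samsung screen", "Plastic chair", "Ergonomic chair",
           "LG monitor G234", "Wooden table"),
  stringsAsFactors = FALSE
)
CatDF <- data.frame(
  SearchTerm  = c("Chair", "Monitor", "Screen", "Table", "TV"),
  NewCategory = c("Tables/Chairs", "Screens/Monitor", "Screens/Monitor",
                  "Tables/Chairs", "Screens/Monitor"),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per MainDF row, one column per search term
hits <- vapply(
  CatDF$SearchTerm,
  function(term) grepl(term, MainDF$col1, ignore.case = TRUE),
  logical(nrow(MainDF))
)

# Index of the first matching search term in each row (NA when nothing matches)
first_hit <- apply(hits, 1, function(r) which(r)[1])

# Look up the category for that term; unmatched rows stay NA
MainDF$Category <- CatDF$NewCategory[first_hit]
```

This keeps MainDF at its original nine rows, so the result can be used directly as the new Category column.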
