I need to create a column named incorrect that contains every word in written that does not appear in target.apple or target.banana.
library(dplyr)
library(stringr)
recall <- data.frame(written = c("apples car banana hat pencil r", "papeer apple cars spoon", "dice banaana pen f apple berry"))
recall <- recall %>% mutate(target.apple = str_extract(written, "app([^ ]+)"),
                            target.banana = str_extract(written, "bana([^ ]+)"))
Example of the desired output:
  written                        target.apple target.banana incorrect
1 apples car banana hat pencil r apples       banana        car hat pencil r
2 papeer apple cars spoon        apple        <NA>          papeer cars spoon
3 dice banaana pen f apple berry apple        banaana       dice pen f berry
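Thank you.
Reply from a uj5u.com user:
We can use dplyr with rowwise(). First tokenize each sentence into words, then keep only the words that do not match any of the target columns.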
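library(dplyr)
library(tokenizers)
recall %>%
  rowwise() %>%
  # split each sentence into its words
  mutate(incorrect = tokenize_words(written),
         # drop the words found in any target.* column, then join the rest into one string
         incorrect = toString(incorrect[!incorrect %in% c_across(contains('target'))])) %>%
  ungroup()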
# A tibble: 3 × 4
  written                        target.apple target.banana incorrect
  <chr>                          <chr>        <chr>         <chr>
1 apples car banana hat pencil r apples       banana        car, hat, pencil, r
2 papeer apple cars spoon        apple        NA            papeer, cars, spoon
3 dice banaana pen f apple berry apple        banaana       dice, pen, f, berry
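The same tokenize-and-filter idea can also be sketched in base R, without dplyr or tokenizers. This is only an illustration under a few assumptions: the target columns are typed in by hand rather than extracted, words are separated by spaces only, and `find_incorrect` is a hypothetical helper name that does not appear in the answer above.

```r
# Sample data with the target columns already filled in (typed by hand here).
recall <- data.frame(
  written = c("apples car banana hat pencil r",
              "papeer apple cars spoon",
              "dice banaana pen f apple berry"),
  target.apple  = c("apples", "apple", "apple"),
  target.banana = c("banana", NA, "banaana"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one sentence into words and drop the targets.
find_incorrect <- function(written, ...) {
  targets <- c(...)                        # an NA target never equals a real word
  words <- strsplit(written, " +")[[1]]    # split on runs of spaces
  paste(words[!words %in% targets], collapse = " ")
}

# Apply the helper row by row across the three columns.
recall$incorrect <- mapply(find_incorrect,
                           recall$written,
                           recall$target.apple,
                           recall$target.banana,
                           USE.NAMES = FALSE)
recall$incorrect
```

Because the comparison is done word by word with %in%, an NA target needs no special handling. The result here is space-separated rather than comma-separated; swap `paste(..., collapse = " ")` for `toString()` to match the output above.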
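Reply from a uj5u.com user:
The NA values make this a little tricky, because str_remove_all() does not handle NA patterns (or pattern = ""). The neatest way I can think of to deal with this is to write a function, str_remove_any(), that handles such patterns by ignoring the NA ones. Then you can do something like this: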
library(stringr)
library(dplyr, warn.conflicts = FALSE)
recall <- tibble(
  written = c(
    "apples car banana hat pencil r",
    "papeer apple cars spoon",
    "dice banaana pen f apple berry"
  )
)
# Remove `pattern` from `x` element-wise, leaving `x` untouched where the pattern is NA
str_remove_any <- function(x, pattern) {
  not_na <- !is.na(pattern)
  x[not_na] <- str_remove_all(x[not_na], pattern[not_na])
  x
}
recall %>%
  mutate(
    target.apple = str_extract(written, "app([^ ]+)"),
    target.banana = str_extract(written, "bana([^ ]+)"),
    incorrect = written %>%
      str_remove_any(fixed(target.apple)) %>%
      str_remove_any(fixed(target.banana))
  )
#> # A tibble: 3 × 4
#>   written                        target.apple target.banana incorrect
#>   <chr>                          <chr>        <chr>         <chr>
#> 1 apples car banana hat pencil r apples       banana        " car  hat pencil r"
#> 2 papeer apple cars spoon        apple        NA            "papeer  cars spoon"
#> 3 dice banaana pen f apple berry apple        banaana       "dice  pen f  berry"
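Removing the target words leaves behind the whitespace that surrounded them (note the leading space in row 1). If that matters, the strings can be tidied afterwards. The sketch below uses base R only; `squish` is a hypothetical helper name, and stringr::str_squish() does the same job if stringr is already loaded. The `leftover` vector is typed in by hand to stand in for the incorrect column.

```r
# Hypothetical helper: collapse runs of whitespace, then trim both ends.
squish <- function(x) trimws(gsub("\\s+", " ", x))

# Stand-in for the `incorrect` column after the target words were removed.
leftover <- c(" car  hat pencil r", "papeer  cars spoon", "dice  pen f  berry")

cleaned <- squish(leftover)
cleaned
```

Inside the pipeline this would amount to one more step at the end of the incorrect expression, for example `%>% str_squish()`.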
Reply from a uj5u.com user:
You can simply remove every instance of app[^ ]+ and bana[^ ]+ by replacing it with an empty string:
recall$incorrect <- gsub("appl[^ ]+|bana[^ ]+", "", recall$written)
recall$incorrect
[1] " car  hat pencil r" "papeer  cars spoon" "dice  pen f  berry"
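If you have many target regular expressions, or they are generated programmatically, you can paste them together to build the matching pattern, using | as the collapse separator:
targets <- c("appl[^ ]+", "bana[^ ]+")
recall$incorrect <- gsub(paste(targets, collapse = "|"), "", recall$written)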
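As a variation on the pasted pattern, the alternation can be built from bare prefixes and anchored at a word boundary, so that a prefix is not matched in the middle of another word. This is a sketch and not part of the answer above: the `prefixes` vector is a hypothetical input, `[^ ]*` (rather than `[^ ]+`) also matches a word that consists of the prefix alone, and the pattern additionally consumes the spaces after each removed word.

```r
written <- c("apples car banana hat pencil r",
             "papeer apple cars spoon",
             "dice banaana pen f apple berry")

# Hypothetical list of target prefixes.
prefixes <- c("appl", "bana")

# \b anchors at a word start; "[^ ]* *" takes the rest of the word and trailing spaces.
pattern <- paste0("\\b(?:", paste(prefixes, collapse = "|"), ")[^ ]* *")

# trimws() handles a target that was the last word of the sentence.
incorrect <- trimws(gsub(pattern, "", written, perl = TRUE))
incorrect
```

The prefixes are inserted into the regular expression as-is, so any that contain regex metacharacters would need escaping first.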
