I have the following data frame:
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the year before the pandemic",
"hey saw hei hei in the wood",
"hello: my kityy! you are the best"), id = 1:5)
report
Text id
1 unit 1 crosses the street 1
2 driver 2 was speeding and saw driver# 1 2
3 year 2019 was the year before the pandemic 3
4 hey saw hei hei in the wood 4
5 hello: my kityy! you are the best 5
Based on earlier coding help, we can remove the stop words with the following code.
report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b',
collapse = '|'), '', report$Text)
report
Text id
1 unit 1 crosses street 1
2 driver 2 speeding saw driver# 1 2
3 year 2019 year pandemic 3
4 hey saw hei hei wood 4
5 hello: kityy! best 5
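The data above still contains noise (numbers, punctuation, and extra whitespace). Before tokenization, I need to remove this noise to get the data in the following format. In addition, I want to remove selected stop words (for example, saw and kitty, which is spelled kityy in the data).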
Text id
1 unit crosses street 1
2 driver speeding driver 2
3 year year pandemic 3
4 hey hei hei wood 4
5 hello best 5
Answer from a uj5u.com user:
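We could take the union of tm::stopwords("english") and the new entries, paste them together with collapse = "|", and remove the matches with gsub using "" as the replacement. The other gsub calls remove punctuation and digits and collapse extra whitespace (\\s+ matches one or more whitespace characters); trimws then strips leading and trailing whitespace.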
trimws(gsub("\\s+", " ",
gsub(paste0("\\b(", paste(union(c("saw", "kityy"),
tm::stopwords("english")), collapse="|"), ")\\b"), "",
gsub("[[:punct:]0-9]+", "", report$Text))
))
-output
[1] "unit crosses street"
[2] "driver speeding driver"
[3] "year year pandemic"
[4] "hey hei hei wood"
[5] "hello best"
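If the cleaning is needed in more than one place, the same three steps could be wrapped in a small function and the result written back to the data frame. This is only a sketch: clean_text and demo_stops are hypothetical names, and demo_stops is a short hand-picked list standing in for tm::stopwords("english") so the example runs without the tm package.

```r
# Hypothetical helper wrapping the three steps from the answer above.
clean_text <- function(x, stop_words) {
  # 1. remove punctuation and digits
  x <- gsub("[[:punct:]0-9]+", "", x)
  # 2. remove whole-word matches of any stop word
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x)
  # 3. collapse runs of whitespace, then trim both ends
  trimws(gsub("\\s+", " ", x))
}

report <- data.frame(Text = c("unit 1 crosses the street",
                              "driver 2 was speeding and saw driver# 1",
                              "year 2019 was the year before the pandemic",
                              "hey saw hei hei in the wood",
                              "hello: my kityy! you are the best"),
                     id = 1:5)

# Assumption: a tiny stand-in for tm::stopwords("english")
demo_stops <- c("the", "was", "and", "before", "in", "my", "you", "are")

# Add the custom entries and overwrite the Text column
report$Text <- clean_text(report$Text, union(c("saw", "kityy"), demo_stops))
report$Text
```

With the tm package installed, passing union(c("saw", "kityy"), tm::stopwords("english")) as stop_words should reproduce the call from the answer. Removing punctuation first matters here: it turns "driver#" and "kityy!" into plain words before the stop-word pattern is applied.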
