I have the following data frame:
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the year before the pandemic",
"hey saw hei hei in the wood",
"hello: my kityy! you are the best"), id = 1:5)
report
Text id
1 unit 1 crosses the street 1
2 driver 2 was speeding and saw driver# 1 2
3 year 2019 was the year before the pandemic 3
4 hey saw hei hei in the wood 4
5 hello: my kityy! you are the best 5
Based on earlier coding help, we can remove the stop words with the following code:
report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b',
collapse = '|'), '', report$Text)
report
Text id
1 unit 1 crosses street 1
2 driver 2 speeding saw driver# 1 2
3 year 2019 year pandemic 3
4 hey saw hei hei wood 4
5 hello: kityy! best 5
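To make clearer what pattern that `paste0()` call builds, here is a minimal sketch. The three-word vector `stops` is a hypothetical stand-in for `tm::stopwords("english")`, used only so the example runs without the tm package.

```r
# Hypothetical short stop list, standing in for tm::stopwords("english").
stops <- c("the", "was", "and")

# Wrap every word in \b ... \b (word boundaries) and join the pieces with "|",
# giving one alternation that matches any of the words as a whole word.
pattern <- paste0("\\b", stops, "\\b", collapse = "|")
cat(pattern, "\n")

# Each match is replaced with an empty string; the spaces around it are kept.
cleaned <- gsub(pattern, "", "driver 2 was speeding and saw driver# 1")
print(cleaned)
```

Because only the words themselves are deleted, the surrounding spaces remain in the string, so the result can contain runs of blanks that need to be squeezed in a later step.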
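I would also like to remove words shorter than a given character length (for example, words with fewer than 4 characters, such as hei and hey). In addition, before tokenizing I need to remove manual stop words (e.g. saw and kityy, as spelled in the data) and common noise (extra whitespace, numbers, and punctuation). The final result would be: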
Text id
1 unit crosses street 1
2 driver speeding driver 2
3 year year pandemic 3
4 wood 4
5 hello best 5
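A similar question about noise and manual stop words was posted here.
A uj5u.com user replied:
Using the earlier code, this should work if we start by removing the words whose nchar is less than or equal to 3 (with gsubfn):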
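library(gsubfn)
# Drop words of 3 or fewer characters, strip punctuation and digits,
# remove the manual and English stop words, then squeeze leftover whitespace.
trimws(gsub("\\s+", " ",
  gsub(paste0("\\b(", paste(union(c("saw", "kityy"),
    tm::stopwords("english")), collapse = "|"), ")\\b"), "",
    gsub("[[:punct:]0-9]+", "", gsubfn("\\w+", function(x)
      if (nchar(x) > 3) x else "", report$Text)))))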
Output:
[1] "unit crosses street" "driver speeding driver"
[3] "year year pandemic" "wood" "hello best"
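The nested call above can be hard to read, so here is a base-R sketch of the same steps written one per line. It is not the answerer's code: the function name `clean_text`, its arguments, and the short `stop_words` vector (a stand-in for `union(c("saw", "kityy"), tm::stopwords("english"))`) are all assumptions made for illustration, and the short-word step uses a `\w{1,n}` regex instead of gsubfn.

```r
# Illustrative helper (hypothetical name), using base R only.
clean_text <- function(x, stop_words = character(0), min_chars = 4) {
  # 1. Drop words shorter than `min_chars` (digits count as word characters).
  if (min_chars > 1) {
    short <- sprintf("\\b\\w{1,%d}\\b", as.integer(min_chars - 1))
    x <- gsub(short, "", x, perl = TRUE)
  }
  # 2. Strip punctuation and digits.
  x <- gsub("[[:punct:][:digit:]]+", "", x)
  # 3. Remove the listed stop words as whole words.
  if (length(stop_words) > 0) {
    pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  # 4. Squeeze runs of whitespace and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

text <- c("unit 1 crosses the street",
          "driver 2 was speeding and saw driver# 1",
          "year 2019 was the year before the pandemic",
          "hey saw hei hei in the wood",
          "hello: my kityy! you are the best")

# Assumed stop list: the two manual words plus "before", which the full
# English list would otherwise supply.
stop_words <- c("saw", "kityy", "before")

result <- clean_text(text, stop_words)
print(result)
```

Keeping each step on its own line makes it easy to change the minimum word length or to swap in the full tm stop word list.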
