Remove words shorter than a given character length and clean up noise before tokenization

I have the following data frame:

report <- data.frame(Text = c("unit 1 crosses the street", 
       "driver 2 was speeding and saw driver# 1", 
        "year 2019 was the year before the pandemic",
        "hey saw       hei hei in        the    wood",
        "hello: my kityy! you are the best"), id = 1:5)
report 
                                         Text id
1                   unit 1 crosses the street  1
2     driver 2 was speeding and saw driver# 1  2
3  year 2019 was the year before the pandemic  3
4 hey saw       hei hei in        the    wood  4
5           hello: my kityy! you are the best  5

Based on earlier coding help, the stop words can be removed with the following code:

report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b', 
                          collapse = '|'), '', report$Text)
report
                                    Text id
1                 unit 1 crosses  street  1
2      driver 2  speeding  saw driver# 1  2
3            year 2019   year   pandemic  3
4 hey saw       hei hei             wood  4
5                 hello:  kityy!    best  5

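I want to remove words shorter than a given character length (for example, words with fewer than 4 characters, such as hei and hey). I also need to remove manual stop words (such as saw and kityy) and common noise (extra whitespace, digits, and punctuation) before tokenization. The final result would be: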

                                    Text id
1                   unit crosses  street  1
2                driver speeding  driver  2
3                     year year pandemic  3
4                                   wood  4
5                             hello best  5

A similar question about noise and manual stop words has been posted here.

A uj5u.com user replied:

Building on the earlier code, it should work if we start by removing the words whose nchar is less than or equal to 3 (with gsubfn):

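library(gsubfn)  # provides gsubfn()

# Innermost call first: drop words of 3 characters or fewer, strip punctuation
# and digits, remove manual + English stop words, then collapse the leftover
# runs of whitespace so the result matches the output shown.
trimws(gsub("\\s+", " ", gsub(paste0("\\b(", paste(union(c("saw", "kityy"), 
   tm::stopwords("english")), collapse="|"), ")\\b"), "", 
     gsub("[[:punct:]0-9]+", "", gsubfn("\\w+", function(x) 
     if(nchar(x) > 3) x else '', report$Text)))))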

Output:

[1] "unit crosses street"    "driver speeding driver" 
[3] "year year pandemic"     "wood"                   "hello best"
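If the nested calls are hard to follow, the same steps can be written one per line in base R. This is only a sketch under a few assumptions: clean_text is a hypothetical helper name, not a function from tm or gsubfn, and the short stop_words vector is a stand-in so the example runs without tm installed. In practice you would pass union(c("saw", "kityy"), tm::stopwords("english")) instead. Unlike the answer above, it strips digits before the length filter, which gives the same result on this data.

```r
# Hypothetical helper (illustrative name and arguments).
clean_text <- function(x, stop_words, min_chars = 4) {
  # 1. Delete punctuation and digits.
  x <- gsub("[[:punct:][:digit:]]+", "", x)
  # 2. Delete whole words with fewer than min_chars characters.
  if (min_chars > 1) {
    short <- sprintf("\\b\\w{1,%d}\\b", min_chars - 1)
    x <- gsub(short, "", x, perl = TRUE)
  }
  # 3. Delete stop words, matched as whole words only.
  stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, "", x, perl = TRUE)
  # 4. Collapse runs of whitespace and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

report <- data.frame(Text = c("unit 1 crosses the street",
        "driver 2 was speeding and saw driver# 1",
        "year 2019 was the year before the pandemic",
        "hey saw       hei hei in        the    wood",
        "hello: my kityy! you are the best"), id = 1:5)

# Stand-in stop word list (assumption): only entries of 4+ characters matter
# here, because shorter words are already dropped in step 2.
stop_words <- c("saw", "kityy", "before")

cleaned <- clean_text(report$Text, stop_words)
print(cleaned)
```

Keeping each step on its own line makes it easy to reorder them or change the minimum length through min_chars.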


Tags: r, nlp, text-mining, tm, stop-words
