Extracting characters that are not allowed

I have transcriptions with encoding errors, i.e. characters that occur but should not.

In this toy data, the only allowed characters are the ones in this class:

"[)(/][A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]"

df <- data.frame(
  Utterance = c("~°maybe you (.) >should ￥just￥<",
                "SOME text |<-- pipe? and€",            # <--: | and €
                "blah%",                                # <--: %
                "text ^more text",                      # <--: ^
                "￡norm(hh)a::l￡mal, (1.22)"))

What I need to do is:

detect the Utterance values that contain any wrongly encoded characters
extract the wrong characters

The detection works fine, but the extraction fails:

library(stringr)
library(dplyr)
df %>%
  filter(!str_detect(Utterance, "[)(/][A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]")) %>%
  mutate(WrongChar = str_extract_all(Utterance, "[^)(/][A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]"))
                  Utterance                                  WrongChar
1 SOME text |<-- pipe? and€ SO, ME,  t, ex, |<, --,  p, ip, e?,  a, nd
2                     blah%                                     bl, ah
3           text ^more text                     te, xt, ^m, or,  t, ex

How can the extraction be improved to get this expected result:

                  Utterance WrongChar
1 SOME text |<-- pipe? and€      |, €
2                     blah%         %
3           text ^more text         ^

Answer from a uj5u.com user:

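You need to

make sure [ and ] are escaped inside the character class
add a whitespace pattern to both regex checks, since its absence messes up your results.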
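
To illustrate the first point, here is a small sketch comparing the pattern from the question with the escaped version on one of the toy strings. The variable names are made up for illustration. With the unescaped ], the regex engine reads two character classes in a row, so every match is a pair of characters rather than a single disallowed one:

```r
library(stringr)

x <- "blah%"

# Pattern from the question: the unescaped "]" closes the first class early,
# so this is "[^)(/]" followed by a second class, i.e. two characters per match.
two_classes <- "[^)(/][A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]"
pairs <- str_extract_all(x, two_classes)[[1]]

# Escaping "]" and "[" keeps everything inside one negated class;
# "\\s" adds whitespace to the set of allowed characters.
one_class <- "[^\\s)(/\\]\\[A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]"
bad <- str_extract_all(x, one_class)[[1]]

print(pairs)
print(bad)
```

The first call should return the pairs bl and ah, as in the question's output, while the second returns only %.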

So you need to use

df %>%
   filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]")) %>%
   mutate(WrongChar = str_extract_all(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]"))

Output:

                  Utterance WrongChar
1 SOME text |<-- pipe? and€      |, €
2                     blah%         %
3           text ^more text         ^

Note that I used positive logic in filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]")), so we get all the items that contain at least one character other than the allowed ones.
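
Since the same negated class is used for both detection and extraction, one possible variation is to store it once and reuse it. The sketch below assumes the toy data from the question; the name not_allowed is hypothetical, and the vapply(..., toString, ...) step is an optional addition, not part of the answer above, that collapses the list column into a plain string:

```r
library(stringr)
library(dplyr)

df <- data.frame(
  Utterance = c("~°maybe you (.) >should ￥just￥<",
                "SOME text |<-- pipe? and€",
                "blah%",
                "text ^more text",
                "￡norm(hh)a::l￡mal, (1.22)"),
  stringsAsFactors = FALSE
)

# Hypothetical variable name; the pattern itself is the one from the answer.
not_allowed <- "[^\\s)(/\\]\\[A-Za-z0-9↑↓￡￥°!.,:??~<>≈=_-]"

res <- df %>%
  # Positive logic: keep rows with at least one disallowed character.
  filter(str_detect(Utterance, not_allowed)) %>%
  # Extract every disallowed character and join them with ", " per row.
  mutate(WrongChar = vapply(str_extract_all(Utterance, not_allowed),
                            toString, character(1)))

print(res)
```

Keeping the pattern in one place means a later change to the allowed set cannot leave the filter and the extraction out of sync.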


Tags: r, regex
