如何跳過或更正python中正則運算式無法識別的壞字符？[復制]-有解無憂

這個問題在這里已經有了答案：轉義正則運算式字串 4 個答案 13 小時前關閉。

我想從每個字串（句子）中替換子字串（單詞）。我首先從檔案中加載這些單詞以組成正則運算式模式，然后進行如下搜索和替換：

words = load_stopwords('word.txt'))
PATTERN = '['   '|'.join(words)   ']'

def remove_words(text):

    if not text:
        return text
    
    return re.sub(PATTERN, '', text)

我從檔案中加載了大約 5000 個單詞，并且檔案中可能有一些正則運算式不喜歡的壞字符，因為它會產生這個錯誤：

File "/usr/local/Cellar/[email protected]/3.7.12/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/local/Cellar/[email protected]/3.7.12/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "/usr/local/Cellar/[email protected]/3.7.12/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 574, in _parse
    raise source.error(msg, len(this)   1   len(that))
re.error: bad character range |-R at position 535

假設我不能事先清理單詞串列，即使模式中有一些壞字符，我怎樣才能使 re.sub() 作業？

比如我手動發現word檔案中也有這樣的字符：

|
-

還有一些表情符號字符，我確實想像其他常用詞一樣洗掉它們。也許我應該修改構成 PATTERN 的方式并讓正則運算式在匹配中適應它們？但是可能會有很多不同的特殊字符，比如這些。

補充：采用蒂姆的方法后，我得到了一個不同的錯誤：

File "/usr/local/Cellar/[email protected]/3.7.12/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 645, in _parse
    source.tell() - here   len(this))
re.error: nothing to repeat at position 701

uj5u.com熱心網友回復：

你在這里想要的是一個正則運算式替換，它采用(?:word1|word2|word3).

words = load_stopwords('word.txt'))
words = [re.escape(x) for x in words]
PATTERN = r'\s*\b(?:'   '|'.join(words)   r')\b\s*'

def remove_words(text):
    if not text:
        return text

    return re.sub(PATTERN,' ', text)

您使用[word1|word2|word3]的 which 是一個字符類，表示任何單詞中的任何字符。

轉載請註明出處，本文鏈接：https://www.uj5u.com/gongcheng/418904.html

標籤：

上一篇：正則運算式掃描檔案以獲取特定內容

下一篇：Wordle的正則運算式