Unexpected results with spaCy regular expression matching

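I'm getting unexpected results when matching a regular expression with spaCy (version 3.1.3). I defined a simple regular expression to identify a digit, then built strings consisting of a digit followed by a letter and tried to match them. Everything works as expected except for the letters g, m, and t:

Here is a minimal example: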

import string
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
# Raw string so that \d reaches the regex engine unchanged
pattern = [{"TEXT": {"REGEX": r"\d"}}]
matcher = Matcher(nlp.vocab)
matcher.add("usage", [pattern])

# Try "2a", "2b", ..., "2z" and print whatever the matcher returns
for l in string.ascii_lowercase:
    doc = nlp(f"2{l}")
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(l, span.text)

Result:

a 2a
b 2b
c 2c
d 2d
e 2e
f 2f
g 2    # EXPECTED 2g
h 2h
i 2i
j 2j
k 2k
l 2l
m 2   # EXPECTED 2m
n 2n
o 2o
p 2p
q 2q
r 2r
s 2s
t 2   # EXPECTED 2t
u 2u
v 2v
w 2w
x 2x
y 2y
z 2z

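Answer:

The problematic strings are each split into two tokens, because spaCy's English tokenizer treats g, m, and t directly after a digit as unit suffixes and splits them off: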

2g => ['2', 'g']
2m => ['2', 'm']
2t => ['2', 't']
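One way to confirm the split is to print the tokens directly. The sketch below assumes the same blank English pipeline as in the question; `token_texts` is a helper name introduced here for illustration only, and `nlp.tokenizer.explain` is spaCy's debugging helper that reports which tokenizer rule produced each piece.

```python
from spacy.lang.en import English

nlp = English()


def token_texts(text):
    """Return the token strings spaCy produces for `text` (illustrative helper)."""
    return [token.text for token in nlp(text)]


# Compare a string that stays whole with the three that get split.
for sample in ("2a", "2g", "2m", "2t"):
    print(sample, "=>", token_texts(sample))

# Show which tokenizer rule (TOKEN, SUFFIX, ...) produced each piece of "2g".
for rule, piece in nlp.tokenizer.explain("2g"):
    print(rule, repr(piece))
```

Running this on your own inputs is a quick way to see whether a surprising match comes from the pattern or from the tokenizer.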

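To match these strings, the pattern has to account for the fact that the letter g, m, or t may be a separate token that follows the number.

In that case, you can use: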

import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
# A token containing a digit, optionally followed by a separate g/m/t token
pattern = [{"TEXT": {"REGEX": r"\d"}}, {"TEXT": {"REGEX": r"^[gmt]$"}, "OP": "?"}]
matcher = Matcher(nlp.vocab)
matcher.add("usage", [pattern])

text = "some 1.2t other stuff 1.2a"
doc = nlp(text)
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
# Keep only the longest of any overlapping matches
for span in spacy.util.filter_spans(spans):
    print(span.text)

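Here, the pattern first matches a token that contains a digit and then, optionally (because of "OP": "?"), a token equal to g, m, or t. Because the second token is optional, the matcher reports both the short and the long match; spacy.util.filter_spans keeps only the longest of the overlapping matches.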
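To see what filter_spans actually removes, it can help to print the matches before and after filtering. This is a small sketch using the same pattern and sample text as above; the variable names `raw_texts` and `filtered_texts` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add("usage", [[
    {"TEXT": {"REGEX": r"\d"}},                  # token containing a digit
    {"TEXT": {"REGEX": r"^[gmt]$"}, "OP": "?"},  # optional separate g/m/t token
]])

doc = nlp("some 1.2t other stuff 1.2a")
raw_spans = [doc[start:end] for _, start, end in matcher(doc)]

# Every match the matcher reports, including the shorter overlapping one.
raw_texts = sorted(span.text for span in raw_spans)
# Overlaps resolved in favour of the longest span, ordered by position.
filtered_texts = [span.text for span in spacy.util.filter_spans(raw_spans)]

print("raw:     ", raw_texts)
print("filtered:", filtered_texts)
```

The raw list contains both "1.2" and "1.2t" for the first number, which is why the filtering step is needed.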

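If the first token should only ever be a number, you can make the pattern more precise. In that case, change "REGEX": "\d" to "REGEX": "^\d+(?:\.\d+)?[a-z]?$" (matches numbers such as 5/5a or 55.555/55.555a) or to "REGEX": "^\d*\.?\d+[a-z]?$" (this one also matches strings such as .5/.5a), and keep the second token as it is. Or, better, use two patterns: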

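pattern = [
    # Number with a trailing letter that stays in the same token, e.g. "1.2a"
    [{"TEXT": {"REGEX": r"^\d+(?:\.\d+)?[a-z]$"}}],
    # Number followed by a separate g/m/t token, e.g. "1.2" + "t"
    [{"TEXT": {"REGEX": r"^\d+(?:\.\d+)?$"}}, {"TEXT": {"REGEX": r"^[gmt]$"}}]
]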
matcher = Matcher(nlp.vocab)
matcher.add("usage", pattern)
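For completeness, here is a self-contained sketch that runs the two-pattern approach on a sample sentence. The sample text and the variable names `patterns` and `found_texts` are assumptions added for illustration; the regexes use `+` quantifiers (one or more digits).

```python
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
patterns = [
    # Number plus a letter kept in the same token, e.g. "1.2a"
    [{"TEXT": {"REGEX": r"^\d+(?:\.\d+)?[a-z]$"}}],
    # Number followed by a separate g/m/t token, e.g. "1.2" + "t"
    [{"TEXT": {"REGEX": r"^\d+(?:\.\d+)?$"}}, {"TEXT": {"REGEX": r"^[gmt]$"}}],
]
matcher = Matcher(nlp.vocab)
matcher.add("usage", patterns)

doc = nlp("some 1.2t other stuff 1.2a and 7 more")
# Sort by start offset so the output follows the order in the text.
found = sorted((start, doc[start:end].text) for _, start, end in matcher(doc))
found_texts = [text for _, text in found]
print(found_texts)
```

With two anchored patterns no match overlaps another here, so filter_spans is not needed. Note that a bare number such as "7" is not matched by either pattern, since both require a letter.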
