Regex: find every sentence that contains a citation

I found this code, which detects all the citations in a text:

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'

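It actually works well, but I need every full sentence that contains a citation (from the period that ends the previous sentence to the period that ends this one). So in this example:

"Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."

I want to get this: "In this line, actually, there is a citation (Author et al., 2022)."

How should I edit the code above to achieve this?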

A uj5u.com user replied:

You can use the following regular expression:

r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"
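
To see what this pattern returns on the example from the question, here is a minimal sketch. It assumes the quantifiers in the character classes are `+`; the variable names `pattern`, `text` and `matches` are only illustrative.

```python
import re

# The answer's pattern, split over two lines for readability.
# Assumption: the repeated character classes use `+` quantifiers.
pattern = re.compile(
    r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))"
    r"(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"
)

text = (
    "Nothing is here. In this line, actually, there is a citation "
    "(Author et al., 2022). Once again, In this line there is nothing."
)

# group(0) includes the whitespace consumed by the leading \s*, so strip it.
matches = [m.group(0).strip() for m in pattern.finditer(text)]
print(matches)
```

Note that `[^.]+` cannot cross a dot, so this approach seems to rely on the sentence having no dots before the parenthesis and ending right after the citation.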


A uj5u.com user replied:

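You need to solve the problem in two steps: (a) split the text into sentences, and (b) detect which sentences contain a citation. Sentence tokenization is not trivial to get right, so use a library for it. For example: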

>>> import nltk
>>> text = "Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."
>>> sentences = nltk.sent_tokenize(text)
>>> print(sentences)
['Nothing is here.', 'In this line, actually, there is a citation (Author et al., 2022).', 'Once again, In this line there is nothing.']

Then, using your definitions:

>>> import re
>>> citation = fr"{author}{additional}*{year}"
>>> for s in sentences:
...     if re.search(citation, s):
...         print(s)
...
In this line, actually, there is a citation (Author et al., 2022).
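
The two steps can also be put together in one script. The sketch below is only an approximation of the approach: so that it runs without any download, it swaps `nltk.sent_tokenize` for a naive regex splitter (split after `.`, `!` or `?` when an uppercase letter follows), which will break on abbreviations such as "e.g. Smith". The function names `naive_sentences` and `sentences_with_citations` are hypothetical, and the `+` quantifiers in the citation pattern are assumed.

```python
import re

# Citation pattern from the question (assumption: `+` quantifiers).
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
citation = re.compile(fr"{author}{additional}*{year}")


def naive_sentences(text):
    """Hypothetical stand-in for nltk.sent_tokenize: split after ., ! or ?
    when the next non-space character is an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def sentences_with_citations(text):
    """Step (a): split into sentences. Step (b): keep those with a citation."""
    return [s for s in naive_sentences(text) if citation.search(s)]


example = (
    "Nothing is here. In this line, actually, there is a citation "
    "(Author et al., 2022). Once again, In this line there is nothing."
)
print(sentences_with_citations(example))
```

For real text, replacing `naive_sentences` with `nltk.sent_tokenize` should be the more robust choice, as the answer recommends.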

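PS. If you have never used nltk before, you need to download the sentence tokenizer once. You will see an error message telling you to run the command below; run it once and you are done for good. (Newer NLTK releases may name the resource 'punkt_tab'; use whichever name the error message gives.)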

nltk.download('punkt')

A uj5u.com user replied:

Try this:

(?<=\. )[^(\.]+\(([^)]+)\).*?\.

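Explanation:

(?<=\. ) : lookbehind that checks for a preceding dot-and-space sequence
[^(\.]+ : any run of characters other than an opening parenthesis or a dot
\( : opening parenthesis
([^)]+) : any run of characters other than a closing parenthesis
\) : closing parenthesis
.*? : optional, lazy run of any characters
\. : the dot that ends the sentence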

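Edge cases this solution does not handle:

<space><dot><word> (like .dotnet) as a word inside the sentence, before the parenthesis: it always treats <space><dot> as the start of a sentence.
<word><dot><space> (like e.g.) as a word inside the sentence, after the parenthesis: it always treats <dot><space> as the end of the sentence.

One way to handle these edge cases is to preprocess the text first and convert or remove any abbreviations it contains.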
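
A rough sketch of that preprocessing idea: mask the dots inside known abbreviations with a placeholder, run the regex, then restore the dots in each match. The abbreviation list, the placeholder character and the function names `protect`, `restore` and `find_sentences` are all assumptions made for illustration, and the pattern is written with `+` quantifiers as in the explanation above.

```python
import re

# Pattern from this answer (assumption: `+` quantifiers, dot excluded
# from the first character class as in the explanation).
SENTENCE_WITH_PARENS = re.compile(r"(?<=\. )[^(\.]+\(([^)]+)\).*?\.")

# Hypothetical, deliberately short list; extend it for your own corpus.
ABBREVIATIONS = ["e.g.", "i.e.", "et al."]

# U+2024 ONE DOT LEADER, assumed not to occur in the input text.
PLACEHOLDER = "\u2024"


def protect(text):
    """Replace the dots inside known abbreviations with the placeholder."""
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", PLACEHOLDER))
    return text


def restore(text):
    """Turn placeholders back into ordinary dots."""
    return text.replace(PLACEHOLDER, ".")


def find_sentences(text):
    """Run the regex on the protected text and restore each match."""
    protected = protect(text)
    return [restore(m.group(0)) for m in SENTENCE_WITH_PARENS.finditer(protected)]


sample = (
    "Nothing is here. In this line, e.g. here, there is a citation "
    "(Author et al., 2022). Done."
)
print(find_sentences(sample))
```

Without the masking step, the "e.g. " in this sample would be taken as a sentence boundary and the match would start at "here, there is ...". The lookbehind also means a sentence at the very start of the text is never matched, with or without preprocessing.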
