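I am trying to extract the comma-separated numbers inside () parentheses from a string. I can get the numbers when they are alone on a line, but I cannot find a solution that works when there is other surrounding text. Any help would be appreciated. Below is the code I am currently using in Python.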
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
    refline = line[line.find('(')+1:line.find(')')]
    if not re.search('[a-zA-Z]', refline):
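Answer from a uj5u.com user:
Remove ^ and $: these anchors are what stops you from getting all the numbers. Also, the trailing gm flags do not work with Python's re module.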
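To illustrate the point about anchors and flags: in Python's re module, flags are passed as an argument (or inline, e.g. (?m)) rather than appended after the pattern, and the "global" behaviour comes from calling re.findall or re.finditer. A minimal sketch, using a made-up two-line string named text that is not part of the question:

```python
import re

# Hypothetical sample text, only for illustration.
text = "first (1,2)\n(345)"

# With ^ and $ and no flag, the whole string must be one parenthesised number.
anchored = re.findall(r"^\(\d+\)$", text)

# re.MULTILINE is Python's counterpart of the "m" flag: ^ and $ match at each line.
per_line = re.findall(r"^\(\d+\)$", text, flags=re.MULTILINE)

# re.findall already returns every match, so no "g" flag is needed.
all_numbers = re.findall(r"\d+", text)

print(anchored)     # []
print(per_line)     # ['(345)']
print(all_numbers)  # ['1', '2', '345']
```

Even with re.MULTILINE, the anchored pattern only finds numbers that fill an entire line, which is why dropping the anchors is the simpler fix here.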
You can change your regex to ([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test the regex: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
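(?<=[(,])[1-9][0-9]+(?=[,)])
The pattern above matches a number that starts with 1-9 and is followed by one or more digits, provided the number is preceded by a comma or an opening parenthesis and followed by a comma or a closing parenthesis.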
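One consequence of the [0-9]+ quantifier is that a one-digit reference such as (7) is skipped, because at least two digits are required in total. If single digits can occur in your data (an assumption; the sample text in the question has none), a sketch that swaps in [0-9]* could look like this. The string sample is made up for illustration:

```python
import re

# Hypothetical input containing a one-digit reference; not from the original question.
sample = "see (7,101065) and (42)"

# Pattern from the answer: first digit 1-9, then at least one more digit.
two_or_more = re.findall(r"(?<=[(,])[1-9][0-9]+(?=[,)])", sample)

# Variation: first digit 1-9, then zero or more digits.
one_or_more = re.findall(r"(?<=[(,])[1-9][0-9]*(?=[,)])", sample)

print(two_or_more)  # ['101065', '42']
print(one_or_more)  # ['7', '101065', '42']
```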
Answer from a uj5u.com user:
Here is another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
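(?<=\(): lookbehind for (
[1-9]+\d*: at least one digit (\d+ would also work?)
(?:,[1-9]\d*)*: zero or more further numbers, each with a leading ,
(?=\)): lookahead for )
Result for your line: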
[['101065', '101066', '101067'], ['101065']]
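The pattern in this answer can also be written with re.VERBOSE so that each part carries its own comment. This is only a readability sketch: it uses [1-9]\d* for the first number, which matches the same strings as [1-9]+\d*, and the short string text is a made-up stand-in for the sample line:

```python
import re

# Same idea as the compact pattern, spelled out with re.VERBOSE.
pattern = re.compile(
    r"""
    (?<=\()            # lookbehind: an opening parenthesis comes right before
    [1-9]\d*           # first number, no leading zero
    (?:,[1-9]\d*)*     # zero or more further numbers, each preceded by a comma
    (?=\))             # lookahead: a closing parenthesis comes right after
    """,
    re.VERBOSE,
)

# Hypothetical shortened input, for illustration only.
text = "medicines (101065,101066,101067) and vine (101065)"

# One list of numbers per parenthesised group.
results = [m[0].split(",") for m in pattern.finditer(text)]
print(results)  # [['101065', '101066', '101067'], ['101065']]
```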
If you only want the groups that really are comma-separated (two or more numbers):
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: one or more further numbers, each with a leading ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like this:
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle \s* through the pattern and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
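As a possible alternative to the translate table, the whitespace can be stripped per number after splitting. The sketch below is not the answerer's code: it captures the content between the parentheses in a group instead of using lookarounds, and it uses \d+ (so leading zeros would be accepted, unlike [1-9]\d*). Because str.strip removes any whitespace, it also copes with tabs, which a table that deletes only " " would leave in place. The sample line is shortened for illustration:

```python
import re

# Shortened, hypothetical version of the sample text with stray spaces.
line = """
It has also been used in medicines ( 101065,101066, 101067 )
The genus name is derived from the Greek words for ivy and vine (101065)
"""

# Capture everything between the parentheses, allowing whitespace
# around the numbers and around the commas.
pattern = re.compile(r"\(\s*(\d+(?:\s*,\s*\d+)*)\s*\)")

# Split each captured group on commas and strip whitespace from every piece.
results = [
    [part.strip() for part in match[1].split(",")]
    for match in pattern.finditer(line)
]
print(results)  # [['101065', '101066', '101067'], ['101065']]
```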
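Answer from a uj5u.com user:
With the PyPI regex module, you can also use a repeated named capture group:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( matches (
(?P<num>\d+) capture group num, matches 1 or more digits
(?:,(?P<num>\d+))* optionally repeats matching a , followed by 1 or more digits in capture group num
\) matches )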
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for m in matches:
    # capturesdict() returns every value captured by each named group
    print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}
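If installing the third-party regex module is not an option, a similar result can be approximated with the standard re module in two steps. The standard module keeps only the last value of a repeated group and does not allow the same group name twice, so this sketch first captures the whole parenthesised run and then extracts the numbers from it. The names group_pattern and text are illustrative, and the dictionary shape simply mimics the output shown above:

```python
import re

# Hypothetical shortened input, for illustration only.
text = "medicines (101065,101066,101067)\nivy and vine (101065)"

# Step 1: find each parenthesised, comma-separated run of digits.
group_pattern = re.compile(r"\((\d+(?:,\d+)*)\)")

# Step 2: pull the individual numbers out of each captured run.
results = [{"num": re.findall(r"\d+", m[1])} for m in group_pattern.finditer(text)]

for item in results:
    print(item)
```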
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/457120.html
