Splitting a sentence into words with a regular expression

I want to split a sentence into words with a regular expression, and I am using the following code:

import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
sentence = re.split(r'\s|,|>|<|\[|\]:', sentence)

But the result is not what I expected.

The expected output is:

['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']

But what I get is:

['', '30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', '', 'tester-test.service:', 'activation', 'successfully.']

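I split on whitespace, but whitespace should be ignored as a separator only inside the last long chunk, and I cannot figure out how to do that. Any suggestions or help would be appreciated. Thank you in advance.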

A uj5u.com user replied:

You can use

import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
chunks = sentence.split(': ', 1)
result = re.findall(r'[^][\s,<>]+', chunks[0])
result.append(chunks[1])
print(result)
# => ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']


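Here,

chunks = sentence.split(': ', 1) - splits the sentence into two chunks at the first ': ' substring
result = re.findall(r'[^][\s,<>]+', chunks[0]) - extracts from the first chunk all substrings of one or more characters other than ], [, whitespace, comma, < and >
result.append(chunks[1]) - appends the second chunk to the result list.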
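
As a hedged sketch, the steps above can be wrapped in a small helper. The function name split_log_line is hypothetical, and the fallback for a line that contains no ': ' is an assumption, since the question only shows one input.

```python
import re

def split_log_line(line):
    """Tokenize the part before the first ': ' and keep the rest whole."""
    # partition splits at the first ': ' only; sep is '' when it is absent
    head, sep, tail = line.partition(': ')
    # One or more characters that are not ], [, whitespace, comma, < or >
    tokens = re.findall(r'[^][\s,<>]+', head)
    if sep:  # assumption: append the message only when ': ' was found
        tokens.append(tail)
    return tokens

sentence = '<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
print(split_log_line(sentence))
```

Using str.partition instead of split(': ', 1) avoids an IndexError on chunks[1] when a line has no ': ' at all.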

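A uj5u.com user replied:

It appears from the "expected output" of your example that as soon as a character is encountered that is preceded by the string ': ', that character and everything after it (to the end of the string) are to be returned as one item. I assume that is one of the rules.

This suggests to me that you want to return matches (rather than the result of splitting), and that the regular expression should be a two-part alternation (that is, of the form ...|...), with the first part being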

(?<=: ).+

which reads, "greedily match one or more characters, the first of which is preceded by a colon followed by a space". (?<=: ) is a positive lookbehind.
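
To see the lookbehind in isolation, a minimal sketch is to run only this first part against the sentence from the question. It assumes the pattern is (?<=: ).+, that is, with the + quantifier that the description "one or more characters" implies.

```python
import re

sentence = '<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
# A match can only start right after the first ': ', and .+ then runs to the end of the line
first_part = re.findall(r'(?<=: ).+', sentence)
print(first_part)
```

This yields a single item, the whole trailing message, which is why the second half of the alternation only has to handle the tokens that come before it.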

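Before reaching the first character that is preceded by a colon followed by a space, we need to match strings made up of digits, letters and hyphens, plus colons that are both preceded and followed by a digit. The required regular expression is therefore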

rgx = r'(?<=: ).+|(?:[\da-zA-Z-]|(?<=\d):(?=\d))+'

So you can write

import re

# named sentence rather than str, to avoid shadowing the built-in str type
sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."

re.findall(rgx, sentence)
  #=> ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd',
  #    '1', 'tester-test.service: activation successfully.']


The components of the regular expression are as follows.

(?<=: )        # the preceding string must be ': '
.+             # match one or more characters (greedily)
|              # or
(?:            # begin a non-capture group
  [\da-zA-Z-]  # match one character in the character class
  |            # or
  (?<=\d)      # the previous character must be a digit
  :            # match a literal colon
  (?=\d)       # the next character must be a digit
)+             # end the non-capture group and execute it one or more times

(?=\d) is a positive lookahead.
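
The annotated breakdown can also be kept in the code itself with re.VERBOSE. This is only a sketch: the name RGX is made up for the example, and the space in the lookbehind is written as [ ] because verbose mode ignores bare whitespace in the pattern.

```python
import re

RGX = re.compile(r"""
    (?<=:[ ])        # the preceding string must be a colon followed by a space
    .+               # match one or more characters (greedily)
    |                # or
    (?:              # begin a non-capture group
      [\da-zA-Z-]    # match one character in the character class
      |              # or
      (?<=\d)        # the previous character must be a digit
      :              # match a literal colon
      (?=\d)         # the next character must be a digit
    )+               # end the non-capture group; repeat one or more times
""", re.VERBOSE)

sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."
tokens = RGX.findall(sentence)
print(tokens)
```

Because every group is non-capturing, findall returns the full text of each match rather than group contents.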
