Regular expression: matching multiple timestamps in a string

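I have a text file in which every line starts with a timestamp and may contain further timestamps. The first timestamp is always enclosed in [], while timestamps in the middle of a line are always enclosed in <>. The goal is to write a regular expression pattern that creates groups for each timestamp and the text that follows it. I am new to regular expressions and am having a hard time getting used to them. The text looks like this: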

[00:22.88]Lorem <11:53.82>ipsum dolor sit amet, consectetur <98:23.52>adipiscing elit
[00:34.08]eiusmod <00:42.52>tempor incididunt ut <10:67.58>labore et dolore

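However, the lines are fed to the regular expression one at a time, so other lines do not need to be considered (although some exception is needed to handle the newline at the end of a line, or the end of the file...).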

The desired output would look like this (for each line):

[('00:22.88', 'Lorem '), ('11:53.82', 'ipsum dolor sit amet, consectetur '), ('98:23.52', 'adipiscing elit')]

For example, this pattern works for the first timestamp:

\[(\d{2}:\d{2}.\d{2})\]\s*(.+)

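For the rest, I don't know what to do. I tried adding | between the bracket and the less-than sign to make it match "this or that", but it didn't work: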

\[|<(\d{2}:\d{2}.\d{2})\]|>(.+)

I also tried this, attempting to match anything between the timestamps, and it doesn't work either.

\[(\d{2}:\d{2}.\d{2})\]\s*<([0-9]+:[0-9.]*)>\s*(.+)\s*

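If someone with more regular expression experience could lend me a hand, I would be very grateful; I don't know how to solve this. I did find a really neat website for writing regular expression patterns, which is very useful when trying to write your own: https://regexr.com/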

A uj5u.com user replied:

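To split with a pure regular expression I would use the following. The regular expression matches < or [ followed by your digit pattern and then > or ] as the timestamp. The content is everything that follows, up to the next < or [.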

import re

regex = r"(?:<|\[)([\d]{2}:[\d]{2}\.[\d]{2})(?:\]|>)([^<\[]+)"

test_str = ("[00:22.88]Lorem <11:53.82>ipsum dolor sit amet, consectetur <98:23.52>adipiscing elit\n"
    "[00:34.08]eiusmod <00:42.52>tempor incididunt ut <10:67.58>labore et dolore")

matches = re.finditer(regex, test_str, re.MULTILINE)

found = []

for matchNum, match in enumerate(matches, start=1):
    found.append((match.group(1).strip(), match.group(2).strip()))
    
print(found)

The regular expression above can be visualized and debugged at the following link: https://regex101.com/r/Pyr2J4/1
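As a side note on the question's second attempt: | has the lowest precedence in a regular expression, so a top-level | splits the whole pattern into separate alternatives instead of choosing between two single characters. Character classes such as [\[<] avoid that. Below is a minimal sketch of the same idea applied to one line at a time, as the question describes, and leaving the text unstripped so the trailing spaces of the desired output are kept. The names TIMESTAMP_AND_TEXT and parse_line are made up for this illustration.

```python
import re

# Hypothetical name. The character classes [\[<] and [\]>] stand in for
# "[ or <" and "] or >", so no top-level "|" is needed.
# The text group stops at the next "[" or "<", or at the newline.
TIMESTAMP_AND_TEXT = re.compile(r"[\[<](\d{2}:\d{2}\.\d{2})[\]>]([^\[<\n]*)")


def parse_line(line):
    """Return (timestamp, text) pairs for one line; the text is not stripped."""
    return TIMESTAMP_AND_TEXT.findall(line)


line = "[00:22.88]Lorem <11:53.82>ipsum dolor sit amet, consectetur <98:23.52>adipiscing elit\n"
print(parse_line(line))
# [('00:22.88', 'Lorem '), ('11:53.82', 'ipsum dolor sit amet, consectetur '), ('98:23.52', 'adipiscing elit')]
```

Like the pattern above, this sketch still cuts the text short if the content itself contains a < or [.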

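The regular expression above may be good enough for you, but it fails if the text itself contains a < or [ (for example "Lorem < ipsum ..."). If you want to handle those too, I suggest matching only the timestamps and then taking the text between matches as the content. In addition, unlike the one above, the regular expression below does not accept mismatched timestamps such as [00:00.00>. This takes a bit more Python: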

import re

regex = r"<[\d]{2}:[\d]{2}\.[\d]{2}>|\[[\d]{2}:[\d]{2}\.[\d]{2}\]"

test_str = ("[00:22.88]Lorem <11:53.82>ipsum dolor sit amet, consectetur <98:23.52>adipiscing elit\n"
    "[00:34.08]eiusmod <00:42.52>tempor incididunt ut <10:67.58>labore et dolore")

matches = re.finditer(regex, test_str, re.MULTILINE)

found = []
last_match_end = None

for matchNum, match in enumerate(matches, start=1):
    if len(found) > 0 and last_match_end is not None:
        # add the text from the end of the last match to the start of the 
        # current match as the text of the last match (to previous list value)
        found[-1].append(test_str[last_match_end:match.start()].strip())
        
    # take the timestamp (=match) from the current match
    found.append([match.group().strip("<>[]")])
    # save end of this match
    last_match_end = match.end()
    
if len(found) > 0 and last_match_end is not None:
    # add missing text of last match
    found[-1].append(test_str[last_match_end:].strip())

print(found)
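The same approach can also be wrapped in a small function that handles a single line, which may be closer to how the question feeds its input. This is only a sketch: the names TIMESTAMP and split_on_timestamps are hypothetical, and it returns tuples with unstripped text rather than the lists built above.

```python
import re

# Hypothetical name. Matches only correctly paired timestamps:
# <mm:ss.xx> (group 1) or [mm:ss.xx] (group 2).
TIMESTAMP = re.compile(r"<(\d{2}:\d{2}\.\d{2})>|\[(\d{2}:\d{2}\.\d{2})\]")


def split_on_timestamps(line):
    """Pair each timestamp with the text up to the next timestamp (or line end)."""
    matches = list(TIMESTAMP.finditer(line))
    pairs = []
    for i, match in enumerate(matches):
        # The text ends where the next timestamp starts, or at the end of the line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        timestamp = match.group(1) or match.group(2)
        # Drop only a trailing newline; other whitespace is preserved.
        pairs.append((timestamp, line[match.end():end].rstrip("\n")))
    return pairs


# Stray "<" and "[" in the content no longer cut the text short.
print(split_on_timestamps("[00:22.88]Lorem < ipsum <11:53.82>dolor [sit] amet\n"))
# [('00:22.88', 'Lorem < ipsum '), ('11:53.82', 'dolor [sit] amet')]
```

Looking ahead to the start of the next match, instead of remembering the end of the previous one, avoids the extra step after the loop for the final piece of text.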

