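Splitting text into separate elements using a pattern - javascript

Apologies in advance for my atrocious code and my attempt to explain what I'm trying to achieve...

I want to take various transcripts that contain timestamps and convert them into a consistent format for creating subtitles. The transcripts come from different sources, and the file structure and the timestamps differ, sometimes even within the same file.

The timestamp format is [HH:MM:SS.FF] (with variations that I can handle) and it is embedded in the text. Timestamps sometimes mark an end point (usually they are just start points).

So the format is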

[Timestamp1]Some text with various line breaks and weird characters.
[Timestamp2]More text where this transcript continues but ends with some silence after this
[Timestamp3]
[Timestamp4]The next sentence begins and ends at the last
[Timestamp5]

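What is the best way to code this in JavaScript? I've gone round the houses with string.split and re.matchAll, but none of the regex patterns I've come up with can handle 2 timestamps in a row.

I think ideally I'd have a regex pattern that grabs the timestamps, and then store an array of objects, each with a start and end timestamp (if no end exists, the end is the next start) and the associated text.

So for the example above, I'd have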

Start: Timestamp1 End: Timestamp2 Text: "Some text..."

Start: Timestamp2 End: Timestamp3 Text: "More text..."

Start: Timestamp4 End: Timestamp5 Text: "The next..."

Here is one of my latest attempts...

function test(){
        str = 
        `[09:35:10.00]
        1. Lorem ipsum...
        [09:35:13.11]
        [09:35:15.14]
        2. sed do eiusmod...
        [09:35:39.20]
        3. anim id est laborum...
        [09:35:43.17]`

        var re = /(?<tc1>\[?(?:[0-1][0-9]|2[0-3]|[0-9]):(?:[0-5][0-9]):(?:[0-5][0-9])(?:\.(?:[0-9]{2,3})?\]?))\s*(.*)\s*(?<tc2>\[?(?:[0-1][0-9]|2[0-3]|[0-9]):(?:[0-5][0-9]):(?:[0-5][0-9])(?:\.(?:[0-9]{2,3})?\]?))?.*/gm;

        const matches = str.matchAll(re);
        for (const match of matches) {
                console.log(`Start TC:\n${match[1]}\nText:\n${match[2]}\nTC2:\n${match[3]}`);
        }
}

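Unfortunately, this doesn't cope with the variations.

Thanks for any pointers in the right direction.

Reply from a uj5u.com user:

The pattern needs to consist of 3 parts: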

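Match and capture a timestamp: [, followed by digits, colons and a period, then ]: \[\d{2}:\d{2}:\d{2}\.\d{2}\]
Match and capture anything that is not a timestamp: (?:(?!TIMESTAMP).)+ where TIMESTAMP is the pattern above
Look ahead for and capture a timestamp: just use the timestamp pattern above, wrapped in (?=( ... ))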

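You have to look ahead for the closing timestamp rather than match it normally, because that timestamp may need to be the start of the next match.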
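To make the difference concrete, here is a small sketch comparing a pattern that consumes the closing timestamp with one that only looks ahead for it. The names TS, consuming and lookahead, and the shortened sample string, are made up for this illustration and assume the fixed [HH:MM:SS.FF] form.

```javascript
// Timestamp source shared by both patterns (assumes the fixed [HH:MM:SS.FF] form).
const TS = String.raw`\[\d{2}:\d{2}:\d{2}\.\d{2}\]`;

// Hypothetical sample: two text segments delimited by three timestamps.
const sample = '[09:35:10.00]first[09:35:13.11]second[09:35:15.14]';

// Variant A consumes the closing timestamp, so it cannot start the next match.
const consuming = new RegExp(`(${TS})((?:(?!${TS}).)+)(${TS})`, 'gs');

// Variant B only looks ahead for the closing timestamp, leaving it in place.
const lookahead = new RegExp(`(${TS})((?:(?!${TS}).)+)(?=(${TS}))`, 'gs');

const consumedTexts = [...sample.matchAll(consuming)].map((m) => m[2]);
const lookaheadTexts = [...sample.matchAll(lookahead)].map((m) => m[2]);

console.log(consumedTexts);  // only the first segment is found
console.log(lookaheadTexts); // both segments are found
```

In variant A the middle timestamp is swallowed as the end of the first match, so the second segment has no opening timestamp left to anchor on.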

Putting it together, you get:

const str =
  `[09:35:10.00]
        1. Lorem ipsum...
        [09:35:13.11]
        [09:35:15.14]
        2. sed do eiusmod...
        [09:35:39.20]
        3. anim id est laborum...
        [09:35:43.17]`

var re = /(\[\d{2}:\d{2}:\d{2}\.\d{2}\])((?:(?!\[\d{2}:\d{2}:\d{2}\.\d{2}\]).)+)(?=(\[\d{2}:\d{2}:\d{2}\.\d{2}\]))/gs;

const matches = str.matchAll(re);
for (const match of matches) {
  console.log(`Start TC:\n${match[1]}\nText:\n${match[2]}\nTC2:\n${match[3]}`);
}

Or, with the regular expression commented:

const pattern = makeExtendedRegExp(String.raw`
( # First capture group: timestamp
  \[\d{2}:\d{2}:\d{2}\.\d{2}\]
)
( # Second capture group: text
  (?:(?!
    # Timestamp pattern again:
    \[\d{2}:\d{2}:\d{2}\.\d{2}\]
  ).)+
)
(?=( # Look ahead for and capture the timestamp in 3rd group:
  # Timestamp pattern again:
  \[\d{2}:\d{2}:\d{2}\.\d{2}\]
))
`, 'gs');



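function makeExtendedRegExp(inputPatternStr, flags) {
  // Strip everything from an unescaped `#` to the end of its line,
  // then strip leading/trailing whitespace on each line and all newlines.
  const cleanedPatternStr = inputPatternStr
    .replace(/(^|[^\\]) *#.*/g, '$1')
    .replace(/^\s+|\s+$|\n/gm, '');
  return new RegExp(cleanedPatternStr, flags);
}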


const str =
  `[09:35:10.00]
        1. Lorem ipsum...
        [09:35:13.11]
        [09:35:15.14]
        2. sed do eiusmod...
        [09:35:39.20]
        3. anim id est laborum...
        [09:35:43.17]`

const matches = str.matchAll(pattern);
for (const match of matches) {
  console.log(`Start TC:\n${match[1]}\nText:\n${match[2]}\nTC2:\n${match[3]}`);
}
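
The question asks for an array of objects with a start, an end and the associated text. One possible way to get there from the matches is sketched below. The helper name toCues, the constant names STAMP and cuePattern, and the object shape { start, end, text } are assumptions for illustration, not part of the answer above. Because the pattern also matches the whitespace between two timestamps in a row, the sketch trims each text segment and skips the empty ones.

```javascript
// Timestamp source (assumes the fixed [HH:MM:SS.FF] form used in the answer).
const STAMP = String.raw`\[\d{2}:\d{2}:\d{2}\.\d{2}\]`;
const cuePattern = new RegExp(`(${STAMP})((?:(?!${STAMP}).)+)(?=(${STAMP}))`, 'gs');

// Hypothetical helper: turn a transcript into { start, end, text } objects.
function toCues(transcript) {
  const cues = [];
  for (const [, start, rawText, end] of transcript.matchAll(cuePattern)) {
    const text = rawText.trim();
    // A whitespace-only segment means two timestamps in a row (silence), so skip it.
    if (text !== '') cues.push({ start, end, text });
  }
  return cues;
}

const transcript = `[09:35:10.00]
        1. Lorem ipsum...
        [09:35:13.11]
        [09:35:15.14]
        2. sed do eiusmod...
        [09:35:39.20]
        3. anim id est laborum...
        [09:35:43.17]`;

const cues = toCues(transcript);
console.log(cues);
```

For the sample this gives three objects, running from [09:35:10.00] to [09:35:13.11], from [09:35:15.14] to [09:35:39.20], and from [09:35:39.20] to [09:35:43.17]. Note that start and end keep their square brackets, and any text after the final timestamp is not captured because no timestamp follows to close it.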

