Extracting substrings from a large string

I have a string:

string="(2021-07-02 01:00:00 AM BST)  
---  
syl.hs has joined the conversation  
  
  

(2021-07-02 01:00:23 AM BST)  
---  
e.wang  
Good Morning
How're you?
  
  
  

(2021-07-02 01:05:11 AM BST)  
---  
wk.wang  
Hi, I'm Good.  
  
  

(2021-07-02 01:08:01 AM BST)  
---  
perter.derrek   
we got the update on work. 
It will get complete by next week.

(2021-07-15 08:59:41 PM BST)  
---  
ad.ft has left the conversation  
  
  
  
  
---  
  
* * *"

I want to extract only the conversation text (the text between a name and the next timestamp). The expected output is:

comments = ["Good Morning How're you?", "Hi, I'm Good.", "we got the update on work. It will get complete by next week."]

What I have tried:

comments = re.findall(r'---\s*\n(.+(?:\n(?!(?:(\s+\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*)\w+\s*\n)?---).+)*)', string)

A uj5u.com user replied:

You can use a single capture group:

^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)

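The pattern matches:

^ Start of the string (start of a line when re.M is set)
---\s*\n Match ---, optional whitespace characters, and a newline
(?!.* has (?:joined|left) the conversation|\* \* \*) Assert that the line does not contain "has joined the conversation" or "has left the conversation", and does not start with * * *
\S.* Match at least one non-whitespace character at the start of the line, then the rest of the line
( Capture group 1 (this is what re.findall returns)
(?:\n(?!\(\d|---).*)* Match all following lines that do not start with ( followed by a digit, or with ---
) Close group 1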
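
The breakdown above can also be written straight into the pattern. The following is a minimal sketch, not part of the original answer: it assumes Python's re.VERBOSE flag and a shortened, made-up sample log in the question's layout. The names PATTERN, sample and messages are illustrative only.

```python
import re

# Hypothetical, shortened sample in the same layout as the question's log.
sample = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation

(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?

(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.

---

* * *"""

# Same pattern as above, spread out with re.VERBOSE so each part has a comment.
# In verbose mode literal spaces have to be escaped with a backslash.
PATTERN = re.compile(r"""
    ^---\s*\n                    # the --- line, optional trailing whitespace
    (?!                          # the next line must not be ...
        .*\ has\ (?:joined|left)\ the\ conversation   # ... a join/leave notice
      | \*\ \*\ \*               # ... or the closing * * * marker
    )
    \S.*                         # the name line (starts with a non-blank)
    (                            # group 1: the message body
        (?:\n(?!\(\d|---).*)*    # lines not starting with ( + digit or ---
    )
""", re.MULTILINE | re.VERBOSE)

# findall returns group 1 for every match; strip the surrounding newlines.
messages = [m.strip() for m in PATTERN.findall(sample) if m.strip()]
print(messages)
```

The join notice and the closing marker are rejected by the negative lookahead, so only the two real messages remain in the list.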

See the regex demo and the Python demo.

Example

import re

# s holds the chat-log string from the question
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
result = [m.strip() for m in re.findall(pattern, s, re.M) if m]
print(result)

Output

["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']
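
The question asked for each comment as a single line, whereas the output above keeps the line breaks inside a message. One possible post-processing step, assuming that collapsing every run of whitespace into one space is acceptable (the name comments is taken from the question, the rest is illustrative):

```python
# The list printed above, copied in so that this snippet runs on its own.
result = ["Good Morning\nHow're you?", "Hi, I'm Good.",
          'we got the update on work. \nIt will get complete by next week.']

# str.split() with no arguments splits on any run of whitespace, newlines
# included, so re-joining the pieces with one space flattens each message.
comments = [" ".join(m.split()) for m in result]
print(comments)
```

This leaves the words of each message untouched and only changes the whitespace between them.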

A uj5u.com user replied:

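I assume that:

The text of interest begins after a three-line block: a line containing a timestamp; then the line "---", which may be padded on the right with spaces; then a line consisting of a string of letters containing one period that is at neither the beginning nor the end of the string, which may also be padded on the right with spaces.
The block of text of interest may contain blank lines, a blank line being a string that contains only spaces and a line terminator.
The last line of the block of text of interest cannot be a blank line.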

I believe the following regular expression (with the multiline (m) and case-insensitive (i) flags set) meets these requirements.

^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)

The block of lines of interest is contained in capture group 1.

Start your engine!

The elements of the expression are as follows.

^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n  # match timestamp line
-{3} *\r?\n                         # match 3-hyphen line
[a-z]+\.[a-z]+ *\r?\n               # match name
(                                   # begin capture group 1
  (?:                               # begin non-capture group (a)
    .*[^ (\n].*\r?\n                # match a non-blank line
    |                               # or
    \ *\r?\n                        # match a blank line
    (?=                             # begin a positive lookahead
      (?:                           # begin non-capture group (b)
        \ *\r?\n                    # match a blank line
      )*                            # end non-capture group b and execute 0+ times
      (?!                           # begin a negative lookahead
        \(\d{4}\-\d{2}\-\d{2} .*\)  # match timestamp line
      )                             # end negative lookahead
      .*[^ (\n]                     # match a non-blank line
    )                               # end positive lookahead
  )*                                # end non-capture group a and execute 0+ times
)                                   # end capture group 1
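
For anyone who wants to try the expression in Python rather than in an online tester, here is a sketch under a few assumptions: the two flags are passed as re.MULTILINE and re.IGNORECASE, the sample log is a shortened, made-up one with a blank line inside the first message, and the trailing line terminator of each captured block is stripped afterwards. The names PATTERN, sample and blocks are illustrative.

```python
import re

# Hypothetical sample: the first message contains a blank line on purpose.
sample = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation

(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning

How're you?


(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.

(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation"""

# The expression from above, split over adjacent string literals for comments.
PATTERN = re.compile(
    r"^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n"   # timestamp line
    r"-{3} *\r?\n"                           # 3-hyphen line
    r"[a-z]+\.[a-z]+ *\r?\n"                 # name line
    r"((?:.*[^ (\n].*\r?\n"                  # group 1: a non-blank line, or ...
    r"| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)",
    re.MULTILINE | re.IGNORECASE,            # ... a blank line inside the block
)

# Each captured block ends with a line terminator; remove it for display.
blocks = [b.rstrip() for b in PATTERN.findall(sample)]
print(blocks)
```

The blank line between the two lines of the first message is kept, while the blank lines in front of the next timestamp are left out, which is what the second and third assumptions describe.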

A uj5u.com user replied:

Here is a self-documenting regular expression that also strips leading and trailing whitespace:

(?x)(?m)(?s)                                                    # re.X, re.M, re.S (DOTALL)
(?:                                                             # start of non capturing group
 ^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n  # date and time
 (?!---\s*\r?\nad\.ft has)                                      # the next lines must not be ---\nad.ft has ...
 ---\s*\r?\n                                                    # --- line
 [\w.]+\s*\r?\n                                                 # name line
 \s*                                                            # skip leading whitespace
)                                                               # end of non-capture group
# The following is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace

See the regex demo.

See the Python demo:

import re

string = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation



(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?




(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.



(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.

(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation




---

* * *"""

regex = r'''(?x)(?m)(?s)                                        # re.X, re.M, re.S (DOTALL)
(?:                                                             # start of non capturing group
 ^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n  # date and time
 (?!---\s*\r?\nad\.ft has)                                      # the next lines must not be ---\nad.ft has ...
 ---\s*\r?\n                                                    # --- line
 [\w.]+\s*\r?\n                                                 # name line
 \s*                                                            # skip leading whitespace
)                                                               # end of non-capture group
# The following is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
'''

matches = re.findall(regex, string)
print(matches)

Prints:

["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work.\nIt will get complete by next week.']
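
As an alternative to one large pattern (a different technique from the answers above, offered only as a sketch), the log can be walked line by line. This assumes that each entry begins with a timestamp line, followed by ---, a name line made only of word characters and periods, and then the message text. The names extract_messages, TIMESTAMP and NAME are hypothetical. Any footer lines after the final entry would be attached to that entry, which is harmless here because the final entry is a leave notice.

```python
import re

# Assumption: every entry starts with a line shaped like "(YYYY-MM-DD ...)".
TIMESTAMP = re.compile(r"^\(\d{4}-\d{2}-\d{2} .*\)\s*$")
# Assumption: a name line holds only word characters and periods.
NAME = re.compile(r"[\w.]+")


def extract_messages(log):
    """Return one flattened string per chat message, skipping notices."""
    entries, current = [], None
    for line in log.splitlines():
        if TIMESTAMP.match(line):
            current = []                 # a timestamp opens a new entry
            entries.append(current)
        elif current is not None:
            current.append(line.strip())

    messages = []
    for lines in entries:
        # Expected layout: the --- line, the name line, then the text.
        if len(lines) < 3 or lines[0] != "---" or not NAME.fullmatch(lines[1]):
            continue                     # join/leave notice or malformed entry
        text = " ".join(ln for ln in lines[2:] if ln)
        if text:
            messages.append(text)
    return messages


# Hypothetical, shortened sample in the question's layout.
sample = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation

(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?

(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.

(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation

---

* * *"""

print(extract_messages(sample))
```

The trade-off is more code in exchange for checks that are easier to read and adjust than a single expression.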

Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/356557.html

Tags: python-3.x regex string
