I have a string:
string="(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"
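I only want to extract the conversation text (the text between the name and the next timestamp). The expected output is:
comments=["Good Morning How're you?","Hi, I'm Good.","we got the update on work. It will get complete by next week."]
What I have tried is: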
comments=re.findall(r'---\s*\n(.+(?:\n(?!(?:(\s+\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*)\w+\s*\n)?---).+)*)',string)
A uj5u.com user replied:
You can use a single capture group:
^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)
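The pattern matches:
- ^ Start of the string (with re.M, the start of any line)
- ---\s*\n Match ---, optional whitespace characters, and a newline
- (?!.* has (?:joined|left) the conversation|\* \* \*) Negative lookahead: assert that the next line does not contain " has joined the conversation" or " has left the conversation", and does not start with * * *
- \S.* Match at least one non-whitespace character at the start of the line, then the rest of the line (the name line)
- ( Open capture group 1 (this is what re.findall returns)
- (?:\n(?!\(\d|---).*)* Match every following line that does not start with ( followed by a digit, or with ---
- ) Close group 1
See the regex demo and the Python demo.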
Example
import re

# s is assumed to hold the chat log string from the question
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
result = [m.strip() for m in re.findall(pattern, s, re.M) if m]
print(result)
Output
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']
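The expected output in the question shows each message on a single line, while the result above keeps the embedded newlines. A minimal sketch of one way to get the single-line form, assuming it is acceptable to collapse every run of whitespace into one space (that normalization is an assumption of this sketch, not part of the answer). The string s below is a shortened copy of the log from the question:

```python
import re

# Shortened copy of the chat log from the question (sample data for this sketch).
s = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""

# Same pattern as in the answer above.
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"

# Group 1 holds the message lines; split() + join collapses newlines
# and repeated spaces into single spaces.
comments = [" ".join(m.split()) for m in re.findall(pattern, s, re.M) if m.strip()]
print(comments)
```

If the line breaks inside a message matter, keep m.strip() as in the answer instead.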
A uj5u.com user replied:
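I have assumed that:
- the text of interest begins after a block of three lines: a line containing a timestamp, then the line "---", which may be padded on the right with spaces, then a line consisting of a string of letters that contains one period, where the period is neither at the beginning nor at the end of the string and the string may be padded on the right with spaces.
- a block of text of interest may contain blank lines, a blank line being a string that contains nothing but spaces and a line terminator.
- the last line of a block of text of interest cannot be a blank line.
I believe the following regular expression (with the multiline (m) and case-insensitive (i) flags set) meets these requirements.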
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)
The block of lines of interest is contained in capture group 1.
Start your engine!
The elements of the expression are as follows.
^\(\d{4}\-\d{2}\-\d{2}\ .*\)\ *\r?\n   # match timestamp line
-{3}\ *\r?\n                           # match 3-hyphen line
[a-z]+\.[a-z]+\ *\r?\n                 # match name
(                                      # begin capture group 1
  (?:                                  # begin non-capture group (a)
    .*[^ (\n].*\r?\n                   # match a non-blank line
    |                                  # or
    \ *\r?\n                           # match a blank line
    (?=                                # begin a positive lookahead
      (?:                              # begin non-capture group (b)
        \ *\r?\n                       # match a blank line
      )*                               # end non-capture group (b) and execute 0 or more times
      (?!                              # begin a negative lookahead
        \(\d{4}\-\d{2}\-\d{2}\ .*\)    # match timestamp line
      )                                # end negative lookahead
      .*[^ (\n]                        # match a non-blank line
    )                                  # end positive lookahead
  )*                                   # end non-capture group (a) and execute 0 or more times
)                                      # end capture group 1
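To try the three assumptions in Python, here is a minimal sketch of a variant of this pattern. It differs in one respect, which is an addition of this sketch and not part of the answer: the non-blank-line branch is guarded by the same negative lookahead for a timestamp that the blank-line branch already uses, because in Python's re the subexpression .*[^ (\n].* on its own would also accept a timestamp line. The sample log and the names TS, PATTERN and messages are hypothetical.

```python
import re

# Hypothetical sample log for this sketch; the first message contains a blank line.
log = """(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning

How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
"""

TS = r"\(\d{4}-\d{2}-\d{2} .*\)"  # timestamp line, as in the pattern above

PATTERN = (
    r"^" + TS + r" *\r?\n"        # timestamp line
    r"-{3} *\r?\n"                # 3-hyphen line
    r"[a-z]+\.[a-z]+ *\r?\n"      # name line
    r"((?:"                       # begin capture group 1
    r"(?!" + TS + r").*[^ (\n].*\r?\n"   # non-blank line (timestamp guard added by this sketch)
    r"| *\r?\n(?=(?: *\r?\n)*(?!" + TS + r").*[^ (\n])"  # blank line with more text after it
    r")*)"                        # end capture group 1
)

# Each captured block ends with a line terminator, so strip it on the right.
messages = [m.rstrip() for m in re.findall(PATTERN, log, re.M | re.I)]
print(messages)
```

The blank line inside the first message is kept, and the "has left the conversation" line is skipped because it does not look like a bare name line.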
A uj5u.com user replied:
Here is a self-documenting regex that also strips leading and trailing whitespace:
(?x)(?m)(?s) # re.X, re.M, re.S (DOTALL)
(?: # start of non-capturing group
^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n # date and time
(?!---\s*\r?\nad\.ft\ has) # the next lines are not "---" followed by "ad.ft has ..."
---\s*\r?\n # --- line
[\w.]+\s*\r?\n # name line
\s* # skip leading whitespace
) # end of non-capturing group
# The following is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*) # skip trailing whitespace
See the regex demo
See the Python demo
import re
string = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""
regex = r'''(?x)(?m)(?s) # re.X, re.M, re.S (DOTALL)
(?: # start of non-capturing group
^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n # date and time
(?!---\s*\r?\nad\.ft\ has) # the next lines are not "---" followed by "ad.ft has ..."
---\s*\r?\n # --- line
[\w.]+\s*\r?\n # name line
\s* # skip leading whitespace
) # end of non-capturing group
# The following is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*) # skip trailing whitespace
'''
matches = re.findall(regex, string)
print(matches)
Prints:
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work.\nIt will get complete by next week.']
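For comparison, the same extraction can be sketched without one large pattern, by scanning the log line by line. This is an alternative to the regex above, not the answer's method. It assumes that a message header is always a timestamp line, a "---" line, and a line holding only a name, and it accepts any word as the time zone rather than only BST. The names TIMESTAMP, NAME, extract_messages and sample are hypothetical.

```python
import re

# Assumed line shapes: "(YYYY-MM-DD HH:MM:SS AM|PM ZONE)" and a bare "user.name".
TIMESTAMP = re.compile(r"^\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M \w+\)\s*$")
NAME = re.compile(r"^[\w.]+\s*$")


def extract_messages(text):
    """Return the message bodies that follow a timestamp / --- / name header."""
    lines = text.splitlines()
    messages = []
    i = 0
    while i < len(lines):
        # A header is three consecutive lines: timestamp, "---", name only.
        is_header = (
            i + 2 < len(lines)
            and TIMESTAMP.match(lines[i])
            and lines[i + 1].strip() == "---"
            and NAME.match(lines[i + 2])
        )
        if not is_header:
            i += 1
            continue
        i += 3
        body = []
        # Collect lines until the next timestamp line or the end of the text.
        while i < len(lines) and not TIMESTAMP.match(lines[i]):
            body.append(lines[i])
            i += 1
        message = "\n".join(body).strip()
        if message:
            messages.append(message)
    return messages


# Shortened sample based on the log in the question.
sample = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""

print(extract_messages(sample))
```

The joined and left notices are skipped because their third line is not a bare name, so no user name has to be hard-coded.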
