Regex to match the paragraphs between two substrings

I have a string like this:

string=""
( 2021-07-10 01:24:55 PM GMT )TEST  
---  
Badminton is a racquet sport played using racquets to hit a shuttlecock across
a net. Although it may be played with larger teams, the most common forms of
the game are "singles" (with one player per side) and "doubles" (with two
players per side).  
  
  

  

( 2021-07-10 01:27:55 PM GMT )PATRICKWARR  
---  
Good morning, I am doing well. And you?  
  
  

  
  
  
---  
  
  
  
  
---  
  
* * *""

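I am trying to split the string into the following parts:

text=['Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).', 'Good morning, I am doing well. And you?']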

What I have tried:

text=re.findall(r'\( \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} PM GMT \)\w+  [\S\n]---  .*',string)

I don't know how to extract multiple lines.

Answer from a uj5u.com user:

You can use

(?m)^\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n---\s*\n(.*(?:\n(?!(?:\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n)?---).*)*)

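Details of the pattern:

^ - start of a line
{left_rx} - the left boundary pattern
--- - three hyphens
\s*\n - zero or more whitespace characters, then an LF character
(.*(?:\n(?!(?:{left_rx})?---).*)*) - Group 1:
- .* - zero or more characters other than line break characters, as many as possible
- (?:\n(?!(?:{left_rx})?---).*)* - zero or more lines (possibly empty ones, because of .*) that do not start with the (optional) left boundary pattern followed by ---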

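The boundary pattern stored in left_rx is \(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n. It is basically the same as the original one, except that I used \s* to match zero or more whitespace characters and \s+ to match one or more whitespace characters between "words".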
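The "keep taking lines until the next boundary" part of Group 1 can be seen in isolation on a smaller input. This is only a minimal sketch: the sample string and the name section_rx are made up for illustration, and the boundary is reduced to a bare --- line instead of the full left_rx header.

```python
import re

# Hypothetical sample: two sections, each introduced by a "---" line.
sample = "---\nfirst line\nsecond line\n---\nthird line"

# After the "---" line, capture the first line, then keep consuming
# "newline + line" pairs as long as the next line does not start with "---".
section_rx = re.compile(r"^---\n(.*(?:\n(?!---).*)*)", re.M)

sections = section_rx.findall(sample)
print(sections)  # ['first line\nsecond line', 'third line']
```

The negative lookahead is checked right after each newline, so the loop stops before the newline that precedes the next boundary and the boundary line itself is never consumed.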

Python demo:

import re
text = '''string=""\n( 2021-07-10 01:24:55 PM GMT )TEST  \n---  \nBadminton is a racquet sport played using racquets to hit a shuttlecock across\na net. Although it may be played with larger teams, the most common forms of\nthe game are "singles" (with one player per side) and "doubles" (with two\nplayers per side).  \n  \n  \n\n  \n\n( 2021-07-10 01:27:55 PM GMT )PATRICKWARR  \n---  \nGood morning, I am doing well. And you?  \n  \n  \n\n  \n  \n  \n---  \n  \n  \n  \n  \n---  \n  \n* * *""'''
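# Header line: "( <date> <time> AM/PM GMT )<USER>" followed by a line break
left_rx = r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n"
# Group 1 collects every line up to the next "---" line or the next header
rx = re.compile(fr"^{left_rx}---\s*\n(.*(?:\n(?!(?:{left_rx})?---).*)*)", re.M)
print([x.strip().replace('\n', ' ') for x in rx.findall(text)])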

Output:

['Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).', 'Good morning, I am doing well. And you?']
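If the timestamp and the user name are needed together with each message, the same pattern could be extended with named groups. The following is a sketch under assumptions, not part of the answer above: the group names stamp, user and body and the shortened sample are hypothetical, and the lookahead keeps a separate copy of the header without named groups because Python's re module does not allow a group name to be defined twice.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's string.
sample = (
    "( 2021-07-10 01:24:55 PM GMT )TEST  \n"
    "---  \n"
    "Badminton is a racquet sport\n"
    "played with a shuttlecock.  \n"
    "  \n"
    "( 2021-07-10 01:27:55 PM GMT )PATRICKWARR  \n"
    "---  \n"
    "Good morning, I am doing well. And you?  \n"
    "---  \n"
    "* * *"
)

stamp = r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT"
# Plain header, used only inside the negative lookahead.
left_rx = rf"\(\s*{stamp}\s*\)\w+\s*\n"
# Same header with named groups (hypothetical names) for the actual match.
named_rx = rf"\(\s*(?P<stamp>{stamp})\s*\)(?P<user>\w+)\s*\n"
rx = re.compile(
    rf"^{named_rx}---\s*\n(?P<body>.*(?:\n(?!(?:{left_rx})?---).*)*)", re.M
)

# Collapse all whitespace runs in the body into single spaces.
messages = [
    (m["stamp"], m["user"], " ".join(m["body"].split()))
    for m in rx.finditer(sample)
]
for message in messages:
    print(message)
```

Each tuple holds the timestamp text, the user name, and the message body on one line.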

Answer from a uj5u.com user:

One approach:

import re

# `string` is the multi-line string from the question.
# Replace all \n with ''
string = string.replace('\n', '')

# Remove headers such as '( 2021-07-10 01:27:55 PM GMT )PATRICKWARR' and the asterisks in '* * *'
string = re.sub(r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} [AP]M GMT\s*\)\w+|\*+", '', string)

# Split on the separator and drop the pieces that are empty after stripping
data = string.split('---')
data = [item.strip() for item in data if item.strip()]
print(data)

Output:

['Badminton is a racquet sport played using racquets to hit a shuttlecock acrossa net. Although it may be played with larger teams, the most common forms ofthe game are "singles" (with one player per side) and "doubles" (with twoplayers per side).', 'Good morning, I am doing well. And you?']
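Note that this output glues the words at the line breaks together ("acrossa", "ofthe", "twoplayers"), because the newlines are deleted rather than replaced. One possible variation, shown here only as a sketch on a shortened hypothetical sample, keeps the newlines until after the split and then collapses every whitespace run into a single space:

```python
import re

# Shortened, hypothetical sample in the same layout as the question's string.
sample = (
    "( 2021-07-10 01:24:55 PM GMT )TEST  \n"
    "---  \n"
    "hit a shuttlecock across\n"
    "a net.  \n"
    "  \n"
    "( 2021-07-10 01:27:55 PM GMT )PATRICKWARR  \n"
    "---  \n"
    "Good morning, I am doing well. And you?  \n"
    "---  \n"
    "* * *"
)

# Drop the header lines and the asterisks first.
cleaned = re.sub(
    r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} [AP]M GMT\s*\)\w+|\*+", "", sample
)

# Split on the separator, then collapse every whitespace run (including
# newlines) into one space so that wrapped lines stay separated.
parts = [" ".join(item.split()) for item in cleaned.split("---")]
data = [item for item in parts if item]
print(data)  # ['hit a shuttlecock across a net.', 'Good morning, I am doing well. And you?']
```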

Please credit the source when reposting. Article link: https://www.uj5u.com/yidong/341429.html
