Extracting a substring using a regex with alternatives in Python

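I tried searching earlier posts but could not find anything that exactly matches what I am looking for, so here goes.

I am trying to parse strings in a dataframe and capture a certain substring (the year) when a match is found. The format can vary a lot. I came up with an inelegant way to do it, but I am wondering whether there is a better one.

The strings can look like this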

Random Text 31.12.2020
1.1. -31.12.2020
010120-311220
31.12.2020
1.1.2020-31.12.2020 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words

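I am after the year, which I currently get by finding the last date and taking its year. The current regex is .*3112(\d{2,4})|.*31\.12\.(\d{2,4})

It returns 20 in group 1 for 010120-311220 and 2020 in group 2 for 1.1.2020-31.12.2020 -.

The problem is that I cannot know in advance which group the match belongs to: in the first example group 2 is None, and in the second example group 1 is None when using re.match(regexPattern, stringOfInterest). So I cannot naively call .group(1) on the match object to get the value, because sometimes the value is in .group(2).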
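To make the problem concrete, here is a minimal sketch that shows which group captures the year for two of the sample strings. It assumes the leading part of each alternative is .* and that the pattern is applied with re.match; the helper name show_groups is made up for illustration.

```python
import re

# Two alternatives, each with its own capture group (assumed leading `.*`).
pattern = re.compile(r'.*3112(\d{2,4})|.*31\.12\.(\d{2,4})')

def show_groups(text):
    """Return both capture groups; the group that did not take part is None."""
    match = pattern.match(text)
    return match.groups() if match else None

print(show_groups('010120-311220'))          # ('20', None)
print(show_groups('1.1.2020-31.12.2020 -'))  # (None, '2020')
```

Whichever alternative matched decides which position in the tuple is filled, which is why a fixed .group(1) is not enough.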

The best I have come up with so far is to name the groups, (?P<groupName>\d{2,4}), and check for None values

def getYear(stringOfInterest):
    # Each alternative has its own named group; only one of them can match.
    regexPattern = r'(^|.*)3112(?P<firstMatchType>\d{2,4})|(^|.*)31\.12\.(?P<secondMatchType>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        matchDict = matchObject.groupdict()
        if matchDict['firstMatchType'] is not None:
            return matchDict['firstMatchType']
        else:
            return matchDict['secondMatchType']
    return None

import re
df['year'] = df['text'].apply(getYear)

While this works, it intuitively feels like a clumsy way to do it. Any ideas?

Reply from a uj5u.com user:

It looks like all your years are from the 21st century. In that case, you only need

df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)

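See the regex demo. Details:

.* - any zero or more characters other than line break characters, as many as possible
31\.?12\.? - 31, an optional . character, 12, and another optional . character
(?:\d{2})? - an optional sequence of two digits
(\d{2}) - Group 1: the last two digits of the year.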
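As a quick check of the breakdown above, the same pattern can be tried with the standard re module outside of Pandas. This is only a sketch: the sample strings come from the question, and the helper name last_two_digits is hypothetical.

```python
import re

# Same pattern as in the answer; group 1 holds the last two digits of the year.
pattern = re.compile(r'.*31\.?12\.?(?:\d{2})?(\d{2})')

def last_two_digits(text):
    """Return the last two digits of the year after 31.12 / 3112, or None."""
    match = pattern.match(text)
    return match.group(1) if match else None

for sample in ['1.1.2019 - 31.12.2019', '010120-311220 other random words']:
    # Prepending '20' relies on the assumption that all years are 20xx.
    print(sample, '->', '20' + last_two_digits(sample))
```

For the six-digit form, the optional (?:\d{2})? first consumes 20, then gives it back so that the capture group can still match two digits.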

See the Pandas test:

import pandas as pd
df = pd.DataFrame({'text': ['Random Text 31.12.2020','1.1. -31.12.2020','010120-311220','31.12.2020','1.1.2020-31.12.2020 -','1.1.2019 - 31.12.2019','1.1. . . 31.12.2019 -','1.1.2019 - -31.12.2019','010120-311220 other random words']})
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)

Output:

>>> df
                               text  year
0            Random Text 31.12.2020  2020
1                  1.1. -31.12.2020  2020
2                     010120-311220  2020
3                        31.12.2020  2020
4             1.1.2020-31.12.2020 -  2020
5             1.1.2019 - 31.12.2019  2019
6             1.1. . . 31.12.2019 -  2019
7            1.1.2019 - -31.12.2019  2019
8  010120-311220 other random words  2020

Reply from a uj5u.com user:

We can try using re.findall here on your list of inputs, with a regex alternation that covers both variants:

import re

inp = ["Random Text 31.12.2020", "1.1. -31.12.2020", "010120-311220", "31.12.2020", "1.1.2020-31.12.2020 -", "1.1.2019 - 31.12.2019", "1.1. . . 31.12.2019 -", "1.1.2019 - -31.12.2019", "010120-311220 other random words"]
# Keep the last (four_digit, two_digit) tuple found in each input string
output = [re.findall(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})', x)[-1] for x in inp]
# Pick whichever of the two groups is non-empty
output = [x[0] if x[0] else x[1] for x in output]
print(output)  # ['2020', '2020', '20', '2020', '2020', '2019', '2019', '2019', '20']

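The strategy here is to match either of the two date variants. We keep the last match for each input. Then we use a list comprehension to pick the non-empty value. Note that there are two capture groups, and for any given match only one of them will be non-empty.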
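One caveat: indexing with [-1] raises IndexError when a string contains no date at all. A hedged sketch of the same strategy wrapped in a function could look like this; the name last_year and the None fallback are assumptions for illustration, not part of the answer.

```python
import re

# Either a dotted date with a four-digit year, or six digits (ddmmyy).
DATE_RE = re.compile(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})')

def last_year(text):
    """Return the year captured by the last date in text, or None if no date is found."""
    matches = DATE_RE.findall(text)
    if not matches:
        return None
    four_digit, two_digit = matches[-1]  # exactly one of the two is non-empty
    return four_digit or two_digit

print(last_year('1.1.2019 - -31.12.2019'))  # 2019
print(last_year('010120-311220'))           # 20
print(last_year('no date here'))            # None
```

With a dataframe, a function like this could be passed to df['text'].apply in the same way as getYear in the question.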

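Reply from a uj5u.com user:

Your regex can be factored considerably by grouping only the alternation for the start of the date; this removes the need to check two groups: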

regexPattern = r'(?:^|.*)(?:3112|31\.12\.)(?P<year>\d{2,4})'

After extracting the group, it can be normalized to a proper four-digit year:

if matchObject is not None:
    return ('20' + matchObject.group('year'))[-4:]

Putting it all together, we get:

import re

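def getYear(stringOfInterest):
    # The alternation covers only the date prefix, so one named group is enough.
    regexPattern = r'(?:^|.*)(?:3112|31\.12\.)(?P<year>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        # Prefix '20' and keep the last four characters: '20' -> '2020'.
        return ('20' + matchObject.group('year'))[-4:]
    return None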

df['year'] = df['text'].apply(getYear)
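As a rough sanity check, the function can be run over a few sample strings from the question without Pandas. The sketch below assumes the leading .* in the pattern and compiles it once; the names YEAR_RE and get_year are illustrative only.

```python
import re

# Alternation covers only the date prefix, so a single named group is enough.
YEAR_RE = re.compile(r'(?:^|.*)(?:3112|31\.12\.)(?P<year>\d{2,4})')

def get_year(text):
    """Return a four-digit year string, or None when no 31.12 date is found."""
    match = YEAR_RE.match(text)
    if match is None:
        return None
    # Prefix '20' and keep the last four characters: '20' -> '2020', '2019' -> '2019'.
    return ('20' + match.group('year'))[-4:]

samples = ['Random Text 31.12.2020', '010120-311220', '1.1. . . 31.12.2019 -']
print([get_year(s) for s in samples])  # ['2020', '2020', '2019']
```

The slicing step only yields a sensible year when the group captured exactly two or four digits, and it assumes 21st-century dates for the two-digit form.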

Reply from a uj5u.com user:

Here is how I approached the problem; maybe it will be useful


import re
string = '''
Random Text 31.12.2020
1.1. -31.12.2022
010120-311220
31.12.2020
1.1.2020-31.12.2018 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words'''
pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
matches = re.findall(pattern, string)
print("1) ", matches)

# Flatten the list of tuples into a single list
match_array = [i for sub in matches for i in sub]
print(match_array)

# Drop the empty strings left by the group that did not match
res = [element for element in match_array if element.strip()]
print(res)

Please credit the source when reposting. Link to this article: https://www.uj5u.com/yidong/444965.html
