Regex to extract a substring in Python pandas

I have a DataFrame with a column named 'New', shown below:

import pandas as pd

df = pd.DataFrame({'New' : ['emerald shines bright(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)'],
'UI': ['AOT', 'BOT', 'LOV', 'HAP']})

Now I want to extract the various IDs, for example the ones after ABCED and AxYBD and the id in the 'http' entry, into another column.

But when I use

df['New_col'] = df['New'].str.extract(r'.*\((.*)\).*',expand=True)

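I can't get it to work well. For example, for (ABCED ID - 1234556) it returns the entire bracket contents. More importantly, the http id 234555 is not returned at all.

Also, could someone clean the first column by removing the IDs in brackets, so that the result looks something like this: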

                     New   UI    New_col
0  emerald shines bright  AOT    1234556
1     honey in the bread  BOT  123467890
2   http/ABCED/id/234555  LOV     234555
3       healing strenght  HAP    1234556

A uj5u.com user replied:

You can do this with the following code:

reg_expression = r'.*\(.*ID\s*-\s*(.*)\)|http\/.*\/id\/(\d*)'
extract_text = lambda row: row[0][0] if row[0][0] else row[0][1]

df['New_col'] = df['New'].str.findall(reg_expression).apply(extract_text)

Output:

[Image: the resulting DataFrame]

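Explanation:

Based on your dummy example, you have to capture two patterns:

HTTP case pattern: http\/.*\/id\/(\d*)

e.g. http/ABCED/id/234555
Non-HTTP case pattern: .*\(.*ID\s*-\s*(.*)\)

e.g. emerald shines bright(ABCED ID - 1234556)

and combine them into one regular expression using the or (|) operator.

Because the combined pattern has two capture groups, findall returns a tuple per match in which only one group is non-empty. The lambda function picks the non-empty value from that tuple.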
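As a variation on the approach above, the two alternatives can also be handled with str.extract instead of findall plus a lambda. This is only a sketch under some assumptions: the capture groups are narrowed to (\d+) so they only accept digits, and the name groups is introduced here purely for illustration. Rows that match neither alternative would end up as NaN rather than raising an IndexError.

```python
import pandas as pd

df = pd.DataFrame({
    'New': ['emerald shines bright(ABCED ID - 1234556)',
            'honey in the bread(ABCED ID - 123467890)',
            'http/ABCED/id/234555',
            'healing strenght(AxYBD ID -1234556)'],
    'UI': ['AOT', 'BOT', 'LOV', 'HAP'],
})

# Two alternatives with one capture group each (assumption: IDs are digits only).
# For any given row, at most one of the two groups is filled.
pattern = r'\(.*ID\s*-\s*(\d+)\)|http/.*/id/(\d+)'

# With two groups, str.extract returns a DataFrame with columns 0 and 1.
groups = df['New'].str.extract(pattern)

# Use the bracket-style ID where present, otherwise the http-style ID.
df['New_col'] = groups[0].fillna(groups[1])

print(df['New_col'].tolist())
```

The fillna step plays the role of the lambda: it falls back to the second group whenever the first one did not participate in the match.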

A uj5u.com user replied:

Probably not the most elegant answer, but I think this does what you want given the new criteria.

import re
import pandas as pd

df = pd.DataFrame({'New' : ['emerald shines bright(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)'],
'UI': ['AOT', 'BOT', 'LOV', 'HAP']})

# Function to extract the ID from each row
def grab_id(row):
    text = re.findall(r'\(([A-Za-z]+)\sID\s-\s?(\d+)\)|/([0-9]+)', row)
    if text[0][0]:
        return text[0][1]
    else:
        return text[0][2]
    
# Function to remove the ID from the 'New' column    
def remove_ID_in_brackets(row):
    text = re.sub(r'\(.*\)', '', row)
    return text

df['New_Col'] = df['New'].apply(grab_id)
df['New'] = df['New'].apply(remove_ID_in_brackets)

This is what df looks like now:

[Image: the resulting DataFrame]
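The bracket-removal step can probably also be written without apply, using the vectorized str.replace with regex=True. A minimal sketch, assuming every bracketed part should be dropped entirely and that each value contains at most one bracketed part (the greedy .* would otherwise remove everything between the first opening and the last closing bracket):

```python
import pandas as pd

df = pd.DataFrame({
    'New': ['emerald shines bright(ABCED ID - 1234556)',
            'honey in the bread(ABCED ID - 123467890)',
            'http/ABCED/id/234555',
            'healing strenght(AxYBD ID -1234556)'],
    'UI': ['AOT', 'BOT', 'LOV', 'HAP'],
})

# Remove everything from the first '(' to the last ')' in each value.
# Values without brackets (the http row) are left unchanged.
df['New'] = df['New'].str.replace(r'\(.*\)', '', regex=True)

print(df['New'].tolist())
```

If the IDs are still needed, they have to be extracted before this step, since the brackets are gone afterwards.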

A uj5u.com user replied:

You can use

import pandas as pd
df = pd.DataFrame({'New' : ['emerald shines bright(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)'], 'UI': ['AOT', 'BOT', 'LOV', 'HAP']})
df['New_col'] = df['New'].str.extract(r'.*(?:\(\D*|http\S*/id/)(\d+)',expand=False)

Output:

>>> print(df.to_string())
                                         New   UI    New_col
0  emerald shines bright(ABCED ID - 1234556)  AOT    1234556
1   honey in the bread(ABCED ID - 123467890)  BOT  123467890
2                       http/ABCED/id/234555  LOV     234555
3        healing strenght(AxYBD ID -1234556)  HAP    1234556

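Details:

.* - any zero or more characters other than line break characters, as many as possible
(?:\(\D*|http\S*/id/) - either a ( followed by zero or more non-digit characters, or http followed by zero or more non-whitespace characters and then /id/
(\d+) - Group 1: one or more digits.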
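To see how the pieces described above behave on individual strings, the same pattern can be tried with Python's re module outside of pandas. This is just an illustrative sketch; the names pattern, samples and ids are made up for the example.

```python
import re

# Same pattern as in the answer: the non-capturing group selects the prefix
# (bracket case or http case), group 1 captures the digits that follow.
pattern = re.compile(r'.*(?:\(\D*|http\S*/id/)(\d+)')

samples = [
    'emerald shines bright(ABCED ID - 1234556)',
    'honey in the bread(ABCED ID - 123467890)',
    'http/ABCED/id/234555',
    'healing strenght(AxYBD ID -1234556)',
]

# group(1) is the ID for each sample string.
ids = [pattern.search(s).group(1) for s in samples]
print(ids)

# A string with neither a bracket nor an http prefix does not match at all;
# in pandas, str.extract would give NaN for such a row.
no_match = pattern.search('no id here')
print(no_match)
```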

A uj5u.com user replied:

Maybe this helps: r'[i,d,I,D]{2}.*?(\d.*?)\D'

Edit: it looks like you need the letters, not the digits: /?\(?(\w{5}) ?/?[i,d,I,D]{2}
