Extract emails and create a new email column in a PySpark DataFrame

I have a df column that contains emails along with extra information I don't want. Here are some examples:

                                       Email_Col
"Snow, John" <[email protected]>, "Stark, Arya" <[email protected]>
"YourBoss" <[email protected]>
"test1 <[email protected]>", "test2 <[email protected]>", "test3" <[email protected]>

I need to clean the column or create a new column containing only the emails. Here is the expected output, an array column:

                           New_Email_Col
[[email protected], [email protected]]
[[email protected]]
[[email protected], [email protected], [email protected]]

My code:

import re

def extract(col):
    for row in col:
        all_matches = re.findall(r'\w+.\w+@\w+.\w+', row)
    return all_matches

extract_udf = udf(lambda col: extract(col), ArrayType(StringType()))

df = df.withColumn(('emails'), extract_udf(col('to')))

My error:

PythonException: 'TypeError: expected string or bytes-like object', from line 4. Full traceback below

Answer:

Please don't use a udf: they are slow and, nowadays, unnecessary in the vast majority of cases. The following does the trick:

F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)")

regexp_extract_all is available as a SQL function from Spark 3.1, which is why it is called through F.expr here.
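
To see what the pattern '(?<=<).*?(?=>)' matches, here is a small sketch using Python's re module, which accepts the same lookbehind/lookahead syntax. The sample addresses are hypothetical placeholders shaped like the rows in the question, and Spark itself uses Java's regex engine, so treat this only as an illustration of the pattern, not of Spark's behavior in every edge case.

```python
import re

# (?<=<) requires a '<' immediately before the match, (?=>) a '>' immediately
# after it; .*? lazily matches everything in between.
PATTERN = r'(?<=<).*?(?=>)'

# Hypothetical sample values shaped like the rows in Email_Col.
rows = [
    '"Snow, John" <john.snow@example.com>, "Stark, Arya" <arya.stark@example.com>',
    '"YourBoss" <boss@example.com>',
    '"test1 <t1@example.com>", "test2 <t2@example.com>", "test3" <t3@example.com>',
]

# One list of addresses per input row, like the array column in the answer.
extracted = [re.findall(PATTERN, row) for row in rows]
for emails in extracted:
    print(emails)
```

Because the pattern keys on the angle brackets rather than on the shape of an email address, the quoting and the display names around each address do not matter.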

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('''"Snow, John" <[email protected]>, "Stark, Arya" <[email protected]>''',),
     ('''"YourBoss" <[email protected]>''',),
     ('''"test1 <[email protected]>", "test2 <[email protected]>", "test3" <[email protected]>''',)],
    ['Email_Col'])

df = df.withColumn('Email_Col', F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)"))

df.show(truncate=0)
# +--------------------------------------------------------------------+
# |Email_Col                                                           |
# +--------------------------------------------------------------------+
# |[[email protected], [email protected]]                    |
# |[[email protected]]                                 |
# |[[email protected], [email protected], [email protected]]|
# +--------------------------------------------------------------------+

To add it as a separate new column instead:

df = df.withColumn('New_Email_Col', F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)"))
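
If regexp_extract_all is not available (Spark older than 3.1), a UDF may still be the practical fallback. The sketch below covers only the plain-Python part of such a fallback; the function name extract_emails and the sample address are hypothetical. Unlike the function in the question, it applies re.findall to the whole cell value instead of looping over it, and it returns an empty list for None rather than raising.

```python
import re
from typing import List, Optional


def extract_emails(value: Optional[str]) -> List[str]:
    """Return every substring enclosed in <...> from one cell value."""
    if value is None:
        # A null cell arrives as None; passing it to re.findall raises TypeError.
        return []
    return re.findall(r'(?<=<).*?(?=>)', value)


print(extract_emails('"YourBoss" <boss@example.com>'))
print(extract_emails(None))
```

It could then be wrapped with udf(extract_emails, ArrayType(StringType())) and applied through withColumn, at the cost of the slower row-by-row Python execution mentioned above.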
