I have a df column that contains email addresses along with extra information I don't want. Here are some examples:
Email_Col
"Snow, John" <[email protected]>, "Stark, Arya" <[email protected]>
"YourBoss" <[email protected]>
"test1 <[email protected]>", "test2 <[email protected]>", "test3" <[email protected]>
I need to either clean the column or create a new column containing only the emails. Here is the expected output, a column of arrays:
New_Email_Col
[[email protected], [email protected]]
[[email protected]]
[[email protected], [email protected], [email protected]]
My code:
import re
def extract(col):
    for row in col:
        all_matches = re.findall(r'\w+.\w+@\w+.\w+', row)
    return all_matches
extract_udf = udf(lambda col: extract(col), ArrayType(StringType()))
df = df.withColumn(('emails'), extract_udf(col('to')))
My error:
PythonException: 'TypeError: expected string or bytes-like object', from line 4. Full traceback below
Reply from a uj5u.com user:
Please don't use a udf: they are slow and, in the vast majority of cases, no longer necessary. This does the trick:
F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)")
regexp_extract_all is available as a SQL function from Spark 3.1, which is why it is called through F.expr here.
Full example:
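To see what the pattern itself does, here is a minimal sketch in plain Python. The `re` module accepts the same lookbehind/lookahead syntax for this particular pattern, so it can be tried without a Spark session. The addresses are made-up placeholders, not the ones from the question.

```python
import re

# (?<=<)  lookbehind: the match must be preceded by '<'
# .*?     lazy match: take as few characters as possible
# (?=>)   lookahead: the match must be followed by '>'
PATTERN = r'(?<=<).*?(?=>)'

# Hypothetical sample rows shaped like the ones in the question
rows = [
    '"Snow, John" <john@example.com>, "Stark, Arya" <arya@example.com>',
    '"test1 <t1@example.com>", "test3" <t3@example.com>',
]

# One list of matches per row, like the array column Spark produces
results = [re.findall(PATTERN, row) for row in rows]
for matches in results:
    print(matches)
```

Because the brackets are only asserted and never consumed, each match is just the text between `<` and `>`, and the display names are dropped.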
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('''"Snow, John" <[email protected]>, "Stark, Arya" <[email protected]>''',),
     ('''"YourBoss" <[email protected]>''',),
     ('''"test1 <[email protected]>", "test2 <[email protected]>", "test3" <[email protected]>''',)],
    ['Email_Col'])
df = df.withColumn('Email_Col', F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)"))
df.show(truncate=0)
# --------------------------------------------------------------------
# |Email_Col |
# --------------------------------------------------------------------
# |[[email protected], [email protected]] |
# |[[email protected]] |
# |[[email protected], [email protected], [email protected]]|
# --------------------------------------------------------------------
To add the result as a separate new column instead:
df = df.withColumn('New_Email_Col', F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)"))
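If you are stuck on a Spark version older than 3.1, a UDF may still be the fallback. Below is a rough sketch of a per-row function. It assumes the TypeError in the question comes from null cells, which the question does not confirm. The name `extract_emails` and the sample address are hypothetical.

```python
import re

# Same lookaround pattern as in the answer above
EMAIL_IN_BRACKETS = re.compile(r'(?<=<).*?(?=>)')

def extract_emails(value):
    """Return every substring enclosed in <...> for a single cell value."""
    # A UDF is called once per row with one value, not with the whole column.
    # Null cells arrive as None, which findall would reject with a TypeError.
    if value is None:
        return []
    return EMAIL_IN_BRACKETS.findall(value)

# With PySpark available, the function could be wrapped like this:
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import ArrayType, StringType
#   extract_udf = udf(extract_emails, ArrayType(StringType()))
#   df = df.withColumn('New_Email_Col', extract_udf(col('Email_Col')))

print(extract_emails('"YourBoss" <boss@example.com>'))
print(extract_emails(None))
```

The built-in regexp_extract_all remains the better choice wherever it is available, since it avoids moving every row through a Python worker.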
