I have a PySpark DataFrame with a single column, as shown below:
df = spark.createDataFrame(["This is AD185E000834", "U1JG97297 And ODNO926902 etc.","DIHK2975290;HI22K2390279; DSM928HK08", "there is nothing here."], "string").toDF("col1")
I want to extract the codes in col1 into separate columns, for example:
df.col2 = ["AD185E000834", "U1JG97297", "DIHK2975290", None]
df.col3 = [None, "ODNO926902", "HI22K2390279", None]
df.col4 = [None, None, "DSM928HK08", None]
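Does anyone know how to do this? Many thanks.
Reply from a uj5u.com user:
I'm sure this can be shortened; I've written it out longhand to show my logic. It would have been easier if you had spelled out the logic in your question.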
import pyspark.sql.functions as F

# Split each string into an array of tokens on whitespace or semicolons
df1 = df.withColumn('k', F.split(F.col('col1'), r'\s|;')).withColumn('j', F.size('k'))
# Maximum token count per row (taken before filtering, so it is an upper bound on the number of codes)
s = df1.agg(F.max('j').alias('max')).collect()[0][0]
df1 = (
    # Keep only tokens made up entirely of uppercase letters and digits
    df1.withColumn('k', F.expr("filter(k, x -> x rlike '^[A-Z0-9]+$')"))
    # Convert the resulting array into a struct so it can be expanded into columns
    .withColumn(
        'k',
        F.struct(*[
            F.col('k')[i].alias(f'col{i + 2}') for i in range(s)
        ])
    )
)
# Expand the struct column in df1 and join back to df
df.join(df1.select('col1', 'k.*'), how='left', on='col1').show()
+--------------------+------------+------------+----------+----+
|                col1|        col2|        col3|      col4|col5|
+--------------------+------------+------------+----------+----+
|DIHK2975290;HI22K...| DIHK2975290|HI22K2390279|DSM928HK08|null|
|This is AD185E000834|AD185E000834|        null|      null|null|
|U1JG97297 And ODN...|   U1JG97297|  ODNO926902|      null|null|
|there is nothing ...|        null|        null|      null|null|
+--------------------+------------+------------+----------+----+
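The col5 column in the output above is null in every row because the maximum length is taken before the tokens are filtered. As a rough illustration, here is a plain-Python sketch (no Spark involved) of the same split, filter, and pad steps, with the width computed after filtering instead. The helper name split_and_filter and the variable names are made up for this example and do not come from the answer.

```python
import re

rows = [
    "This is AD185E000834",
    "U1JG97297 And ODNO926902 etc.",
    "DIHK2975290;HI22K2390279; DSM928HK08",
    "there is nothing here.",
]

def split_and_filter(text):
    """Split on whitespace or ';' and keep tokens made only of A-Z and 0-9."""
    tokens = re.split(r"\s|;", text)
    return [t for t in tokens if re.fullmatch(r"[A-Z0-9]+", t)]

codes = [split_and_filter(r) for r in rows]
# Width is measured after filtering, so no all-null trailing column appears
width = max(len(c) for c in codes)
# Pad shorter rows with None so every row has the same number of entries
padded = [c + [None] * (width - len(c)) for c in codes]
for row in padded:
    print(row)
```

In the Spark version, the equivalent change would presumably be to compute the size of k after the filter step rather than before it.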
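Reply from a uj5u.com user:
As you said in the comments, here we assume that your "codes" are strings of at least two characters made up of uppercase letters and digits.
That said, since Spark 3.1 you can use the regexp_extract_all function through expr to create a temporary array column containing all the codes, and then dynamically create one column for each entry of the array.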
import pyspark.sql.functions as F
# create an array with all the identified "codes"
new_df = df.withColumn('myarray', F.expr("regexp_extract_all(col1, '([A-Z0-9]{2,})', 1)"))
# find the maximum amount of codes identified in a single string
max_array_length = new_df.withColumn('array_length', F.size('myarray')).agg({'array_length': 'max'}).collect()[0][0]
print('Max array length: {}'.format(max_array_length))
# explode the array in multiple columns
new_df.select('col1', *[new_df.myarray[i].alias('col' + str(i + 2)) for i in range(max_array_length)]) \
    .show(truncate=False)
Max array length: 3
+------------------------------------+------------+------------+----------+
|col1                                |col2        |col3        |col4      |
+------------------------------------+------------+------------+----------+
|This is AD185E000834                |AD185E000834|null        |null      |
|U1JG97297 And ODNO926902 etc.       |U1JG97297   |ODNO926902  |null      |
|DIHK2975290;HI22K2390279; DSM928HK08|DIHK2975290 |HI22K2390279|DSM928HK08|
|there is nothing here.              |null        |null        |null      |
+------------------------------------+------------+------------+----------+
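To see what the pattern [A-Z0-9]{2,} picks up before running it on a cluster, it can be tried locally with Python's re module. This is only a sketch: it assumes Python's regex engine treats this simple ASCII pattern the same way Spark's Java engine does, and the names CODE_PATTERN, samples, and extracted are invented for the example.

```python
import re

# Same pattern as in the regexp_extract_all call: runs of 2+ uppercase letters or digits
CODE_PATTERN = re.compile(r"[A-Z0-9]{2,}")

samples = [
    "This is AD185E000834",
    "U1JG97297 And ODNO926902 etc.",
    "DIHK2975290;HI22K2390279; DSM928HK08",
    "there is nothing here.",
]

# Map each input string to the list of codes found in it
extracted = {text: CODE_PATTERN.findall(text) for text in samples}
for text, codes in extracted.items():
    print(text, "->", codes)

# Caveat: runs made only of letters or only of digits also match
print(CODE_PATTERN.findall("USA 2024"))
```

Single capitals such as the "T" in "This" or the "A" in "And" are skipped because of the two-character minimum, but all-letter or all-digit runs such as "USA" or "2024" would be captured, so a stricter pattern may be needed if the real data contains them.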
