Extracting multiple substrings from a column in PySpark

I have a PySpark DataFrame with a single column, like this:

df = spark.createDataFrame(["This is AD185E000834", "U1JG97297 And ODNO926902 etc.","DIHK2975290;HI22K2390279; DSM928HK08", "there is nothing here."], "string").toDF("col1")

I want to extract the codes in col1 into additional columns, like this:

df.col2 = ["AD185E000834", "U1JG97297", "DIHK2975290", None]
df.col3 = [None, "ODNO926902", "HI22K2390279", None]
df.col4 = [None, None, "DSM928HK08", None]

Does anyone know how to do this? Thanks a lot.

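Answer from a uj5u.com user:

I'm sure this can be shortened; I've written it out long-hand to show my logic. It would be easier if you spelled out the extraction logic in your question.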

import pyspark.sql.functions as F

# Split each string into an array of tokens on whitespace or ';'
df1 = df.withColumn('k', F.split(F.col('col1'), r'\s|;')).withColumn('j', F.size('k'))

# Compute the maximum array length (measured before the filter below)
s = df1.agg(F.max('j').alias('max')).collect()[0][0]

df1 = (
    df1
    # Keep only tokens made up entirely of uppercase letters and digits
    .withColumn('k', F.expr("filter(k, x -> x rlike('^[A-Z0-9]+$'))"))
    # Convert the resulting array into a struct so it can be expanded into columns
    .withColumn('k', F.struct(*[F.col('k')[i].alias(f'col{i + 2}') for i in range(s)]))
)

# Expand the struct column in df1 and join back to df
df.join(df1.select('col1', 'k.*'), how='left', on='col1').show()

+--------------------+------------+------------+----------+----+
|                col1|        col2|        col3|      col4|col5|
+--------------------+------------+------------+----------+----+
|DIHK2975290;HI22K...| DIHK2975290|HI22K2390279|DSM928HK08|null|
|This is AD185E000834|AD185E000834|        null|      null|null|
|U1JG97297 And ODN...|   U1JG97297|  ODNO926902|      null|null|
|there is nothing ...|        null|        null|      null|null|
+--------------------+------------+------------+----------+----+
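
The split, filter, and pad logic above can be checked without a Spark session. The following is a minimal plain-Python sketch of the same steps, not the PySpark code itself; the helper names `extract_codes` and `pad` are made up for illustration, and Python's `re` is assumed to behave like Spark's Java regex for these simple patterns. It also shows why the output has an all-null col5: the width is measured on the unfiltered token arrays (4 tokens at most), while no row holds more than 3 codes.

```python
import re

rows = [
    "This is AD185E000834",
    "U1JG97297 And ODNO926902 etc.",
    "DIHK2975290;HI22K2390279; DSM928HK08",
    "there is nothing here.",
]

def extract_codes(text):
    """Split on whitespace or ';' and keep tokens made only of A-Z and 0-9."""
    tokens = re.split(r"\s|;", text)
    return [t for t in tokens if re.fullmatch(r"[A-Z0-9]+", t)]

def pad(codes, width):
    """Right-pad a list of codes with None so every row has the same width."""
    return codes + [None] * (width - len(codes))

codes_per_row = [extract_codes(r) for r in rows]

# Width measured before filtering (as in the answer) vs. after filtering
width_before_filter = max(len(re.split(r"\s|;", r)) for r in rows)
width_after_filter = max(len(c) for c in codes_per_row)

table = [pad(c, width_after_filter) for c in codes_per_row]
for source, codes in zip(rows, table):
    print(source, "->", codes)
```

In the PySpark version, computing the maximum size after the filter step instead of before it should drop the empty col5 in the same way.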

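Answer from a uj5u.com user:

As you said in the comments, we assume here that your "codes" are strings of at least two characters made up of uppercase letters and digits.

That said, starting with Spark 3.1 you can use the regexp_extract_all function inside expr to create a temporary array column holding all the codes, and then dynamically create one column per array entry.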

import pyspark.sql.functions as F

# create an array with all the identified "codes"
new_df = df.withColumn('myarray', F.expr("regexp_extract_all(col1, '([A-Z0-9]{2,})', 1)"))

# find the maximum amount of codes identified in a single string
max_array_length = new_df.withColumn('array_length', F.size('myarray')).agg({'array_length': 'max'}).collect()[0][0]
print('Max array length: {}'.format(max_array_length))

# expand the array into multiple columns
new_df.select('col1', *[new_df.myarray[i].alias('col' + str(i + 2)) for i in range(max_array_length)]) \
  .show(truncate=False)



Max array length: 3
+------------------------------------+------------+------------+----------+
|col1                                |col2        |col3        |col4      |
+------------------------------------+------------+------------+----------+
|This is AD185E000834                |AD185E000834|null        |null      |
|U1JG97297 And ODNO926902 etc.       |U1JG97297   |ODNO926902  |null      |
|DIHK2975290;HI22K2390279; DSM928HK08|DIHK2975290 |HI22K2390279|DSM928HK08|
|there is nothing here.              |null        |null        |null      |
+------------------------------------+------------+------------+----------+
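
To see what the pattern `[A-Z0-9]{2,}` picks up, it can be tried locally with Python's `re.findall`. This is only a rough sketch: it assumes Python's regex engine agrees with Spark's for this pattern. The sentence in `other_text` and the tighter `STRICT_PATTERN` are made up for illustration and are not part of the answer. The stricter variant additionally requires at least one letter and at least one digit, which avoids matching ordinary uppercase words or bare numbers.

```python
import re

# Pattern from the answer: runs of two or more uppercase letters/digits
PATTERN = r"[A-Z0-9]{2,}"

# Hypothetical tightening: whole token, with at least one letter AND one digit
STRICT_PATTERN = r"\b(?=[A-Z0-9]*[A-Z])(?=[A-Z0-9]*[0-9])[A-Z0-9]{2,}\b"

sample = "U1JG97297 And ODNO926902 etc."
other_text = "Order ID 2024 shipped"  # made-up sentence with no real codes

matches = re.findall(PATTERN, sample)
loose_hits = re.findall(PATTERN, other_text)
strict_matches = re.findall(STRICT_PATTERN, sample)
strict_hits = re.findall(STRICT_PATTERN, other_text)

print(matches)         # codes found in the sample row
print(loose_hits)      # the loose pattern also picks up "ID" and "2024"
print(strict_matches)  # the strict pattern still finds both codes
print(strict_hits)     # and finds nothing in the made-up sentence
```

Whether the looser or stricter pattern is right depends on what the real data looks like, so it is worth testing against a sample of actual rows first.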

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/456721.html
