Creating columns from an array of structs in PySpark

I am new to data processing. I have a deeply nested dataset whose schema looks roughly like this:

 |-- col1 : string
 |-- col2 : string
 |-- col3: struct
 |    |-- something : string
 |    |-- elem: array
 |    |    |-- registrationNumber: struct
 |    |    |     |-- registrationNumber : string
 |    |    |     |-- registrationNumberType : string
 |    |    |     |-- registrationCode : int

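For the array, I receive something like the following. Keep in mind that its length is variable: I might receive no values at all, or 10, or even more.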

[
  {
    registrationNumber : 123456789
    registrationNumberType : VAT
    registrationCode : 1234
  },
  {
    registrationNumber : ABCDERTYU
    registrationNumberType : fiscal1
    registrationCode : 9876
  },
  {
    registrationNumber : 123456789
    registrationNumberType : foo
    registrationCode : 8765
  }
]

Is there a way to transform the schema into:

 |-- col1 : string
 |-- col2 : string
 |-- col3: struct
 |    |-- something : string
 |    |-- VAT: string
 |    |-- fiscal1: string

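where the values of VAT and fiscal1 are the corresponding registrationNumber values. Basically, I need one column named VAT and one named fiscal1, each holding the registration number of that type.

Thank you very much.

Edit:

Here is an example JSON for col3: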

{
        "col3": {
            "somestring": "xxxxxx",
            "registrationNumbers": [
              {
                "registrationNumber" : "something",
                "registrationNumberType" : "VAT"
              },
              {
                "registrationNumber" : "somethingelse",
                "registrationNumberType" : "fiscal1"
              },
              {
                "registrationNumber" : "something i dont need",
                "registrationNumberType" : "fiscal2"
              }
            ]
        }
}

And this is what I want:

{
        "col3": {
            "somestring": "xxxxxx",
            "VAT" : "something",
            "fiscal1" : "somethingelse"
        }
}
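To pin down the intended reshaping independently of Spark, here is a minimal plain-Python sketch that operates on one parsed record. The function name `flatten_registration_numbers` and the constant `WANTED_TYPES` are hypothetical. It assumes that only VAT and fiscal1 are kept, that the first occurrence of a type wins, and that a missing type becomes None. It is meant as a reference for checking results, not as a Spark solution.

```python
import json

# Assumption: only these registration types become fields of col3.
WANTED_TYPES = ("VAT", "fiscal1")


def flatten_registration_numbers(record, wanted=WANTED_TYPES):
    """Return a copy of `record` whose col3 has one field per wanted type."""
    col3 = record["col3"]
    flat = {"somestring": col3.get("somestring")}
    # Start every wanted field as None so the output shape is always the same.
    for reg_type in wanted:
        flat[reg_type] = None
    # The array may be missing, null, or empty; treat all three as "no entries".
    for entry in col3.get("registrationNumbers") or []:
        reg_type = entry.get("registrationNumberType")
        # Keep the first registration number seen for each wanted type.
        if reg_type in wanted and flat[reg_type] is None:
            flat[reg_type] = entry.get("registrationNumber")
    return {**record, "col3": flat}


sample = {
    "col3": {
        "somestring": "xxxxxx",
        "registrationNumbers": [
            {"registrationNumber": "something", "registrationNumberType": "VAT"},
            {"registrationNumber": "somethingelse", "registrationNumberType": "fiscal1"},
            {"registrationNumber": "something i dont need", "registrationNumberType": "fiscal2"},
        ],
    }
}

result = flatten_registration_numbers(sample)
print(json.dumps(result, indent=2))
```

Defaulting absent types to None mirrors the nullable string fields in the target schema, so records with an empty array still produce a col3 of the same shape.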

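Maybe I could create a separate DataFrame from the array plus a primary key, build the VAT and fiscal1 columns by selecting the matching data from that new DataFrame, and finally join the two DataFrames on the primary key?

Answer from a uj5u.com user:

You can use the inline function to explode the col3.registrationNumbers array and expand its struct elements into columns, then keep only the rows whose registrationNumberType is VAT or fiscal1, and pivot on that column. After pivoting, rebuild the struct column col3 from the pivoted VAT and fiscal1 columns: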

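from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuse the active session (e.g. in a notebook or the pyspark shell) or start one.
spark = SparkSession.builder.getOrCreate()

exampleJSON = '{"col1":"col1_XX","col2":"col2_XX","col3":{"somestring":"xxxxxx","registrationNumbers":[{"registrationNumber":"something","registrationNumberType":"VAT"},{"registrationNumber":"somethingelse","registrationNumberType":"fiscal1"},{"registrationNumber":"something i dont need","registrationNumberType":"fiscal2"}]}}'
df = spark.read.json(spark.sparkContext.parallelize([exampleJSON]))

# inline() emits one row per array element, with one column per struct field.
# Listing the pivot values guarantees that both the VAT and fiscal1 columns exist.
# Note: rows with neither a VAT nor a fiscal1 entry are dropped by the filter.
df1 = df.selectExpr("*", "inline(col3.registrationNumbers)") \
    .filter(F.col("registrationNumberType").isin(["VAT", "fiscal1"])) \
    .groupBy("col1", "col2", "col3") \
    .pivot("registrationNumberType", ["VAT", "fiscal1"]) \
    .agg(F.first("registrationNumber")) \
    .withColumn("col3", F.struct(F.col("col3.somestring"), F.col("VAT"), F.col("fiscal1"))) \
    .drop("VAT", "fiscal1")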

df1.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: struct (nullable = false)
# |    |-- somestring: string (nullable = true)
# |    |-- VAT: string (nullable = true)
# |    |-- fiscal1: string (nullable = true)

df1.show(truncate=False)
#+-------+-------+----------------------------------+
#|col1   |col2   |col3                              |
#+-------+-------+----------------------------------+
#|col1_XX|col2_XX|{xxxxxx, something, somethingelse}|
#+-------+-------+----------------------------------+

