PySpark: convert an entire DataFrame column to a JSON object before inserting it into a database

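At this point my knowledge of pyspark is very limited, so I'm looking for a quick solution to a problem in my current implementation. I'm trying to read a JSON file into a dataframe with pyspark and convert it into objects that can be inserted into a database table (DynamoDB). The columns of the table should correspond to the keys specified in the JSON file. For example, suppose my JSON file contains the following elements: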

{
   "Records":[
      {
         "column1":"Value1",
         "column2":"Value2",
         "column3":"Value3",
         "column4":{
            "sub1":"Value4",
            "sub2":"Value5",
            "sub3":{
               "sub4":"Value6",
               "sub5":"Value7"
            }
         }
      },
      {
         "column1":"Value8",
         "column2":"Value9",
         "column3":"Value10",
         "column4":{
            "sub1":"Value11",
            "sub2":"Value12",
            "sub3":{
               "sub4":"Value13",
               "sub5":"Value14"
            }
         }
      }
   ]
}

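The columns in the database table are column1, column2, column3 and column4. Because column4 is of type Map, I need to convert the whole object to a string before inserting it into the database. So for the first row, I would expect this column to contain: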

{
   "sub1":"Value4",
   "sub2":"Value5",
   "sub3":{
      "sub4":"Value6",
      "sub5":"Value7"
   }
}

However, this is what I see in the database table after running the script:

{ Value4, Value5, { Value6, Value7 }}
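
The gap between the expected and the observed value can be checked without Spark. The following is a minimal plain-Python sketch, not part of the original script: it builds the string column4 is expected to hold using the standard json module, and shows that the observed string is no longer valid JSON because the keys and quotes are gone. The helper name is_valid_json is hypothetical.

```python
import json

# Nested value of column4 for the first record, copied from the sample file.
column4 = {
    "sub1": "Value4",
    "sub2": "Value5",
    "sub3": {"sub4": "Value6", "sub5": "Value7"},
}

# What the column should hold: a compact JSON string that keeps the keys.
expected = json.dumps(column4, separators=(",", ":"))
print(expected)

# What the table actually holds after the plain string cast.
observed = "{ Value4, Value5, { Value6, Value7 }}"


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(expected))
print(is_valid_json(observed))
```

The expected string can be parsed back into the nested object, while the observed one cannot, so the key names have to be preserved before the value is turned into a string.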

I understand this happens because column4 needs some extra handling before all column values are converted to type String ahead of the DB insertion:

# T is pyspark.sql.types, imported in the full script below
for col in Rows.columns:
    Rows = Rows.withColumn(col, Rows[col].cast(T.StringType()))

I'm looking for a way to make the contents of column4 represent the original JSON object before they are converted to type String. Here is what I've written so far (DB insertion excluded):

import pyspark.sql.types as T
from pyspark.sql import functions as SF

df = spark.read.option("multiline", "true").json('/home/abhishek.tirkey/Documents/test')

Records = df.withColumn("Records", SF.explode(SF.col("Records")))

Rows = Records.select(
    "Records.column1",
    "Records.column2",
    "Records.column3",
    "Records.column4",
)

# Cast every column to string; StringType lives in the module imported as T
for col in Rows.columns:
    Rows = Rows.withColumn(col, Rows[col].cast(T.StringType()))

RowsJSON = Rows.toJSON()

Answer:

There is a to_json function that does exactly this:

from pyspark.sql import functions as F

df = df.withColumn("record", F.explode("Records")).select(
    "record.column1",
    "record.column2",
    "record.column3",
    F.to_json("record.column4").alias("column4"),
)

df.show()
+-------+-------+-------+--------------------+
|column1|column2|column3|             column4|
+-------+-------+-------+--------------------+
| Value1| Value2| Value3|{"sub1":"Value4",...|
| Value8| Value9|Value10|{"sub1":"Value11"...|
+-------+-------+-------+--------------------+

df.printSchema()
root
 |-- column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- column3: string (nullable = true)
 |-- column4: string (nullable = true)
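
With every column now a string, each row maps directly onto a flat DynamoDB item. The following is a minimal plain-Python sketch under stated assumptions: rows have been brought to the driver as dictionaries (for example with Row.asDict()), and items use the low-level DynamoDB attribute-value format, in which a string is written as {"S": value}. The helper name to_dynamodb_item is hypothetical, the sample row is what the first record is expected to look like after to_json, and the actual write call is left out, as in the question.

```python
import json


def to_dynamodb_item(row):
    """Wrap each string column as a DynamoDB string attribute (hypothetical helper).

    Columns whose value is None are skipped, since the schema marks
    every column as nullable.
    """
    return {
        name: {"S": value}
        for name, value in row.items()
        if value is not None
    }


# First row as it is expected to look after F.to_json was applied to column4.
row = {
    "column1": "Value1",
    "column2": "Value2",
    "column3": "Value3",
    "column4": '{"sub1":"Value4","sub2":"Value5","sub3":{"sub4":"Value6","sub5":"Value7"}}',
}

item = to_dynamodb_item(row)
print(item["column4"]["S"])

# Reading the value back: the stored string parses into the original nested object.
restored = json.loads(item["column4"]["S"])
print(restored["sub3"]["sub5"])
```

Because column4 is stored as a JSON string, a consumer of the table can recover the nested structure with a single json.loads call, which is not possible with the output of a plain string cast.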
