How to expand nested JSON into a Spark DataFrame on AWS Glue

I am working with the following marketing JSON file:

{
    "request_id": "xx",
    "timeseries_stats": [
        {
            "timeseries_stat": {
                "id": "xx",
                "timeseries": [
                    {
                        "start_time": "xx",
                        "end_time": "xx",
                        "stats": {
                            "impressions": xx,
                            "swipes": xx,
                            "view_completion": xx,
                            "spend": xx
                        }
                    },
                    {
                        "start_time": "xx",
                        "end_time": "xx",
                        "stats": {
                            "impressions": xx,
                            "swipes": xx,
                            "view_completion": xx,
                            "spend": xx
                        }
                    }

I can easily parse it with Pandas and get the desired DataFrame in this format:

start_time   end_time     impressions   swipes   view_completion    spend
    xx          xx             xx         xx            xx            xx
    xx          xx             xx         xx            xx            xx

But this needs to be done in Spark on AWS Glue.

After creating the initial Spark DataFrame (df) with

rdd = sc.parallelize(JSON_resp['timeseries_stats'][0]['timeseries_stat']['timeseries'])
df = rdd.toDF()

I tried to expand the stats key as follows:

df_expanded = df.select("start_time","end_time","stats.*")

Error:

AnalysisException: 'Can only star expand struct data types. 
Attribute: `ArrayBuffer(stats)`;'

I also tried using explode:

from pyspark.sql.functions import explode
df_expanded = df.select("start_time","end_time").withColumn("stats", explode(df.stats))

Error:

AnalysisException: 'The number of aliases supplied in the AS clause does not match the 
number of columns output by the UDTF expected 2 aliases but got stats ;

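I'm fairly new to Spark, so any help with either approach would be much appreciated!

Here is a very similar question:

Parse array of dictionaries from JSON with Spark

Except that I need to flatten this additional stats key.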

Answer from a uj5u.com user:

When you explode a map column, it produces two columns (key and value), so .withColumn does not work. Use explode inside a select statement instead.

from pyspark.sql import functions as f

df.select('start_time', 'end_time', f.explode('stats')) \
  .groupBy('start_time', 'end_time').pivot('key').agg(f.first('value')).show()

+----------+--------+-----------+-----+------+---------------+
|start_time|end_time|impressions|spend|swipes|view_completion|
+----------+--------+-----------+-----+------+---------------+
|        yy|      yy|         yy|   yy|    yy|             yy|
|        xx|      xx|         xx|   xx|    xx|             xx|
+----------+--------+-----------+-----+------+---------------+
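
As a possible alternative, the nested stats dictionary could be flattened in plain Python before the records are handed to Spark, which sidesteps the map column entirely. The sketch below is only an illustration under assumptions: `flatten_record` is a hypothetical helper, and the sample values are made up to stand in for the "xx" placeholders in the question.

```python
def flatten_record(record, nested_key="stats"):
    """Return a copy of record with the dict under nested_key merged into the top level."""
    # Keep every top-level field except the nested dict itself.
    flat = {k: v for k, v in record.items() if k != nested_key}
    # Copy the nested values up one level; a missing nested dict is treated as empty.
    # If a nested key has the same name as a top-level key, the nested value wins.
    flat.update(record.get(nested_key) or {})
    return flat


# Hypothetical sample values standing in for the "xx" placeholders in the question.
sample_timeseries = [
    {"start_time": "t0", "end_time": "t1",
     "stats": {"impressions": 10, "swipes": 2, "view_completion": 1, "spend": 500}},
    {"start_time": "t1", "end_time": "t2",
     "stats": {"impressions": 20, "swipes": 4, "view_completion": 3, "spend": 900}},
]

flat_rows = [flatten_record(r) for r in sample_timeseries]
```

The flattened list could then be passed to sc.parallelize(...) and toDF() in the same way as the original timeseries list in the question, with a final select to put the columns in the desired order. This only fits when the records are already in driver memory, as they are in the question.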
