我有一個用例,其中多個鍵以 JSON 格式分布在資料集中,需要將其聚合到合并的結果集中以進行進一步處理。
我已經能夠開發出一種代碼結構,可以同時使用Python API(PySpark) &來實作它Spark SQL,但是后者涉及到更復雜和更遲的執行程序,并且涉及到將來可能導致錯誤的中間對話。
使用下面的代碼片段,有沒有更好的方法來實作這個 using Spark SQL,通過創建一個MAP<STRING,ARRAY<STRING>using keyand value?
資料準備
from pyspark.sql.types import *
import pandas as pd
from io import StringIO
s = StringIO("""
id|json_struct
1|{"a":["tyeqb","",""],"e":["qwrqc","",""]}
1|{"t":["sartq","",""],"r":["fsafsq","",""]}
1|{"b":["puhqiqh","",""],"e":["hjfsaj","",""]}
2|{"b":["basajhjwa","",""],"e":["asfafas","",""]}
2|{"n":["gaswq","",""],"r":["sar","",""],"l":["sar","",""],"s":["rqqrq","",""],"m":["wrqwrq","",""]}
2|{"s":["tqqwjh","",""],"t":["afs","",""],"l":["fsaafs","",""]}
""")
df = pd.read_csv(s,delimiter='|')
sparkDF = spark.createDataFrame(df)
sparkDF.registerTempTable("INPUT")
sparkDF = sparkDF.withColumn('json_struct', F.from_json(F.col('json_struct')
,schema=MapType(StringType(),ArrayType(StringType()),True)
))
sparkDF.show(truncate=False)
--- ---------------------------------------------------------------------------------------
|id |json_struct |
--- ---------------------------------------------------------------------------------------
|1 |{a -> [tyeqb, , ], e -> [qwrqc, , ]} |
|1 |{t -> [sartq, , ], r -> [fsafsq, , ]} |
|1 |{b -> [puhqiqh, , ], e -> [hjfsaj, , ]} |
|2 |{b -> [basajhjwa, , ], e -> [asfafas, , ]} |
|2 |{n -> [gaswq, , ], r -> [sar, , ], l -> [sar, , ], s -> [rqqrq, , ], m -> [wrqwrq, , ]}|
|2 |{s -> [tqqwjh, , ], t -> [afs, , ], l -> [fsaafs, , ]} |
--- ---------------------------------------------------------------------------------------
Python API (PySpark) - 實作
如您所見,結果keyfromexplode本身就是一個STRING型別,并且由于 PySpark 有create_map,它在 中不可用Spark SQL,它可以很容易地用于生成最終json_struct列,確保具有可變長度ARRAYTYPE<STRING>值的單個鍵
sparkDF.select(
F.col('id')
,F.explode(F.col('json_struct'))
).withColumn('value',F.filter(F.col('value'), lambda x: x != '')\
).withColumn('value',F.concat_ws(',', F.col('value'))\
).groupBy('id', 'key'
).agg(F.collect_set(F.col('value')).alias('value')\
).withColumn('json_struct',F.to_json(F.create_map("key","value"))
).orderBy('id'
).show(truncate=False)
--- --- --------------- ------------------------
|id |key|value |json_struct |
--- --- --------------- ------------------------
|1 |a |[tyeqb] |{"a":["tyeqb"]} |
|1 |e |[hjfsaj, qwrqc]|{"e":["hjfsaj","qwrqc"]}|
|1 |r |[fsafsq] |{"r":["fsafsq"]} |
|1 |b |[puhqiqh] |{"b":["puhqiqh"]} |
|1 |t |[sartq] |{"t":["sartq"]} |
|2 |b |[basajhjwa] |{"b":["basajhjwa"]} |
|2 |n |[gaswq] |{"n":["gaswq"]} |
|2 |t |[afs] |{"t":["afs"]} |
|2 |s |[tqqwjh, rqqrq]|{"s":["tqqwjh","rqqrq"]}|
|2 |e |[asfafas] |{"e":["asfafas"]} |
|2 |l |[sar, fsaafs] |{"l":["sar","fsaafs"]} |
|2 |r |[sar] |{"r":["sar"]} |
|2 |m |[wrqwrq] |{"m":["wrqwrq"]} |
--- --- --------------- ------------------------
Spark SQL - 實作
在這個實作中,我必須采取額外的步驟來確保key和value列的ARRAYTYPE長度與map_from_arrays將陣列作為輸入的長度一致。
有沒有辦法繞過這些并創建一個類似的模式,如使用描述的那樣Python API?
sql.sql("""
SELECT
id,
KEY,
VALUE,
TO_JSON(MAP_FROM_ARRAYS(KEY,VALUE)) as json_struct
FROM (
SELECT
id,
key,
ARRAY(COLLECT_SET( value )) as value -- <------- ### Ensuring Value is NESTED ARRAY
FROM (
SELECT
id,
SPLIT(k,'|',1) as key, -- <------- ### Ensuring Key is Array
CONCAT_WS(',',FILTER(v,x -> x != '')) as value
FROM (
SELECT
id,
EXPLODE(FROM_JSON(json_struct,'MAP<STRING,ARRAY<STRING>>')) as (k,v)
FROM INPUT
)
)
GROUP BY 1,2
)
ORDER BY 1
""").show(truncate=False)
--- --- ----------------- ------------------------
|id |KEY|VALUE |json_struct |
--- --- ----------------- ------------------------
|1 |[a]|[[tyeqb]] |{"a":["tyeqb"]} |
|1 |[e]|[[hjfsaj, qwrqc]]|{"e":["hjfsaj","qwrqc"]}|
|1 |[b]|[[puhqiqh]] |{"b":["puhqiqh"]} |
|1 |[r]|[[fsafsq]] |{"r":["fsafsq"]} |
|1 |[t]|[[sartq]] |{"t":["sartq"]} |
|2 |[n]|[[gaswq]] |{"n":["gaswq"]} |
|2 |[b]|[[basajhjwa]] |{"b":["basajhjwa"]} |
|2 |[t]|[[afs]] |{"t":["afs"]} |
|2 |[s]|[[tqqwjh, rqqrq]]|{"s":["tqqwjh","rqqrq"]}|
|2 |[e]|[[asfafas]] |{"e":["asfafas"]} |
|2 |[l]|[[sar, fsaafs]] |{"l":["sar","fsaafs"]} |
|2 |[r]|[[sar]] |{"r":["sar"]} |
|2 |[m]|[[wrqwrq]] |{"m":["wrqwrq"]} |
--- --- ----------------- ------------------------
uj5u.com熱心網友回復:
Spark SQL 而不是create_maphas map. 您的 PySpark 代碼可以翻譯成這樣:
df = spark.sql("""
WITH
TBL2 (SELECT id, EXPLODE(FROM_JSON(json_struct,'MAP<STRING,ARRAY<STRING>>')) from INPUT),
TBL3 (SELECT id, key, FLATTEN(COLLECT_SET(FILTER(value, x -> x != ''))) value
FROM TBL2
GROUP BY id, key)
SELECT *, TO_JSON(MAP(key, value)) json_struct
FROM TBL3
""")
df.show(truncate=0)
# --- --- --------------- ------------------------
# |id |key|value |json_struct |
# --- --- --------------- ------------------------
# |1 |a |[tyeqb] |{"a":["tyeqb"]} |
# |1 |e |[qwrqc, hjfsaj]|{"e":["qwrqc","hjfsaj"]}|
# |1 |b |[puhqiqh] |{"b":["puhqiqh"]} |
# |1 |r |[fsafsq] |{"r":["fsafsq"]} |
# |1 |t |[sartq] |{"t":["sartq"]} |
# |2 |b |[basajhjwa] |{"b":["basajhjwa"]} |
# |2 |n |[gaswq] |{"n":["gaswq"]} |
# |2 |s |[rqqrq, tqqwjh]|{"s":["rqqrq","tqqwjh"]}|
# |2 |t |[afs] |{"t":["afs"]} |
# |2 |e |[asfafas] |{"e":["asfafas"]} |
# |2 |l |[fsaafs, sar] |{"l":["fsaafs","sar"]} |
# |2 |r |[sar] |{"r":["sar"]} |
# |2 |m |[wrqwrq] |{"m":["wrqwrq"]} |
# --- --- --------------- ------------------------
轉載請註明出處,本文鏈接:https://www.uj5u.com/qukuanlian/525644.html
標籤:sql阿帕奇火花字典pysparkapache-spark-sql
