Extracting values from a JSON list in PySpark

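I have a DataFrame in which one column holds a list of JSON objects. I want to extract one specific value (the score) from each object in that column and put each one in its own column.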

import pandas as pd

raw_data = [{"user_id": 1234, "col": [
    {"id": 14577120145280, "score": 64.71, "Elastic_position": 0},
    {"id": 14568530280240, "score": 88.53, "Elastic_position": 1},
    {"id": 14568530119661, "score": 63.75, "Elastic_position": 2},
    {"id": 14568530205858, "score": 62.79, "Elastic_position": 3},
    {"id": 14568530414899, "score": 60.88, "Elastic_position": 4},
]}]

df = pd.DataFrame.from_dict(raw_data)

I want to explode this DataFrame so that each score ends up in its own column.

A uj5u.com user replied:

Suppose your JSON file looks like this:

# a.json
# {
#   "user_id": 1234,
#   "col": [
#     {"id": 14577120145280, "score": 64.71, "Elastic_position": 0},
#     {"id": 14568530280240, "score": 88.53, "Elastic_position": 1},
#     {"id": 14568530119661, "score": 63.75, "Elastic_position": 2},
#     {"id": 14568530205858, "score": 62.79, "Elastic_position": 3},
#     {"id": 14568530414899, "score": 60.88, "Elastic_position": 4}
#   ]
# }

You can read it, flatten it, and then pivot it like this:

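from pyspark.sql import functions as F, types as T

# Explicit schema: an array of structs, one struct per JSON object in "col".
schema = T.StructType([
    T.StructField('user_id', T.IntegerType()),
    T.StructField('col', T.ArrayType(T.StructType([
        T.StructField('id', T.LongType()),
        T.StructField('score', T.DoubleType()),
        T.StructField('Elastic_position', T.IntegerType()),
    ]))),
])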

df = spark.read.json('a.json', multiLine=True, schema=schema)
df.show(10, False)
# +-------+--------------------------------------------------------------------------------------------------------------------------------------------+
# |user_id|col                                                                                                                                         |
# +-------+--------------------------------------------------------------------------------------------------------------------------------------------+
# |1234   |[{14577120145280, 64.71, 0}, {14568530280240, 88.53, 1}, {14568530119661, 63.75, 2}, {14568530205858, 62.79, 3}, {14568530414899, 60.88, 4}]|
# +-------+--------------------------------------------------------------------------------------------------------------------------------------------+


df.printSchema()
# root
#  |-- user_id: integer (nullable = true)
#  |-- col: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- id: long (nullable = true)
#  |    |    |-- score: double (nullable = true)
#  |    |    |-- Elastic_position: integer (nullable = true)

# F.explode names its output column "col" by default, which is why
# 'col.Elastic_position' and 'col.score' resolve in the steps below.
(df
    .select('user_id', F.explode('col'))
    .groupBy('user_id')
    .pivot('col.Elastic_position')
    .agg(F.first('col.score'))
    .show(10, False)
)

# Output:
# +-------+-----+-----+-----+-----+-----+
# |user_id|0    |1    |2    |3    |4    |
# +-------+-----+-----+-----+-----+-----+
# |1234   |64.71|88.53|63.75|62.79|60.88|
# +-------+-----+-----+-----+-----+-----+

A uj5u.com user replied:

Try using pd.Series.explode together with groupby:

df = pd.DataFrame.from_dict(raw_data).explode('col')

# The walrus operator (:=) requires Python 3.8 or later.
result = (
    df.assign(col=df['col'].str['score'])
      .groupby('user_id')
      .agg(list)
      .apply(lambda x: (y := x.explode()).set_axis(
          y.index + '_' + y.groupby(level=0).cumcount().astype(str)), axis=1)
      .reset_index()
)
print(result)

   user_id  col_0  col_1  col_2  col_3  col_4
0     1234  64.71  88.53  63.75  62.79  60.88

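This first builds a DataFrame and explodes the col column, then groups by the repeated user_id values and applies a second explode to go from long to wide, using cumcount to add the suffixes 0 to 4 to the column names.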
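The long-to-wide step described above can also be written with DataFrame.pivot, which may be easier to follow than the apply/set_axis chain. This is a minimal sketch, assuming the same raw_data shape as in the question; the names long_df, pos, and wide are illustrative and do not come from the original answer.

```python
import pandas as pd

raw_data = [{"user_id": 1234, "col": [
    {"id": 14577120145280, "score": 64.71, "Elastic_position": 0},
    {"id": 14568530280240, "score": 88.53, "Elastic_position": 1},
    {"id": 14568530119661, "score": 63.75, "Elastic_position": 2},
    {"id": 14568530205858, "score": 62.79, "Elastic_position": 3},
    {"id": 14568530414899, "score": 60.88, "Elastic_position": 4},
]}]

# Long format: one row per dict in "col", with a fresh unique index.
long_df = pd.DataFrame(raw_data).explode("col").reset_index(drop=True)

# Pull the score out of each dict.
long_df["score"] = long_df["col"].map(lambda item: item["score"])

# Number the scores 0, 1, 2, ... within each user_id.
long_df["pos"] = long_df.groupby("user_id").cumcount()

# Wide format: one column per position, named col_0 .. col_4.
wide = (
    long_df.pivot(index="user_id", columns="pos", values="score")
           .add_prefix("col_")
           .reset_index()
)
print(wide)
```

Here the position comes from cumcount, as in the answer; if the order of the list cannot be trusted, the Elastic_position field of each dict could be used as the pivot column instead.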

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/332436.html
