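I have a DataFrame in which one column holds a list of JSON objects. I want to extract one specific value (the score) from that column and put each score in its own column.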
import pandas as pd

raw_data = [{
    "user_id": 1234,
    "col": [
        {"id": 14577120145280, "score": 64.71, "Elastic_position": 0},
        {"id": 14568530280240, "score": 88.53, "Elastic_position": 1},
        {"id": 14568530119661, "score": 63.75, "Elastic_position": 2},
        {"id": 14568530205858, "score": 62.79, "Elastic_position": 3},
        {"id": 14568530414899, "score": 60.88, "Elastic_position": 4},
    ],
}]
df = pd.DataFrame.from_dict(raw_data)
I want to explode it so that my result DataFrame looks like this:
Answer from a uj5u.com user:
Assuming your JSON looks like this:
# a.json
# {
#   "user_id": 1234,
#   "col": [
#     {"id": 14577120145280, "score": 64.71, "Elastic_position": 0},
#     {"id": 14568530280240, "score": 88.53, "Elastic_position": 1},
#     {"id": 14568530119661, "score": 63.75, "Elastic_position": 2},
#     {"id": 14568530205858, "score": 62.79, "Elastic_position": 3},
#     {"id": 14568530414899, "score": 60.88, "Elastic_position": 4}
#   ]
# }
You can read it, flatten it, and then pivot it like this:
# Assumes an active SparkSession is available as `spark`.
from pyspark.sql import functions as F, types as T

schema = T.StructType([
    T.StructField('user_id', T.IntegerType()),
    T.StructField('col', T.ArrayType(T.StructType([
        T.StructField('id', T.LongType()),
        T.StructField('score', T.DoubleType()),
        T.StructField('Elastic_position', T.IntegerType()),
    ]))),
])

df = spark.read.json('a.json', multiLine=True, schema=schema)
df.show(10, False)
# +-------+--------------------------------------------------------------------------------------------------------------------------------------------+
# |user_id|col                                                                                                                                         |
# +-------+--------------------------------------------------------------------------------------------------------------------------------------------+
# |1234   |[{14577120145280, 64.71, 0}, {14568530280240, 88.53, 1}, {14568530119661, 63.75, 2}, {14568530205858, 62.79, 3}, {14568530414899, 60.88, 4}]|
# +-------+--------------------------------------------------------------------------------------------------------------------------------------------+
df.printSchema()
# root
#  |-- user_id: integer (nullable = true)
#  |-- col: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- id: long (nullable = true)
#  |    |    |-- score: double (nullable = true)
#  |    |    |-- Elastic_position: integer (nullable = true)
(df
    .select('user_id', F.explode('col'))  # explode names its output column `col` by default
    .groupBy('user_id')
    .pivot('col.Elastic_position')        # one output column per Elastic_position value
    .agg(F.first('col.score'))
    .show(10, False)
)
# Output:
# +-------+-----+-----+-----+-----+-----+
# |user_id|0    |1    |2    |3    |4    |
# +-------+-----+-----+-----+-----+-----+
# |1234   |64.71|88.53|63.75|62.79|60.88|
# +-------+-----+-----+-----+-----+-----+
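The flatten-then-pivot idea is not specific to Spark. As a minimal sketch, assuming the data is already in memory as the question's raw_data list, pd.json_normalize can flatten the nested records and DataFrame.pivot can spread the scores by Elastic_position. The col_ prefix on the output columns is an assumption made here for readable names; it is not part of the Spark answer.

```python
import pandas as pd

raw_data = [{
    "user_id": 1234,
    "col": [
        {"id": 14577120145280, "score": 64.71, "Elastic_position": 0},
        {"id": 14568530280240, "score": 88.53, "Elastic_position": 1},
        {"id": 14568530119661, "score": 63.75, "Elastic_position": 2},
        {"id": 14568530205858, "score": 62.79, "Elastic_position": 3},
        {"id": 14568530414899, "score": 60.88, "Elastic_position": 4},
    ],
}]

# Flatten: one row per element of "col", with user_id carried along as metadata.
flat = pd.json_normalize(raw_data, record_path="col", meta=["user_id"])

# Pivot: one row per user_id, one column per Elastic_position, values are the scores.
wide = (
    flat.pivot(index="user_id", columns="Elastic_position", values="score")
        .add_prefix("col_")          # assumed naming, e.g. col_0 ... col_4
        .rename_axis(columns=None)   # drop the leftover "Elastic_position" axis label
        .reset_index()
)
print(wide)
```

Pivoting on Elastic_position, as the Spark code does, keeps each score tied to its recorded position rather than to its order in the list.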
Answer from a uj5u.com user:
Try using pd.Series.explode with groupby:
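# Requires Python 3.8+ for the := (walrus) operator.
df = pd.DataFrame.from_dict(raw_data).explode('col')
(df.assign(col=df['col'].str['score'])   # keep only the score from each dict
   .groupby('user_id').agg(list)         # one list of scores per user_id
   .apply(lambda x: (y := x.explode()).set_axis(
       y.index + '_' + y.groupby(level=0).cumcount().astype(str)), axis=1)
   .reset_index())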
   user_id  col_0  col_1  col_2  col_3  col_4
0     1234  64.71  88.53  63.75  62.79  60.88
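This first builds a DataFrame and explodes the col column, then groups the repeated user_id values back into lists, performs a second explode on each row to go from long to wide, and uses cumcount to append the suffixes 0 through 4 to the column names.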
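The same reshaping can also be written one step at a time, which may be easier to follow than the one-liner. This is only a sketch, not the answerer's code: it numbers the scores with cumcount and then uses DataFrame.pivot for the long-to-wide step instead of a second explode. The names long_df and n are made up for illustration.

```python
import pandas as pd

raw_data = [{
    "user_id": 1234,
    "col": [
        {"id": 14577120145280, "score": 64.71, "Elastic_position": 0},
        {"id": 14568530280240, "score": 88.53, "Elastic_position": 1},
        {"id": 14568530119661, "score": 63.75, "Elastic_position": 2},
        {"id": 14568530205858, "score": 62.79, "Elastic_position": 3},
        {"id": 14568530414899, "score": 60.88, "Elastic_position": 4},
    ],
}]

# Step 1: one row per dict in "col" (long format), with a fresh 0..n-1 index.
long_df = pd.DataFrame(raw_data).explode("col", ignore_index=True)

# Step 2: pull the score out of each dict.
long_df["score"] = long_df["col"].map(lambda rec: rec["score"])

# Step 3: number the rows within each user_id: 0, 1, 2, ...
long_df["n"] = long_df.groupby("user_id").cumcount()

# Step 4: long to wide, one column per running number, named col_0, col_1, ...
wide = (
    long_df.pivot(index="user_id", columns="n", values="score")
           .add_prefix("col_")
           .rename_axis(columns=None)
           .reset_index()
)
print(wide)
```

Because the numbering comes from cumcount, the columns follow the order of the items in each list, so a user with fewer items would simply get NaN in the trailing columns.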

