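I have a PySpark DataFrame whose columns are in the order shown below. I need to reorder them by "branch". How can I do that? df.select(sorted(df.columns)) doesn't seem to work the way I want.
Existing column order: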
store_id,
store_name,
month_1_branch_A_profit,
month_1_branch_B_profit,
month_1_branch_C_profit,
month_1_branch_D_profit,
month_2_branch_A_profit,
month_2_branch_B_profit,
month_2_branch_C_profit,
month_2_branch_D_profit,
.
.
month_12_branch_A_profit,
month_12_branch_B_profit,
month_12_branch_C_profit,
month_12_branch_D_profit
Desired column order:
store_id,
store_name,
month_1_branch_A_profit,
month_2_branch_A_profit,
month_3_branch_A_profit,
month_4_branch_A_profit,
.
.
month_12_branch_A_profit,
month_1_branch_B_profit,
month_2_branch_B_profit,
month_3_branch_B_profit,
.
.
month_12_branch_B_profit,
..
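A likely reason `sorted(df.columns)` falls short: it compares the names as plain strings, so the month stays the primary sort key, and `month_10` lands ahead of `month_1_` because the character `0` precedes `_`. A small plain-Python sketch of that behaviour (the shortened column list is made up for illustration):

```python
# Hypothetical, shortened column list: months 1, 2 and 10, branches A and B.
columns = [
    f"month_{m}_branch_{b}_profit"
    for m in (1, 2, 10)
    for b in ("A", "B")
]

# Plain string sort: still grouped by month, and month_10 comes first.
lexicographic = sorted(columns)
for name in lexicographic:
    print(name)
```

The names need to be parsed into a branch and a numeric month before sorting, which is what the answers below do in different ways.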
Answer from a uj5u.com user:
You can build the column list manually.
col_fmt = 'month_{}_branch_{}_profit'
cols = ['store_id', 'store_name']
for branch in ['A', 'B', 'C', 'D']:
    for i in range(1, 13):
        cols.append(col_fmt.format(i, branch))
df.select(cols)
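Since `select` returns only the columns it is given, a hand-built list can silently drop a column, and a misspelled name will fail when the query is analyzed. A minimal sanity check in plain Python, assuming a hypothetical helper `same_columns` and a stand-in list `existing` playing the role of `df.columns`:

```python
def same_columns(existing, reordered):
    """Return True when both lists hold exactly the same names, ignoring order."""
    return sorted(existing) == sorted(reordered)

col_fmt = "month_{}_branch_{}_profit"
branches = ["A", "B", "C", "D"]

# Stand-in for df.columns (month-major order, as in the question).
existing = ["store_id", "store_name"] + [
    col_fmt.format(i, b) for i in range(1, 13) for b in branches
]

# Branch-major order, equivalent to the loop in this answer.
cols = ["store_id", "store_name"] + [
    col_fmt.format(i, b) for b in branches for i in range(1, 13)
]

assert same_columns(existing, cols)  # nothing dropped, nothing misspelled
```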
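Alternatively, I would suggest building a better DataFrame that makes use of array and struct/map data types. For example:
months - array (size 12)
  - branches: map<string, struct>
    - key: string (branch name)
    - value: struct
      - profit: float
This way the array is already "sorted". The map order doesn't matter, and SQL queries for a specific month and branch become easier to read (and possibly faster with predicate pushdown).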
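To make the suggested layout concrete, here is a plain-Python sketch that reshapes one flat record into that nested shape. The helper name `nest_row`, the sample values and the 0-based position of each month in the list are assumptions for illustration; in Spark the same shape would be declared with array, map and struct types.

```python
import re

# Matches names like month_3_branch_B_profit (pattern assumed from the question).
COLUMN_RE = re.compile(r"^month_(\d+)_branch_([A-Za-z]+)_profit$")

def nest_row(flat_row, n_months=12):
    """Reshape a flat {column: value} record into months -> branches -> profit."""
    months = [{"branches": {}} for _ in range(n_months)]
    nested = {"months": months}
    for name, value in flat_row.items():
        match = COLUMN_RE.match(name)
        if match is None:
            nested[name] = value  # store_id, store_name, ... pass through
            continue
        month, branch = int(match.group(1)), match.group(2)
        months[month - 1]["branches"][branch] = {"profit": value}
    return nested

# Made-up sample record with two months and two branches.
row = {
    "store_id": 1,
    "store_name": "demo",
    "month_1_branch_A_profit": 10.0,
    "month_1_branch_B_profit": 20.0,
    "month_2_branch_A_profit": 30.0,
    "month_2_branch_B_profit": 40.0,
}
nested = nest_row(row, n_months=2)
print(nested["months"][1]["branches"]["A"]["profit"])  # month 2, branch A
```

With this shape, the position in `months` carries the month order, so no column reordering is needed.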
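Answer from a uj5u.com user:
You will probably need a bit of Python code. In the following script, I split each column name on the underscore _, then sort by element [3] (the branch name) and element [1] (the month number).
Input df: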
cols = ['store_id',
'store_name',
'month_1_branch_A_profit',
'month_1_branch_B_profit',
'month_1_branch_C_profit',
'month_1_branch_D_profit',
'month_2_branch_A_profit',
'month_2_branch_B_profit',
'month_2_branch_C_profit',
'month_2_branch_D_profit',
'month_12_branch_A_profit',
'month_12_branch_B_profit',
'month_12_branch_C_profit',
'month_12_branch_D_profit']
df = spark.createDataFrame([], ','.join([f'{c} int' for c in cols]))
Script:
branch_cols = [c for c in df.columns if c not in {'store_id', 'store_name'}]
d = {tuple(c.split('_')): c for c in branch_cols}
df = df.select(
    'store_id', 'store_name',
    *[d[c] for c in sorted(d, key=lambda x: f'{x[3]}_{int(x[1]):02}')]
)
df.printSchema()
# root
# |-- store_id: integer (nullable = true)
# |-- store_name: integer (nullable = true)
# |-- month_1_branch_A_profit: integer (nullable = true)
# |-- month_2_branch_A_profit: integer (nullable = true)
# |-- month_12_branch_A_profit: integer (nullable = true)
# |-- month_1_branch_B_profit: integer (nullable = true)
# |-- month_2_branch_B_profit: integer (nullable = true)
# |-- month_12_branch_B_profit: integer (nullable = true)
# |-- month_1_branch_C_profit: integer (nullable = true)
# |-- month_2_branch_C_profit: integer (nullable = true)
# |-- month_12_branch_C_profit: integer (nullable = true)
# |-- month_1_branch_D_profit: integer (nullable = true)
# |-- month_2_branch_D_profit: integer (nullable = true)
# |-- month_12_branch_D_profit: integer (nullable = true)
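A variant of the same idea, sketched in plain Python: parse each name with a regular expression and sort on a `(branch, month)` tuple, so the month is compared as an integer without zero-padding. The helper name `branch_major_order` and the regex are assumptions based on the column names in the question; names that don't match the pattern are kept in front in their original order.

```python
import re

COLUMN_RE = re.compile(r"^month_(\d+)_branch_(.+)_profit$")

def branch_major_order(columns):
    """Return non-matching names first, then the rest sorted by (branch, month)."""
    leading, keyed = [], []
    for name in columns:
        match = COLUMN_RE.match(name)
        if match is None:
            leading.append(name)
        else:
            keyed.append(((match.group(2), int(match.group(1))), name))
    return leading + [name for _, name in sorted(keyed)]

columns = ["store_id", "store_name"] + [
    f"month_{m}_branch_{b}_profit" for m in (1, 2, 12) for b in "ABCD"
]
ordered = branch_major_order(columns)
# With a DataFrame at hand, this would be used as: df.select(ordered)
```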
