Reordering PySpark DataFrame columns by a specific sort logic

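I have a PySpark DataFrame whose columns are in the order shown below. I need to order them by "branch". How can I do that? df.select(sorted(df.columns)) doesn't seem to work the way I want.

Existing column order: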

store_id,
store_name,
month_1_branch_A_profit,
month_1_branch_B_profit,
month_1_branch_C_profit,
month_1_branch_D_profit,
month_2_branch_A_profit,
month_2_branch_B_profit,
month_2_branch_C_profit,
month_2_branch_D_profit,
.
.
month_12_branch_A_profit,
month_12_branch_B_profit,
month_12_branch_C_profit,
month_12_branch_D_profit

Desired column order:

store_id,
store_name,
month_1_branch_A_profit,
month_2_branch_A_profit,
month_3_branch_A_profit,
month_4_branch_A_profit,
.
.
month_12_branch_A_profit,
month_1_branch_B_profit,
month_2_branch_B_profit,
month_3_branch_B_profit,
.
.
month_12_branch_B_profit,
..

Answer 1:

You can build the column list manually.

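col_fmt = 'month_{}_branch_{}_profit'
cols = ['store_id', 'store_name']
# Outer loop over branches, inner loop over months, so that all months
# of branch A come first, then all months of branch B, and so on.
for branch in ['A', 'B', 'C', 'D']:
    for i in range(1, 13):
        cols.append(col_fmt.format(i, branch))
# select() returns a new DataFrame, so keep the result.
df = df.select(cols)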
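
To see why df.select(sorted(df.columns)) falls short, it may help to compare the orderings on the column names alone. The following is a minimal sketch in plain Python (no Spark session needed); the variable names `existing`, `alphabetical`, and `wanted` are introduced here for illustration, and it assumes the DataFrame has exactly the 2 id columns plus 12 months x 4 branches from the question.

```python
# Rebuild the column names from the question (plain Python, no Spark needed).
col_fmt = 'month_{}_branch_{}_profit'
branches = ['A', 'B', 'C', 'D']
id_cols = ['store_id', 'store_name']

# Existing order: month varies slowest, branch varies fastest.
existing = id_cols + [col_fmt.format(m, b) for m in range(1, 13) for b in branches]

# sorted() compares the names as plain strings, so 'month_10_...' sorts before
# 'month_1_...' (the character '0' is less than '_'), and the id columns,
# which start with 's', end up last.
alphabetical = sorted(existing)
print(alphabetical[:2])
print(alphabetical[-2:])

# Desired order: branch varies slowest, month varies fastest.
wanted = id_cols + [col_fmt.format(m, b) for b in branches for m in range(1, 13)]
print(wanted[:5])

# With a real DataFrame `df` (assumed to have exactly these columns):
# df = df.select(wanted)
```

The plain sort puts month_10 ahead of month_1 and moves the id columns to the end, which is why an explicit list (or a custom sort key) is needed.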

Alternatively, I'd suggest building a better DataFrame that makes use of the array, struct, and map data types. For example

months - array (size 12)
  - branches: map<string, struct>
    - key: string  (branch name)
    - value: struct
      - profit: float

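That way, the array is already "sorted". The map order doesn't matter, and this layout makes SQL queries for a specific month and branch easier to read (and possibly faster with predicate pushdown).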
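
As a rough illustration of that nested layout, here is a sketch in plain Python that reshapes one flat row into a months array of branch maps. The helper name `nest_row` and the sample values are hypothetical, and the sketch assumes months are contiguous and start at 1, so that array index 0 holds month 1. In Spark, the corresponding column type could be declared with a DDL string such as `months array<map<string, struct<profit: float>>>`.

```python
import re

# Matches names like 'month_3_branch_B_profit' and captures month and branch.
PROFIT_COL = re.compile(r'month_(\d+)_branch_([^_]+)_profit')

def nest_row(flat_row):
    """Reshape one flat row (a dict) into {..., 'months': [ {branch: {'profit': v}} ]}.

    Hypothetical helper for illustration; assumes months are contiguous from 1.
    """
    by_month = {}
    nested = {}
    for name, value in flat_row.items():
        match = PROFIT_COL.fullmatch(name)
        if match is None:
            # Not a profit column (e.g. store_id): copy through unchanged.
            nested[name] = value
            continue
        month, branch = int(match.group(1)), match.group(2)
        by_month.setdefault(month, {})[branch] = {'profit': value}
    # Sort by numeric month so the array position encodes the month.
    nested['months'] = [by_month[m] for m in sorted(by_month)]
    return nested

# Made-up sample row, for illustration only.
flat = {
    'store_id': 1,
    'store_name': 'Main',
    'month_1_branch_A_profit': 10.0,
    'month_1_branch_B_profit': 20.0,
    'month_2_branch_A_profit': 30.0,
    'month_2_branch_B_profit': 40.0,
}
nested = nest_row(flat)
print(nested['months'][1]['B']['profit'])  # month 2, branch B
```

With this shape there are no column names left to reorder: the month is the array position and the branch is a map key.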

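Answer 2:

You will probably need a bit of Python code. In the following script, I split the column names on the underscore _ and then sort by element [3] (the branch name) and element [1] (the month number).

Input df: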

cols = ['store_id',
        'store_name',
        'month_1_branch_A_profit',
        'month_1_branch_B_profit',
        'month_1_branch_C_profit',
        'month_1_branch_D_profit',
        'month_2_branch_A_profit',
        'month_2_branch_B_profit',
        'month_2_branch_C_profit',
        'month_2_branch_D_profit',
        'month_12_branch_A_profit',
        'month_12_branch_B_profit',
        'month_12_branch_C_profit',
        'month_12_branch_D_profit']
df = spark.createDataFrame([], ','.join([f'{c} int' for c in cols]))

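Script:

# Everything except the two id columns follows the month_<n>_branch_<x>_profit pattern.
branch_cols = [c for c in df.columns if c not in {'store_id', 'store_name'}]
# Map the split name, e.g. ('month', '1', 'branch', 'A', 'profit'), back to the full column name.
d = {tuple(c.split('_')): c for c in branch_cols}
df = df.select(
    'store_id', 'store_name',
    # Sort key is '<branch>_<zero-padded month>', e.g. 'A_01', so month 2 sorts before month 12.
    *[d[c] for c in sorted(d, key=lambda x: f'{x[3]}_{int(x[1]):02}')]
)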

df.printSchema()
# root
#  |-- store_id: integer (nullable = true)
#  |-- store_name: integer (nullable = true)
#  |-- month_1_branch_A_profit: integer (nullable = true)
#  |-- month_2_branch_A_profit: integer (nullable = true)
#  |-- month_12_branch_A_profit: integer (nullable = true)
#  |-- month_1_branch_B_profit: integer (nullable = true)
#  |-- month_2_branch_B_profit: integer (nullable = true)
#  |-- month_12_branch_B_profit: integer (nullable = true)
#  |-- month_1_branch_C_profit: integer (nullable = true)
#  |-- month_2_branch_C_profit: integer (nullable = true)
#  |-- month_12_branch_C_profit: integer (nullable = true)
#  |-- month_1_branch_D_profit: integer (nullable = true)
#  |-- month_2_branch_D_profit: integer (nullable = true)
#  |-- month_12_branch_D_profit: integer (nullable = true)
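
The same ordering could also be expressed with a tuple sort key instead of a zero-padded string, which avoids having to choose a padding width. This is a sketch on the column names only; the helper name `branch_then_month` is hypothetical, and it assumes every non-id column follows the month_<n>_branch_<x>_profit pattern.

```python
def branch_then_month(columns, fixed=('store_id', 'store_name')):
    """Return the fixed columns first, then the rest ordered by branch, then month.

    Hypothetical helper; assumes all other names look like
    'month_<n>_branch_<x>_profit'.
    """
    head = [c for c in fixed if c in columns]
    rest = [c for c in columns if c not in fixed]

    def sort_key(name):
        parts = name.split('_')  # e.g. ['month', '12', 'branch', 'A', 'profit']
        # Tuples compare element by element: branch first, then month as an int.
        return parts[3], int(parts[1])

    return head + sorted(rest, key=sort_key)

# Same column names as the input df above.
cols = ['store_id', 'store_name'] + [
    f'month_{m}_branch_{b}_profit' for m in (1, 2, 12) for b in 'ABCD'
]
ordered = branch_then_month(cols)
print(ordered[:5])

# With a real DataFrame `df`:
# df = df.select(*branch_then_month(df.columns))
```

Because the month is compared as an integer, this variant keeps working if the numeric part ever grows beyond two digits.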


Tags: apache-spark, sorting, pyspark, apache-spark-sql, multiple-columns
