I have an initial DataFrame:
df1 =
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  1| 10|
|  1|  2| 11|
|  1|  2| 12|
|  3|  1| 13|
|  2|  1| 14|
|  2|  1| 15|
|  2|  1| 16|
|  4|  1| 17|
|  4|  2| 18|
|  4|  3| 19|
|  4|  4| 19|
|  4|  5| 20|
|  4|  5| 20|
+---+---+---+
Using PySpark, I applied the collect_list function over a window partitioned by column 'A' and ordered by column 'B' to create a column of cumulative lists:
from pyspark.sql import Window
from pyspark.sql.functions import collect_list

spec = Window.partitionBy('A').orderBy('B')
df1 = df1.withColumn('D', collect_list('C').over(spec))
df1.orderBy('A', 'B').show(truncate=False)
+---+---+---+------------------------+
|A  |B  |C  |D                       |
+---+---+---+------------------------+
|1  |1  |10 |[10]                    |
|1  |2  |11 |[10, 11, 12]            |
|1  |2  |12 |[10, 11, 12]            |
|2  |1  |14 |[14, 15, 16]            |
|2  |1  |15 |[14, 15, 16]            |
|2  |1  |16 |[14, 15, 16]            |
|3  |1  |13 |[13]                    |
|4  |1  |17 |[17]                    |
|4  |2  |18 |[17, 18]                |
|4  |3  |19 |[17, 18, 19]            |
|4  |4  |19 |[17, 18, 19, 19]        |
|4  |5  |20 |[17, 18, 19, 19, 20, 20]|
|4  |5  |20 |[17, 18, 19, 19, 20, 20]|
+---+---+---+------------------------+
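A note on why rows that tie on B share the same list: when a window has an ordering but no explicit frame, Spark uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so every row with the same B value in a partition sees all of its peers. Below is a minimal plain-Python sketch of that behaviour, not Spark code. The helper name cumulative_lists and the tuple layout (A, B, C) are assumptions made for illustration, and the order of tied values inside each list simply follows input order here, which Spark does not guarantee.

```python
from itertools import groupby

# Rows of df1 as (A, B, C) tuples, in the original order.
rows = [
    (1, 1, 10), (1, 2, 11), (1, 2, 12), (3, 1, 13),
    (2, 1, 14), (2, 1, 15), (2, 1, 16),
    (4, 1, 17), (4, 2, 18), (4, 3, 19), (4, 4, 19), (4, 5, 20), (4, 5, 20),
]

def cumulative_lists(rows):
    """Hypothetical helper: return (A, B, C, D) tuples where D holds every C
    in the same A partition whose B is <= the current row's B (ties included)."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # stable sort by A, then B
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        running = []
        # Rows with equal B are "peers": they all receive the same list.
        for _, peers in groupby(partition, key=lambda r: r[1]):
            peers = list(peers)
            running = running + [c for _, _, c in peers]  # new list per B value
            out.extend((a, b, c, running) for a, b, c in peers)
    return out

result = cumulative_lists(rows)
for row in result:
    print(row)
```

Switching the frame to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW in Spark would instead grow the list one row at a time, which is a different result from the one shown above.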
Is it possible to do the same computation with a pandas DataFrame?
I tried some "plain" Python code, but there is probably a more direct way.
A uj5u.com user replied:
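One way to solve this in pandas is to use two groupbys: first group the DataFrame on column A, then apply a custom function collect_list to each group, which in turn groups its input by column B and cumulatively aggregates column C using list: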
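def collect_list(g):
    # Within one A group: one list of C per B value, then a running concatenation
    return g.groupby('B')['C'].agg(list).cumsum()

# df is the pandas version of df1; merge attaches D on the shared columns A and B
df.sort_values(['A', 'B']).merge(
    df.groupby('A').apply(collect_list).reset_index(name='D'))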
    A  B   C                         D
0   1  1  10                      [10]
1   1  2  11              [10, 11, 12]
2   1  2  12              [10, 11, 12]
4   2  1  14              [14, 15, 16]
5   2  1  15              [14, 15, 16]
6   2  1  16              [14, 15, 16]
3   3  1  13                      [13]
7   4  1  17                      [17]
8   4  2  18                  [17, 18]
9   4  3  19              [17, 18, 19]
10  4  4  19          [17, 18, 19, 19]
11  4  5  20  [17, 18, 19, 19, 20, 20]
12  4  5  20  [17, 18, 19, 19, 20, 20]
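For anyone who wants to run this end to end, here is a self-contained sketch of the same two-level idea. It is a variation rather than the answer's exact code: it builds the running lists with itertools.accumulate instead of calling cumsum on a column of lists, and it avoids groupby.apply. The names per_key and result are assumptions introduced here, and the DataFrame is rebuilt from the values shown in the question.

```python
from itertools import accumulate

import pandas as pd

# The question's data, rebuilt as a pandas DataFrame.
df = pd.DataFrame({
    "A": [1, 1, 1, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4],
    "B": [1, 2, 2, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5],
    "C": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 20, 20],
})

# Step 1: one list of C values per (A, B) pair, sorted by A and then B.
per_key = df.groupby(["A", "B"])["C"].agg(list).reset_index(name="D")

# Step 2: within each A, concatenate the lists cumulatively in B order.
# accumulate() defaults to "+", which for lists builds a new, longer list.
per_key["D"] = [
    acc
    for _, lists in per_key.groupby("A")["D"]
    for acc in accumulate(lists)
]

# Step 3: attach the cumulative list to every original row by its (A, B) key.
result = (
    df.sort_values(["A", "B"], kind="stable")
      .merge(per_key, on=["A", "B"], how="left")
)
print(result)
```

Because the lists are computed once per (A, B) pair and then merged back, rows that tie on B get identical lists, matching the PySpark output in the question.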
