I have a dataframe that looks like this:
columns = ['id', 'department', 'score']
vals = [
(1, 'AB', 141),
(2, 'AB', 140),
(3, 'AB', 210),
(4, 'AB', 120),
(5, 'EF', 20),
(6, 'EF', 15)
]
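I want to find the maximum score within each department and divide every score in that group by it. For example, in the case above:
max_val for AB is 210, and max_val for EF is 20.
The new dataset should be: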
(1, 'AB', 0.67),
(2, 'AB', 0.67),
(3, 'AB', 1.00),
(4, 'AB', 0.57),
(5, 'EF', 1.00),
(6, 'EF', 0.75)
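To double-check the expected numbers above, here is a minimal plain-Python sketch (no Spark involved) of the same two steps: find the maximum per department, then divide each score by it. The names max_by_dept and scaled are made up for illustration, and rounding to two decimals is an assumption taken from the expected output shown in the question.

```python
# Sample rows from the question: (id, department, score)
vals = [
    (1, 'AB', 141),
    (2, 'AB', 140),
    (3, 'AB', 210),
    (4, 'AB', 120),
    (5, 'EF', 20),
    (6, 'EF', 15),
]

# Step 1: maximum score per department.
max_by_dept = {}
for _row_id, dept, score in vals:
    max_by_dept[dept] = max(score, max_by_dept.get(dept, score))

# Step 2: divide each score by its department's maximum.
# Rounding to 2 decimals is an assumption based on the expected output.
scaled = [
    (row_id, dept, round(score / max_by_dept[dept], 2))
    for row_id, dept, score in vals
]

for row in scaled:
    print(row)
```

This only confirms the arithmetic on a small in-memory list; on a real Spark dataframe the division has to happen inside Spark rather than on collected rows.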
This is what I have tried so far:
>>> max_distance = df.groupby("department").agg({"score": "max"}).collect()
>>> max_distance
[Row(department='AB', max(score)=210.0), Row(department='EF', max(score)=20.0)]
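But how do I divide every row in each group by that maximum?
A uj5u.com user replied:
You should use max as a window function partitioned by department. That way every row carries its own group's maximum, and you can divide by it directly without collecting anything to the driver.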
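import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the dataframe from the question (columns: id, department, score)
data_sdf. \
    withColumn('maxval', func.max('score').over(wd.partitionBy('department'))). \
    withColumn('val_perc', func.col('score') / func.col('maxval')). \
    show()
# maxval is the per-department maximum; val_perc is score divided by that maximum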
# +---+----------+-----+------+------------------+
# | id|department|score|maxval|          val_perc|
# +---+----------+-----+------+------------------+
# |  5|        EF|   20|    20|               1.0|
# |  6|        EF|   15|    20|              0.75|
# |  1|        AB|  141|   210|0.6714285714285714|
# |  2|        AB|  140|   210|0.6666666666666666|
# |  3|        AB|  210|   210|               1.0|
# |  4|        AB|  120|   210|0.5714285714285714|
# +---+----------+-----+------+------------------+
