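I have a dataframe with a State column and a salary column. I need to group by state, count how many entries fall into each salary range (there are 3 salary ranges in total), build a dataframe from the counts, and sort the result by state name. Is there any function in Spark that can do this?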
Sample input
State salary
------ ------
NY 6
WI 15
NY 11
WI 2
MI 20
NY 15
Expected result:
State group1 group2 group3
MI 0 0 1
NY 0 1 2
WI 1 0 1
where
- Group1 is the number of salaries > 0 and <= 5
- Group2 is the number of salaries > 5 and <= 10
- Group3 is the number of salaries > 10 and <= 20
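To make the three ranges concrete, here is a minimal plain-Python sketch (not Spark code) that applies them to the sample input and reproduces the expected table. The names `rows`, `BUCKETS`, and `bucket_counts` are illustrative and do not come from the question.

```python
from collections import defaultdict

# Sample rows copied from the question: (state, salary)
rows = [("NY", 6), ("WI", 15), ("NY", 11), ("WI", 2), ("MI", 20), ("NY", 15)]

# Each range is (low, high], i.e. low < salary <= high
BUCKETS = [(0, 5), (5, 10), (10, 20)]

def bucket_counts(rows):
    """Return {state: [group1, group2, group3]}, ordered by state name."""
    counts = defaultdict(lambda: [0] * len(BUCKETS))
    for state, salary in rows:
        groups = counts[state]  # ensures the state appears even if no range matches
        for i, (low, high) in enumerate(BUCKETS):
            if low < salary <= high:
                groups[i] += 1
    return dict(sorted(counts.items()))

result = bucket_counts(rows)
for state, groups in result.items():
    print(state, *groups)
# MI 0 0 1
# NY 0 1 2
# WI 1 0 1
```

This can serve as a reference when checking the output of a Spark solution against the small sample.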
Basically, I am looking for something like this in Scala Spark:
df.groupBy('STATE').agg(count('*') as group1).where('SALARY' >0 and 'SALARY' <=5)
.agg(count('*') as group2).where('SALARY' >5 and 'SALARY' <=10)
.agg(count('*') as group3).where('SALARY' >10 and 'SALARY' <=20)
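Solution update:
Solution 1: I was able to solve it with the approach below, but I am not sure whether there is a simpler or more efficient way. Any pointers? dfWithoutSchema is the input dataframe.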
val newDf = dfWithoutSchema
  .withColumn("set1", when($"salary" > 0 and $"salary" <= 5, 1).otherwise(0))
  .withColumn("set2", when($"salary" > 5 and $"salary" <= 10, 1).otherwise(0))
  .withColumn("set3", when($"salary" > 10 and $"salary" <= 20, 1).otherwise(0))
val fdf = newDf.groupBy("state")
  .agg(sum("set1") as "group1", sum("set2") as "group2", sum("set3") as "group3")
  .sort("state")
Solution 2:
val agg_df = df.groupBy("State")
.agg(
count(when($"Salary" > 0 && $"Salary" <= 5, $"Salary")).as("group_1"),
count(when($"Salary" > 5 && $"Salary" <= 10, $"Salary")).as("group_2"),
count(when($"Salary" > 10 && $"Salary" <= 20, $"Salary")).as("group_3")
)
Answer from a uj5u.com user:
You can put the condition inside the count/sum aggregation itself.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [
{"State": "NY", "Salary": 6},
{"State": "WI", "Salary": 15},
{"State": "NY", "Salary": 11},
{"State": "WI", "Salary": 2},
{"State": "MI", "Salary": 20},
{"State": "NY", "Salary": 15},
]
df = spark.createDataFrame(data=data)
cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))
df = df.groupBy("State").agg(
cnt_cond((F.col("Salary") > 0) & (F.col("Salary") <= 5)).alias("group_1"),
cnt_cond((F.col("Salary") > 5) & (F.col("Salary") <= 10)).alias("group_2"),
cnt_cond((F.col("Salary") > 10) & (F.col("Salary") <= 20)).alias("group_3"),
)
Here sum gives the same result as count, because the expression evaluates to 1 when the condition is met and 0 otherwise.
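The equivalence can be checked with a small plain-Python model (not Spark code). It assumes the usual Spark semantics: `when(cond, value)` with no `otherwise` yields null when the condition fails, and `count(column)` skips nulls. The helper `in_range` and the variable names are illustrative only.

```python
# NY's salaries from the sample input
salaries = [6, 11, 15]

def in_range(s):
    # group_3 condition: salary > 10 and salary <= 20
    return 10 < s <= 20

# Models when(cond, salary) with no otherwise: None stands in for null
when_no_otherwise = [s if in_range(s) else None for s in salaries]
# Models when(cond, 1).otherwise(0)
when_otherwise = [1 if in_range(s) else 0 for s in salaries]

# count(...) counts only the non-null values; sum(...) adds up the 1s
count_style = sum(v is not None for v in when_no_otherwise)
sum_style = sum(when_otherwise)
print(count_style, sum_style)  # 2 2
```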
Using Scala:
val agg_df = df.groupBy("State")
.agg(
count(when($"Salary" > 0 && $"Salary" <= 5, $"Salary")).as("group_1"),
count(when($"Salary" > 5 && $"Salary" <= 10, $"Salary")).as("group_2"),
count(when($"Salary" > 10 && $"Salary" <= 20, $"Salary")).as("group_3")
)
Result:
+-----+-------+-------+-------+
|State|group_1|group_2|group_3|
+-----+-------+-------+-------+
|NY   |0      |1      |2      |
|WI   |1      |0      |1      |
|MI   |0      |0      |1      |
+-----+-------+-------+-------+
Answer from a uj5u.com user:
You can use an expression built from sum and case.
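from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [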
('NY', 6),
('WI', 15),
('NY', 11),
('WI', 2),
('MI', 20),
('NY', 15)
]
df = spark.createDataFrame(data, ['State', 'salary'])
df = df.groupBy('State').agg(F.expr('sum(case when salary>0 and salary<=5 then 1 else 0 end)').alias('group1'),
F.expr('sum(case when salary>5 and salary<=10 then 1 else 0 end)').alias('group2'),
F.expr('sum(case when salary>10 and salary<=20 then 1 else 0 end)').alias('group3'))
df.show(truncate=False)
