Is there a Scala Spark function to do groupBy, then filter, then aggregate?

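I have a DataFrame with a state column and a salary column. I need to group by state and find how many entries fall into each salary range (there are 3 salary ranges in total), build a DataFrame from that, and sort the result by state name. Is there any function in Spark that can do this?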

Sample input 

State  salary
------ ------
NY      6
WI      15
NY      11
WI      2
MI      20
NY      15 
 
Expected result:

State    group1   group2  group3
 MI         0       0       1  
 NY         0       1       2
 WI         1       0       1

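where

Group1 is the number of salaries > 0 and <= 5
Group2 is the number of salaries > 5 and <= 10
Group3 is the number of salaries > 10 and <= 20

Basically, I am looking for something like this in Scala Spark: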

df.groupBy('STATE').agg(count('*') as group1).where('SALARY' >0 and 'SALARY' <=5)
.agg(count('*') as group2).where('SALARY' >5 and 'SALARY' <=10)
.agg(count('*') as group3).where('SALARY' >10 and 'SALARY' <=20)

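Solution update:

Solution 1: I was able to solve it with the approach below, but I am not sure whether there is a simpler or more efficient way. Any pointers? dfWithoutSchema is the input DataFrame.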

val newDf = dfWithoutSchema.withColumn("set1", when($"salary">0 and $"salary" <= 5, 1).otherwise(0)).withColumn("set2", when($"salary">5 and $"salary" <= 10, 1).otherwise(0)).withColumn("set3", when($"salary">10 and $"salary" <= 20, 1).otherwise(0))
val fdf=newDf.groupBy("state").agg(sum("set1") as "group1",sum("set2") as "group2",sum("set3") as "group3").sort("state")

Solution 2:

val agg_df = df.groupBy("State")
    .agg(
        count(when($"Salary" > 0 && $"Salary" <= 5, $"Salary")).as("group_1"),
        count(when($"Salary" > 5 && $"Salary" <= 10, $"Salary")).as("group_2"),
        count(when($"Salary" > 10 && $"Salary" <= 20, $"Salary")).as("group_3")
    )

Answer from a uj5u.com user:

You can put the condition inside the count/sum aggregation.

Example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [
    {"State": "NY", "Salary": 6},
    {"State": "WI", "Salary": 15},
    {"State": "NY", "Salary": 11},
    {"State": "WI", "Salary": 2},
    {"State": "MI", "Salary": 20},
    {"State": "NY", "Salary": 15},
]
df = spark.createDataFrame(data=data)
cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))
df = df.groupBy("State").agg(
    cnt_cond((F.col("Salary") > 0) & (F.col("Salary") <= 5)).alias("group_1"),
    cnt_cond((F.col("Salary") > 5) & (F.col("Salary") <= 10)).alias("group_2"),
    cnt_cond((F.col("Salary") > 10) & (F.col("Salary") <= 20)).alias("group_3"),
)

This sum is equivalent to a count, because it checks the condition and returns 1 when the condition is met and 0 otherwise.
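To make that equivalence concrete, here is a minimal plain-Python sketch (no Spark needed) that models the two aggregation styles used in this thread. It assumes the usual Spark semantics: `when` without `otherwise` yields null for non-matching rows, and `count` skips nulls. The helper names `count_when` and `sum_when` are made up for illustration and are not part of any Spark API.

```python
def count_when(values, cond):
    """Model of count(when(cond, value)): non-matching rows become None and are not counted."""
    projected = [v if cond(v) else None for v in values]
    return sum(1 for v in projected if v is not None)


def sum_when(values, cond):
    """Model of sum(when(cond, 1).otherwise(0)): every row contributes 1 or 0."""
    return sum(1 if cond(v) else 0 for v in values)


# Salaries of the NY rows from the sample input.
ny_salaries = [6, 11, 15]


def in_group_3(salary):
    # group 3 is the half-open range (10, 20]
    return 10 < salary <= 20


count_result = count_when(ny_salaries, in_group_3)
sum_result = sum_when(ny_salaries, in_group_3)
print(count_result, sum_result)
```

Both styles give the same number for every group. One caveat: if the `otherwise(0)` is dropped from the sum variant, a group with no matching rows would, as far as I know, come back as null rather than 0, so keep the `otherwise(0)` or use the count variant.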

Using Scala:

val agg_df = df.groupBy("State")
    .agg(
        count(when($"Salary" > 0 && $"Salary" <= 5, $"Salary")).as("group_1"),
        count(when($"Salary" > 5 && $"Salary" <= 10, $"Salary")).as("group_2"),
        count(when($"Salary" > 10 && $"Salary" <= 20, $"Salary")).as("group_3")
    )

Result:

+-----+-------+-------+-------+
|State|group_1|group_2|group_3|
+-----+-------+-------+-------+
|NY   |0      |1      |2      |
|WI   |1      |0      |1      |
|MI   |0      |0      |1      |
+-----+-------+-------+-------+
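Note that the rows above are not sorted by state, while the question asks for that; in Spark, adding a `.sort("State")` as in Solution 1 should take care of it. As a cross-check of the numbers, the following plain-Python sketch recomputes the table from the sample input and sorts it by state. The names `buckets` and `bucket_counts` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Sample input from the question: (state, salary) pairs.
rows = [("NY", 6), ("WI", 15), ("NY", 11), ("WI", 2), ("MI", 20), ("NY", 15)]

# Salary ranges as (exclusive lower bound, inclusive upper bound).
buckets = [(0, 5), (5, 10), (10, 20)]


def bucket_counts(rows):
    """Count, per state, how many salaries fall into each bucket; sorted by state."""
    counts = defaultdict(lambda: [0] * len(buckets))
    for state, salary in rows:
        per_state = counts[state]  # touch the key so every state gets a row
        for i, (low, high) in enumerate(buckets):
            if low < salary <= high:
                per_state[i] += 1
    return sorted((state, *c) for state, c in counts.items())


table = bucket_counts(rows)
for row in table:
    print(row)
```

This yields the same counts as the Spark output, ordered MI, NY, WI as in the expected result of the question.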

Answer from a uj5u.com user:

You can use a SQL expression built from sum and case.

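from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [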
    ('NY', 6),
    ('WI', 15),
    ('NY', 11),
    ('WI', 2),
    ('MI', 20),
    ('NY', 15)
]
df = spark.createDataFrame(data, ['State', 'salary'])
df = df.groupBy('State').agg(F.expr('sum(case when salary>0 and salary<=5 then 1 else 0 end)').alias('group1'),
                             F.expr('sum(case when salary>5 and salary<=10 then 1 else 0 end)').alias('group2'),
                             F.expr('sum(case when salary>10 and salary<=20 then 1 else 0 end)').alias('group3'))
df.show(truncate=False)
