Sort an RDD by two values and get the top 10 per group

Suppose I have the following RDD in pyspark, where each row is a list:

[foo, apple]
[foo, orange]
[foo, apple]
[foo, apple]
[foo, grape]
[foo, grape]
[foo, plum]
[bar, orange]
[bar, orange]
[bar, orange]
[bar, grape]
[bar, apple]
[bar, apple]
[bar, plum]
[scrog, apple]
[scrog, apple]
[scrog, orange]
[scrog, orange]
[scrog, grape]
[scrog, plum]

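I'd like to show the top 3 fruits (index 1) for each group (index 0), ordered by the count of each fruit. For simplicity, assume I don't care much about ties (e.g. scrog has a count of 1 for both grape and plum; I don't care which one is picked).

My goal is output like the following: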

foo, apple, 3
foo, grape, 2
foo, orange, 1
bar, orange, 3
bar, apple, 2
bar, plum, 1   # <------- NOTE: could also be "grape" of count 1
scrog, orange, 2  # <---------- NOTE: "scrog" has many ties, which is okay
scrog, apple, 2
scrog, grape, 1

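I can think of one way that is probably inefficient:

get the unique groups and .collect() them as a list
filter the rdd by group, then count and sort the fruits
use something like zipWithIndex() to get the top 3
save as a new RDD with the format (<group>, <fruit>, <count>)
union all the RDDs at the end

But I'm interested not only in more Spark-specific approaches, but also in ones that skip expensive operations such as collect() and zipWithIndex().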

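As a bonus (not required): if I did want to apply some sorting or filtering to resolve ties, where would that best be done?

Any advice is much appreciated!

Update: in my context, I can't use dataframes; I must use RDDs only.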

A uj5u.com user replied:

`map` and `reduceByKey` operations in pyspark

Sum the counts with `.reduceByKey`, group with `.groupByKey`, and select the top 3 of each group with `.map` and `heapq.nlargest`.

rdd = sc.parallelize([
    ["foo", "apple"], ["foo", "orange"], ["foo", "apple"], ["foo", "apple"],
    ["foo", "grape"], ["foo", "grape"], ["foo", "plum"], ["bar", "orange"],
    ["bar", "orange"], ["bar", "orange"], ["bar", "grape"], ["bar", "apple"],
    ["bar", "apple"], ["bar", "plum"], ["scrog", "apple"], ["scrog", "apple"],
    ["scrog", "orange"], ["scrog", "orange"], ["scrog", "grape"], ["scrog", "plum"]
])

from operator import add
from heapq import nlargest

n = 3

results = rdd.map(lambda x: ((x[0], x[1]), 1)).reduceByKey(add) \
             .map(lambda x: (x[0][0], (x[1], x[0][1]))).groupByKey() \
             .map(lambda x: (x[0], nlargest(n, x[1])))

print( results.collect() )
# [('bar', [(3, 'orange'), (2, 'apple'), (1, 'plum')]),
#  ('scrog', [(2, 'orange'), (2, 'apple'), (1, 'plum')]),
#  ('foo', [(3, 'apple'), (2, 'grape'), (1, 'plum')])]
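
Note that `.groupByKey` gathers every (count, fruit) pair of a group before `nlargest` runs. If groups can be large, one possible alternative is to keep only a bounded top-n list per group, for instance by passing two small functions to `RDD.aggregateByKey` in place of `.groupByKey` and the final `.map`. The following is only a sketch and not part of the original answer: `seq_op`, `comb_op`, `partition_a` and `partition_b` are hypothetical names, and the two partitions are emulated with plain Python lists so the example runs without Spark.

```python
from functools import reduce
from heapq import nlargest

N = 3  # how many entries to keep per group


def seq_op(top, item):
    """Fold one (count, fruit) pair into a partial top-N list."""
    return nlargest(N, top + [item])


def comb_op(left, right):
    """Merge two partial top-N lists, e.g. coming from two partitions."""
    return nlargest(N, left + right)


# Assumption: plain-Python stand-ins for two partitions holding the
# (count, fruit) pairs of the group "bar".
partition_a = [(3, "orange"), (1, "grape")]
partition_b = [(2, "apple"), (1, "plum")]

top_a = reduce(seq_op, partition_a, [])
top_b = reduce(seq_op, partition_b, [])
top_bar = comb_op(top_a, top_b)
print(top_bar)

# With Spark, the same two functions would be passed to the RDD of
# (group, (count, fruit)) pairs, roughly as:
#   counts_by_group.aggregateByKey([], seq_op, comb_op)
# where counts_by_group is a hypothetical name for that RDD.
```

Because the pairs are (count, fruit) tuples, ties on the count are broken by the fruit name in descending order, just as in the answer above.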

Standard Python

For comparison, if you have a plain Python list instead of an rdd, the simplest way to do the grouping in Python is with dictionaries:

data = [
    ["foo", "apple"], ["foo", "orange"], ["foo", "apple"], ["foo", "apple"],
    ["foo", "grape"], ["foo", "grape"], ["foo", "plum"], ["bar", "orange"],
    ["bar", "orange"], ["bar", "orange"], ["bar", "grape"], ["bar", "apple"],
    ["bar", "apple"], ["bar", "plum"], ["scrog", "apple"], ["scrog", "apple"],
    ["scrog", "orange"], ["scrog", "orange"], ["scrog", "grape"], ["scrog", "plum"]
]

from heapq import nlargest
from operator import itemgetter

d = {}
for k,v in data:
    d.setdefault(k, {})
    d[k][v] = d[k].get(v, 0) + 1
print(d)
# {'foo': {'apple': 3, 'orange': 1, 'grape': 2, 'plum': 1}, 'bar': {'orange': 3, 'grape': 1, 'apple': 2, 'plum': 1}, 'scrog': {'apple': 2, 'orange': 2, 'grape': 1, 'plum': 1}}

n = 3
results = [(k,v,c) for k,subd in d.items()
                   for v,c in nlargest(n, subd.items(), key=itemgetter(1))]
print(results)
# [('foo', 'apple', 3), ('foo', 'grape', 2), ('foo', 'orange', 1), ('bar', 'orange', 3), ('bar', 'apple', 2), ('bar', 'grape', 1), ('scrog', 'apple', 2), ('scrog', 'orange', 2), ('scrog', 'grape', 1)]
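
As for the bonus question about ties, one option in the plain-Python version is to sort on an explicit key, so that equal counts are ordered by fruit name. The sketch below is an illustration under that assumption rather than part of the original answer; it swaps the nested dictionaries for `collections.Counter` and swaps `nlargest` for `sorted` with the key `(-count, fruit)`.

```python
from collections import Counter, defaultdict

data = [
    ["foo", "apple"], ["foo", "orange"], ["foo", "apple"], ["foo", "apple"],
    ["foo", "grape"], ["foo", "grape"], ["foo", "plum"], ["bar", "orange"],
    ["bar", "orange"], ["bar", "orange"], ["bar", "grape"], ["bar", "apple"],
    ["bar", "apple"], ["bar", "plum"], ["scrog", "apple"], ["scrog", "apple"],
    ["scrog", "orange"], ["scrog", "orange"], ["scrog", "grape"], ["scrog", "plum"]
]

n = 3

# One Counter of fruits per group.
counts = defaultdict(Counter)
for group, fruit in data:
    counts[group][fruit] += 1

# Highest count first; equal counts are ordered alphabetically by fruit name.
results = [
    (group, fruit, count)
    for group, counter in counts.items()
    for fruit, count in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
]
print(results)
```

With this key, the third entry of each group is decided by name whenever counts are tied, so the result no longer depends on insertion order.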

A uj5u.com user replied:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col
spark = (SparkSession.builder.appName("foo").getOrCreate())

initial_list = [["foo", "apple"], ["foo", "orange"],
            ["foo", "apple"], ["foo", "apple"],
            ["foo", "grape"], ["foo", "grape"],
            ["foo", "plum"], ["bar", "orange"],
            ["bar", "orange"], ["bar", "orange"],
            ["bar", "grape"], ["bar", "apple"],
            ["bar", "apple"], ["bar", "plum"],
            ["scrog", "apple"], ["scrog", "apple"],
            ["scrog", "orange"], ["scrog", "orange"],
            ["scrog", "grape"], ["scrog", "plum"]]
# creating rdd
rdd = spark.sparkContext.parallelize(initial_list)
# converting rdd to dataframe
df = rdd.toDF()

# group by index 0 and index 1 to get count of each
df2 = df.groupby(df._1, df._2).count()

window = Window.partitionBy(df2['_1']).orderBy(df2['count'].desc())
# keep only ranks 1 to 3 per group, by decreasing count
# (rank() gives tied counts the same rank, so a group with ties can yield more than 3 rows)
df3 = df2.select('*', rank().over(window).alias('rank')).filter(col('rank') <= 3)
# display the required dataframe
df3.select('_1', '_2', 'count').orderBy('_1', col('count').desc()).show()
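
One thing to keep in mind with the window approach: `rank()` assigns tied counts the same rank, so filtering on `rank <= 3` can return more than three rows for a group, whereas `row_number()` (also in `pyspark.sql.functions`) numbers rows consecutively. The difference is emulated below in plain Python so it can be run without Spark; `rank_and_row_number` is a hypothetical helper written for this illustration, not a Spark API.

```python
def rank_and_row_number(counts):
    """Return (fruit, count, rank, row_number) tuples, highest count first."""
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    rows = []
    for position, (fruit, count) in enumerate(ordered, start=1):
        if rows and rows[-1][1] == count:
            rank = rows[-1][2]  # a tied count shares the rank of the previous row
        else:
            rank = position     # otherwise the rank equals the row position
        rows.append((fruit, count, rank, position))
    return rows


# Counts for the group "scrog" from the example data.
scrog = {"apple": 2, "orange": 2, "grape": 1, "plum": 1}
rows = rank_and_row_number(scrog)

by_rank = [fruit for fruit, _, rank, _ in rows if rank <= 3]
by_row_number = [fruit for fruit, _, _, row in rows if row <= 3]
print(by_rank)
print(by_row_number)
```

Here the rank-based filter keeps all four fruits, because grape and plum share rank 3, while the row-number filter keeps exactly three.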

Please credit the source when reposting. Link to this article: https://www.uj5u.com/yidong/441067.html

Tags: sorting pyspark grouping rdd counting
