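I'm new to Spark and am trying to count the frequency of each letter in a list of names, then rank the top 10 letters. I'm having trouble building the tuples. Can anyone help?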
rdd_1 = sc.parallelize(['Scott', 'Steven', 'Sara', 'Mike', 'Mary', 'Joe', 'Jake'])
letters = rdd_1.flatMap(lambda x: list(x.lower()))
letters.collect()
The output of letters.collect() is:
['s', 'c', 'o', 't', 't', 's', 't', 'e', 'v', 'e', 'n', 's', 'a', 'r', 'a', 'm', 'i', 'k', 'e', 'm', 'a', 'r', 'y', 'j', 'o', 'e', 'j', 'a', 'k', 'e']
instances1 = letters.map(lambda letr: (letr, 1))
aggCounts1 = instances1.reduceByKey(lambda x, y: x + y)
aggCounts1.collect()
The output of aggCounts1.collect() is:
[('s', 3), ('r', 2), ('i', 1), ('y', 1), ('e', 5), ('a', 4), ('m', 2), ('j', 2), ('t', 3), ('n', 1), ('k', 2), ('c', 1), ('o', 2), ('v', 1)]
I want to find the top 10 letters and then rank them:
topWords = aggCounts1.top(10, key=lambda x: x[1])
topWords[:3]
The first 3 entries: [('e', 5), ('a', 4), ('s', 3)]
topTen = sc.parallelize(range(10))
This is what I tried in order to build the tuple result:
# this is incorrect syntax
result = topTen.map (lambda ltrs,nums: ltrs for ltrs in topWords and nums in topTen (topWords[0], topTen) )
I'm trying to get something like this:
[('e', 0), ('a', 1), ('s', 2), ('t', 3), ('r', 4), ('m', 5), ('j', 6), ('k', 7), ('o', 8), ('i', 9)]
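Answer:
You can use zipWithIndex as the last step and then map the result accordingly.
See https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.zipWithIndex.html.
zipWithIndex is a narrow transformation here, so it does not shuffle the data.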
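To make the shape of the data concrete, here is a minimal plain-Python sketch of what zipWithIndex followed by map does to each element. It uses enumerate as a local stand-in and does not call Spark at all; the three-item `sorted_counts` list is an assumed sample taken from the question's output.

```python
# Plain-Python stand-in for RDD.zipWithIndex (no Spark required).
# sorted_counts is an assumed sample: (letter, count) pairs already sorted.
sorted_counts = [('e', 5), ('a', 4), ('s', 3)]

# zipWithIndex pairs every element with its position: (element, index).
zipped = [(pair, idx) for idx, pair in enumerate(sorted_counts)]
# Each item now looks like (('e', 5), 0).

# The map step keeps the letter and the index and drops the count.
ranked = [(item[0][0], item[1]) for item in zipped]
print(ranked)  # [('e', 0), ('a', 1), ('s', 2)]
```

This is why the map lambda has to reach into x[0][0] for the letter: after zipWithIndex, the original (letter, count) tuple sits in position 0 and the index sits in position 1.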
Update
Here is the correct way to do it, given that you have a list.
Full code:
%python
rdd_1 = sc.parallelize(['Scott', 'Steven', 'Sara', 'Mike', 'Mary', 'Joe', 'Jake'])
letters = rdd_1.flatMap(lambda x: list(x.lower()))
letters.collect()
instances1 = letters.map(lambda letr: (letr, 1))
aggCounts1 = instances1.reduceByKey(lambda x, y: x + y)
# Sort by count (descending), then by letter; attach a 0-based index; keep (letter, index).
topWords2 = aggCounts1.sortBy(lambda x: (-x[1], x[0])).zipWithIndex().map(lambda x: (x[0][0], x[1]))
topWords2.take(20)
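For a small input like this one, the result can be cross-checked without a cluster. The following is a rough plain-Python sketch of the same logic, using collections.Counter in place of reduceByKey and enumerate in place of zipWithIndex; the helper name `rank_letters` and its parameter `n` are made up for illustration and are not part of Spark.

```python
from collections import Counter

names = ['Scott', 'Steven', 'Sara', 'Mike', 'Mary', 'Joe', 'Jake']

def rank_letters(names, n=10):
    """Hypothetical helper: rank the n most frequent letters, 0-based."""
    # Count every lower-cased letter across all names.
    counts = Counter(ch for name in names for ch in name.lower())
    # Same ordering as the sortBy key: count descending, then letter ascending.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Pair each of the first n letters with its rank.
    return [(letter, rank) for rank, (letter, _) in enumerate(ordered[:n])]

print(rank_letters(names))
# [('e', 0), ('a', 1), ('s', 2), ('t', 3), ('j', 4), ('k', 5), ('m', 6), ('o', 7), ('r', 8), ('c', 9)]
```

Because ties are broken alphabetically by the sort key, letters with equal counts come out in a different order than in the sample the question asked for, which did not specify a tie-breaking rule.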
