Calculating the relative frequency of bigrams in PySpark

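I am trying to count word pairs in a text file. First I do some preprocessing on the text, then I count the word pairs, which gives results like this: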

((Aspire, to), 1) ; ((to, inspire), 4) ; ((inspire, before), 38)...

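Now I want to report the 1000 most frequent pairs, sorted by:

the word (the second word of the pair)
the relative frequency (pair count / total occurrences of the second word)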

Here is what I have done so far:

from pyspark.sql import SparkSession
import re

spark = SparkSession.builder.appName("Bigram occurences and relative frequencies").master("local[*]").getOrCreate()
sc = spark.sparkContext
text = sc.textFile("big.txt")

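# Lowercase each line and split it on runs of whitespace or punctuation
tokens = text.map(lambda x: x.lower()).map(lambda x: re.split(r"[\s,.;:!?]+", x))
# Pair each token with its successor, then count how often each pair occurs
pairs = tokens.flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:]))).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)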
frame = pairs.toDF(['pair', 'count'])

# Dataframe ordered by the most frequent pair to the least
most_frequent = frame.sort(frame['count'].desc())
# For each row, trying to add a column with the relative frequency, but I'm getting an error
with_rf = frame.withColumn("rf", frame['count'] / (frame.pair._2.sum()))

I think I am fairly close to the result I want, but I can't figure it out. I am new to Spark and DataFrames in general. I also tried:

import pyspark.sql.functions as F
frame.groupBy(frame['pair._2']).agg((F.col('count') / F.sum('count')).alias('rf')).show()

Any help would be greatly appreciated.

Edit: here is a sample of the frame DataFrame:

+--------------------+-----+
|                pair|count|
+--------------------+-----+
|{project, gutenberg}|   69|
|  {gutenberg, ebook}|   14|
|         {ebook, of}|    5|
|    {adventures, of}|    6|
|           {by, sir}|   12|
|     {conan, doyle)}|    1|
|     {changing, all}|    2|
|         {all, over}|   24|
+--------------------+-----+

root
 |-- pair: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)
 |-- count: long (nullable = true)

Answer from a uj5u.com user:

The relative frequency can be computed with a window function: partition by the second word of the pair and apply a sum over each partition.

Then we limit the entries in df to the top x by count, and finally order by the second word of the pair and by the relative frequency.
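To make the two steps above concrete, here is a minimal plain-Python sketch of the same computation on the sample rows from the question. It is only an illustration of the arithmetic, not Spark code: the names pair_counts, second_word_totals and top_pairs are hypothetical, and the descending sort on the second word is an assumption chosen to mirror the ordering used in the PySpark code that follows.

```python
from collections import defaultdict

# Bigram counts copied from the example frame in the question.
pair_counts = {
    ("project", "gutenberg"): 69,
    ("gutenberg", "ebook"): 14,
    ("ebook", "of"): 5,
    ("adventures", "of"): 6,
    ("by", "sir"): 12,
    ("conan", "doyle"): 1,
    ("changing", "all"): 2,
    ("all", "over"): 24,
}

# Plain-Python counterpart of sum("count") over a window
# partitioned by the second word of the pair.
second_word_totals = defaultdict(int)
for (_, second), count in pair_counts.items():
    second_word_totals[second] += count

# Relative frequency = pair count / summed count of all pairs
# that share the same second word.
relative_freq = {
    pair: count / second_word_totals[pair[1]]
    for pair, count in pair_counts.items()
}


def top_pairs(n):
    """Keep the n most frequent pairs, then sort them by second word
    and relative frequency, both descending (hypothetical helper)."""
    top = sorted(pair_counts, key=pair_counts.get, reverse=True)[:n]
    return sorted(top, key=lambda p: (p[1], relative_freq[p]), reverse=True)


for pair in top_pairs(3):
    print(pair, pair_counts[pair], relative_freq[pair])
```

Because the totals are computed before the top-n cut, the denominators still cover every pair, including the ones that are later dropped. In this sample, "of" is the only second word shared by two pairs, so those two rows get 5/11 and 6/11 while every other row gets 1.0.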

from pyspark.sql import functions as F
from pyspark.sql import Window as W

data = [(("project", "gutenberg"), 69,),
        (("gutenberg", "ebook"), 14,),
        (("ebook", "of"), 5,),
        (("adventures", "of"), 6,),
        (("by", "sir"), 12,),
        (("conan", "doyle"), 1,),
        (("changing", "all"), 2,),
        (("all", "over"), 24,), ]

df = spark.createDataFrame(data, ("pair", "count", ))

ws = W.partitionBy(F.col("pair")._2).rowsBetween(W.unboundedPreceding, W.unboundedFollowing)

(df.withColumn("relative_freq", F.col("count") / F.sum("count").over(ws))
   .orderBy(F.col("count").desc())
   .limit(3) # change here to select top 1000
   .orderBy(F.col("pair")._2.desc(), F.col("relative_freq").desc())
).show()

"""
+--------------------+-----+-------------+
|                pair|count|relative_freq|
+--------------------+-----+-------------+
|         {all, over}|   24|          1.0|
|{project, gutenberg}|   69|          1.0|
|  {gutenberg, ebook}|   14|          1.0|
+--------------------+-----+-------------+
"""
