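I have the following code to clean a corpus of documents (pipelineClean(corpus)), which returns a DataFrame with two columns:
- "id": Long
- "tokens": Array[String]
After that, the code produces a DataFrame with the following columns:
- "term": String
- "postingList": List[Array[Long, Long]] (the first Long is the document id, the second is the term frequency)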
pipelineClean(corpus)
.select($"id" as "documentId", explode($"tokens") as "term") // explode creates a new row for each element in the given array column
.groupBy("term", "documentId").count //group by and then count number of rows per group, returning a df with groupings and the counting
.where($"term" =!= "") // seems like there are some tokens that are empty, even though Tokenizer should remove them
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
.groupBy("term").agg(collect_list($"posting") as "postingList") // we do another grouping in order to collect the postings into a list
.orderBy("term")
.persist(StorageLevel.MEMORY_ONLY_SER)
My question is: is it possible to make this code shorter and/or more efficient? For example, can the grouping be done within a single groupBy?
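To make the target shape concrete, here is a minimal sketch of the same computation on plain Scala collections, without Spark. The object name PostingListShape, the helper postings, and the sample documents are hypothetical and only serve as illustration. Sorting the postings by document id is an addition so that the output is deterministic; the Spark pipeline above does not sort them.

```scala
object PostingListShape {
  // Hypothetical helper: turns (id, tokens) pairs into (term, postings) pairs,
  // where each posting is (documentId, termFrequency).
  def postings(docs: Seq[(Long, Seq[String])]): Seq[(String, Seq[(Long, Long)])] =
    docs
      .flatMap { case (id, tokens) => tokens.map(term => (term, id)) } // one pair per token, like explode
      .filter { case (term, _) => term.nonEmpty }                      // drop empty tokens
      .groupBy { case (term, _) => term }                              // outer grouping by term
      .map { case (term, pairs) =>
        val perDoc = pairs
          .groupBy { case (_, id) => id }                              // inner grouping by document id
          .map { case (id, hits) => (id, hits.size.toLong) }           // term frequency per document
          .toSeq
          .sortBy(_._1)                                                // sorted only for deterministic output
        (term, perDoc)
      }
      .toSeq
      .sortBy(_._1)                                                    // like orderBy("term")

  def main(args: Array[String]): Unit = {
    // Made-up sample data standing in for the cleaned corpus.
    val docs = Seq(1L -> Seq("spark", "scala", "spark"), 2L -> Seq("scala", ""))
    postings(docs).foreach(println)
  }
}
```

Even in this form there are two levels of grouping: one by term and one by document id within each term.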
A uj5u.com user replied:
Apart from skipping the withColumn call and using a direct select, there does not seem to be much more you can do:
.select(col("term"), struct(col("documentId"), col("count")) as "posting")
instead of
.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
.select("term", "posting")
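Put together, the pipeline with the direct select might look like the sketch below. It assumes Spark SQL is on the classpath and runs against a local SparkSession. The object name PostingListSketch, the function buildIndex, and the small sample DataFrame are hypothetical: the sample stands in for the output of pipelineClean(corpus), which is not shown in the question. The persist call is left out here and can be added by the caller as in the original.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, explode, struct}

object PostingListSketch {
  // Hypothetical wrapper: expects a DataFrame with columns "id" (Long) and "tokens" (Array[String]).
  def buildIndex(cleaned: DataFrame): DataFrame =
    cleaned
      .select(col("id") as "documentId", explode(col("tokens")) as "term") // one row per token
      .groupBy("term", "documentId").count()                               // term frequency per document
      .where(col("term") =!= "")                                           // drop empty tokens
      .select(col("term"), struct(col("documentId"), col("count")) as "posting") // replaces withColumn + select
      .groupBy("term").agg(collect_list(col("posting")) as "postingList")  // collect postings per term
      .orderBy("term")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("posting-list-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for pipelineClean(corpus).
    val cleaned = Seq(
      (1L, Seq("spark", "scala", "spark")),
      (2L, Seq("scala", ""))
    ).toDF("id", "tokens")

    buildIndex(cleaned).show(truncate = false)
    spark.stop()
  }
}
```

The two groupBy calls remain, since the first computes the count per (term, documentId) pair and the second collects those counts per term.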
