了解包含map和reduce函式的sparksql自定義連接運算式-有解無憂

我是大資料的新手，正在瀏覽現有的代碼庫并試圖理解一段特定的代碼。我在理解用于連接兩個資料幀的連接運算式時感到震驚，其中 reduce 被用作運算式的一部分。這是包含連接運算式的代碼

 def joinOnMultipleColumns(leftDF: Dataset[Row], rightDF: Dataset[Row],
      leftColumns: List[String], rightColumns: List[String]
  ): DataFrame = {
   // Both leftColumns and rightColumns variables are of same length 
    val joinExpression = leftColumns
      .zip(rightColumns)
      .map { case (c1, c2) => col(c1) === col(c2) } 
      .reduce(_ && _)  // -----> what does the map and reduce part mean here

    rightDF.cache.show
    leftDF.join(rightDF, joinExpression)
  }

如果我需要提供任何進一步的資訊，請告訴我

根據我的假設，該函式接收兩個列串列 [t1_col1, t1_col2], [t2_col1, t2_col2] 以及兩個資料幀

zip 將導致 (t1_col1, t2_col1), (t1_col2, t2_col2)
map 和 reduce 組合將創建一個帶有 col1===col2 的連接運算式，但不確定到底發生了什么，我的假設也可能完全錯誤

有人可以幫助我理解代碼的實際作用嗎？

uj5u.com熱心網友回復：

Map 是一個高階函式，它負責通過將傳遞給 map 的函式應用于集合中的每個值來轉換某些集合中的值。讓我們深入研究您的代碼：

假設您有以下值：leftColumns = ["col1_1", "col1_2"] 和 rightColumn = ["col2_1", "col2_2"]
壓縮

leftColumns
.zip(rightColumns)

在這一步中，我們將兩個初始字串集合壓縮到一個元組集合中：[("col1_1", "col2_1"),("col1_2", "col2_2")]

地圖

.map { case (c1, c2) => col(c1) === col(c2) } //

正如我之前所說，使用 map 我們需要對集合中的每個元素應用一些函式。集合的元素是 (String,String) 的元組。函式是一個. 所以這意味著我們正在將 List[(String,String)] 轉換為 List[Column] （因為應用于列將在 scaladoc 中回傳 column ===）col(left)===col(right)===

最后我們會得到：[col("col1_1") === col("col2_1"), col("col1_2") === col("col2_2")]

減少

.reduce(_ && _)

Reduce 負責將值的集合折疊成一個值。在這種情況下，我們通過應用 &&(And 運算子將 List[Column] 折疊為 Column，如果我們將其應用于Scaladoc 中的 Column && ，它將回傳 Column

所以最后我們會得到這個：col("col1_1") === col("col2_1") && col("col1_2") === col("col2_2")這是加入 2 個資料幀的一組條件

轉載請註明出處，本文鏈接：https://www.uj5u.com/net/512164.html

標籤：数据框斯卡拉阿帕奇火花apache-spark-sql

上一篇：將IndexedSeq轉換為陣列

下一篇：無法決議符號“TestKit”