Checking for repeated values in a DataFrame and implementing an ignoreNulls parameter

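I created a function that checks whether a DataFrame contains repeated values, based on a Seq of columns.

I want to implement an "ignoreNulls" flag, passed to the function as a Boolean parameter.

If true, null values are ignored: they are neither grouped nor counted, so for null values "newColName" returns false.
If false (the default), null values are treated as a group, and true is returned when more than one row has null for the key I am checking.

I don't know how to do this. Should I use an if or a case? Is there an expression that makes the partitionBy statement ignore null values?

Can anyone help me?

This is the current function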

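import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, sum}
// The $"..." syntax additionally requires `import spark.implicits._`.

def checkRepeatedKey(newColName: String, keys: Seq[String])(dataframe: DataFrame): DataFrame = {
  // A key is repeated when more than one row falls into its partition.
  val repeatedCondition = $"sum" > 1
  val windowCondition   = Window.partitionBy(keys.head, keys.tail: _*)

  dataframe
    .withColumn("count", lit(1))
    .withColumn("sum", sum("count").over(windowCondition))
    .withColumn(newColName, repeatedCondition)
    .drop("count", "sum")
}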

Some test data

val testDF = Seq(
  ("1", Some("name-1")),
  ("2", Some("repeated-name")),
  ("3", Some("repeated-name")),
  ("4", Some("name-4")),
  ("5", None),
  ("6", None)
).toDF("name_key", "name")

Testing the function

val results = testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name")))

Output (without ignoreNulls implemented)

+--------+-------------+-----------------+
|name_key|         name|has_repeated_name|
+--------+-------------+-----------------+
|       1|       name-1|            false|
|       2|repeated-name|             true|
|       3|repeated-name|             true|
|       4|       name-4|            false|
|       5|         null|             true|
|       6|         null|             true|
+--------+-------------+-----------------+

With ignoreNulls=true implemented, it should look like this

// function header with the ignoreNulls parameter (body still to be implemented)
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = ???

// using the function, passing true for ignoreNulls
testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), true))

// expected output for nulls
+--------+-------------+-----------------+
|       5|         null|            false|
|       6|         null|            false|
+--------+-------------+-----------------+

Answer from a uj5u.com user:

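First, you should define the logic precisely for the case where only some of the keys columns are null: should that count as a null value, or is a value null only when all of the keys columns are null?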
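The two interpretations can be written as Column predicates. The following is only a minimal sketch: the helper names anyKeyIsNull and allKeysAreNull are hypothetical (they do not appear in the question), and keys is assumed to be non-empty.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Hypothetical helpers, not part of the original question.
// Both assume that `keys` contains at least one column name.

// True when at least one of the key columns is null.
def anyKeyIsNull(keys: Seq[String]): Column =
  keys.map(k => col(k).isNull).reduce(_ || _)

// True only when every key column is null.
def allKeysAreNull(keys: Seq[String]): Column =
  keys.map(k => col(k).isNull).reduce(_ && _)
```

With a single key column both predicates behave identically, so the choice only matters for composite keys.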

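For simplicity, let's assume keys contains only one column (you can easily extend the logic to multiple columns). You can add a simple if to your checkRepeatedKey function: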

def checkIfNullValue(keys: Seq[String]): Column = {
  // For simplicity, only the first key is checked.
  col(keys.head).isNull
}


def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = {
    ...
    ...

    val df = dataframe
      .withColumn("count", lit(1))
      .withColumn("sum", sum("count").over(windowCondition))
      .withColumn(newColName, repeatedCondition)
      .drop("count", "sum")

    // With ignoreNulls, rows whose key is null are forced to false;
    // all other rows keep the flag computed above.
    if (ignoreNulls)
        df.withColumn(newColName, when(checkIfNullValue(keys), lit(false)).otherwise(df(newColName)))
    else df
}
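For reference, the same idea can be put together into one self-contained function. This is a sketch under stated assumptions rather than a definitive implementation: it counts rows per partition with count instead of the temporary "count" and "sum" columns, treats a key as null when any of its columns is null, and gives ignoreNulls a default of false as described in the question.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, when}

def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean = false)(
    dataframe: DataFrame): DataFrame = {
  // Rows with identical key values share a partition; null keys form their own partition.
  val window = Window.partitionBy(keys.map(k => col(k)): _*)
  // A key is repeated when its partition holds more than one row.
  val repeated: Column = count(lit(1)).over(window) > 1
  // Assumption: a key counts as null if ANY of its columns is null.
  val keyIsNull: Column = keys.map(k => col(k).isNull).reduce(_ || _)

  // With ignoreNulls, null keys are always flagged false; otherwise they are counted like any group.
  val flag: Column =
    if (ignoreNulls) when(keyIsNull, lit(false)).otherwise(repeated)
    else repeated

  dataframe.withColumn(newColName, flag)
}
```

It would be called as in the question, for example testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), ignoreNulls = true)). Building the flag as a single expression avoids adding and dropping helper columns, which also means it cannot clash with existing columns named "count" or "sum".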
