Creating a summary of a Spark DataFrame

wBob adds: I am looking for an efficient way to do this in Scala, optionally with Spark SQL.

I have a Spark DataFrame that I am trying to summarise in order to find overly long columns:

// Set up test data (in a notebook or spark-shell, `spark` is already defined)
import spark.implicits._

// Look for long columns (length >= 3): row 1 is fine, row 2 is too long in col3, row 3 is too long in col2
val df = Seq(
    ( 1, "a", "bb", "cc", "file1" ),
    ( 2, "d", "ee", "fff", "file2" ),
    ( 3, "g", "hhhh", "ii", "file3" )
    ).toDF("rowId", "col1", "col2", "col3", "filename")

I can summarise the column lengths and find the overly long columns like this:

import org.apache.spark.sql.functions.{length, max}

// One aggregation per column: take each column's maximum length, then keep those >= 3
val df2 = df.columns
    .map(c => (c, df.agg(max(length(df(c)))).as[Int].first()))
    .toSeq.toDF("columnName", "maxLength")
    .filter($"maxLength" > 2)

If I try to add the existing filename column inside the map, I get an error:

val df2 = df.columns
    .map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
    .toSeq.toDF("fn", "columnName", "maxLength")
    .filter($"maxLength" > 2)

I have tried several variations of the $"filename" syntax. How can I incorporate the filename column into the summary?

columnName	maxLength	filename
col2	4	file3
col3	3	file2

The real DataFrame has more than 300 columns and millions of rows, so I cannot hard-code the column names.

A uj5u.com user replied:

@wBob does the following achieve your goal?

Group by filename and get the maximum length of each column:

    val cols = df.columns.dropRight(1) // to remove the filename col
    // Build "max(length(c)) as c" for every remaining column
    val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
    print(maxLength)
    df.createOrReplaceTempView("temp")
    val df1 = spark
      .sql(s"select filename, ${maxLength} from temp group by filename")
    df1.show()

Output:

+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
|   file1|    1|   1|   2|   2|
|   file2|    1|   1|   2|   3|
|   file3|    1|   1|   4|   2|
+--------+-----+----+----+----+
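
For comparison, the same grouping can probably be written with the DataFrame API instead of a generated SQL string. This is only a sketch: the local `SparkSession` and the name `perFile` are my own assumptions, not part of the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, max}

// Assumption: a local session for the sketch; in a notebook `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("group-max-length").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "a", "bb", "cc", "file1"),
  (2, "d", "ee", "fff", "file2"),
  (3, "g", "hhhh", "ii", "file3")
).toDF("rowId", "col1", "col2", "col3", "filename")

// Every column except the grouping key
val cols = df.columns.filterNot(_ == "filename")

// One max(length(c)) aggregate per column, keeping the column's name
val aggs = cols.map(c => max(length(col(c))).as(c))

// `perFile` is a hypothetical name for the per-file summary
val perFile = df.groupBy("filename").agg(aggs.head, aggs.tail: _*)
perFile.orderBy("filename").show()
```

Building `Column` objects avoids interpolating column names into SQL text, which may matter if the real column names contain spaces or other special characters.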

Use a subquery to get the maximum of each column, and combine the results with a union:

    df1.createOrReplaceTempView("temp2")
    // For each column, keep the file(s) whose length equals the overall maximum
    val res = cols.map(col => {
      spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
        s"where $col = (select max(${col}) from temp2)")
    }).reduce(_ union _)
    res.show()

Result:

+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
|     rowId|        1|   file1|
|     rowId|        1|   file2|
|     rowId|        1|   file3|
|      col1|        1|   file1|
|      col1|        1|   file2|
|      col1|        1|   file3|
|      col2|        4|   file3|
|      col3|        3|   file2|
+----------+---------+--------+

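Note that there are multiple entries for rowId and col1, because the maximum is not unique for those columns.

There may be a more elegant way to write this, but I am struggling to find one at the moment.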
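
One possibly tidier variant, offered as a sketch rather than a tested solution for the 300-column case: take `max` over a struct of (length, filename) for every column, so a single aggregation returns both the maximum length and a file where it occurs. This is a different technique from the subquery and union above. The names `wide`, `rows` and `summary` are mine, and when several rows tie on length this picks the one with the greatest filename, so you get one row per column rather than one per tie.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, length, max, struct}

// Assumption: a local session for the sketch; in a notebook `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("max-length-summary").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "a", "bb", "cc", "file1"),
  (2, "d", "ee", "fff", "file2"),
  (3, "g", "hhhh", "ii", "file3")
).toDF("rowId", "col1", "col2", "col3", "filename")

val cols = df.columns.filterNot(_ == "filename")

// Structs compare field by field, so max(struct(length, filename)) keeps
// the filename that goes with the largest length.
val aggs = cols.map { c =>
  max(struct(length(col(c)).as("maxLength"), col("filename").as("filename"))).as(c)
}

// A single pass over the data; the result is one row with one struct per column.
val wide: Row = df.agg(aggs.head, aggs.tail: _*).first()

// Unpack the single row on the driver, skipping columns that were entirely null.
val rows = cols.toSeq.flatMap { c =>
  Option(wide.getAs[Row](c))
    .filterNot(_.isNullAt(0))
    .map(s => (c, s.getInt(0), s.getString(1)))
}

val summary = rows.toDF("columnName", "maxLength", "filename").filter($"maxLength" > 2)
summary.show()
// For the test data this should leave (col2, 4, file3) and (col3, 3, file2).
```

The appeal of this shape is that it runs one Spark job in total instead of one query per column.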

A uj5u.com user replied:

Sorting the table by total text length may be sufficient. This can be done quickly and concisely.

import org.apache.spark.sql.functions.{col, concat, length}

df.select(
  col("*"),
  length( // take the length
    concat( // concatenate all the columns into one string
      df.columns.map(c => col(c)): _*
    )
  ).as("length")
)
.sort( // order by total length, longest first
  col("length").desc
).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
|    3|   g|hhhh|  ii|   file3|    13|
|    2|   d|  ee| fff|   file2|    12|
|    1|   a|  bb|  cc|   file1|    11|
+-----+----+----+----+--------+------+
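
One caveat with the `concat` approach above: `concat` returns null as soon as any input is null, so a row with a missing value gets no length at all. A hedged sketch of a null-tolerant variant using `concat_ws` follows. The two-row sample with an `Option` column and the names `asStrings` and `withLengths` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, concat_ws, length}

// Assumption: a local session for the sketch; in a notebook `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("total-length").getOrCreate()
import spark.implicits._

// Hypothetical sample data: row 2 has a null in col1
val df = Seq[(Int, Option[String], String)](
  (1, Some("a"), "bb"),
  (2, None, "cccc")
).toDF("rowId", "col1", "col2")

// Cast everything to string explicitly so both functions see the same input
val asStrings = df.columns.map(c => col(c).cast("string"))

val withLengths = df.select(
  col("rowId"),
  length(concat(asStrings: _*)).as("concatLength"),        // null if any column is null
  length(concat_ws("", asStrings: _*)).as("concatWsLength") // nulls are skipped
)
withLengths.orderBy("rowId").show()
```

For row 2, `concatLength` comes out null while `concatWsLength` is 5 (the lengths of "2" and "cccc" added together), so sorting on the `concat_ws` version keeps such rows in the ranking.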

A uj5u.com user replied:

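Sorting an array of structs sorts on the first field first and on the second field next. This works because we put the length of the string at the front of each struct. If you reorder the fields, you will get different results. You could easily keep more than one result per row if you wanted to, but I think finding one is probably enough.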

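import org.apache.spark.sql.functions.{array, col, length, lit, reverse, sort_array, struct}

df.select(
  col("*"),
  reverse( // flip to descending order, so the longest entry comes first
    sort_array( // sorts ascending by default
      array( // add all column lengths to an array, as (length, column name, value)
        (for( col_name <- df.columns ) yield struct(length(col(col_name)),lit(col_name),col(col_name).cast("String")) ).toSeq:_* )
    )
  )(0) // grab the row max
  .alias("rowMax") )
  .sort("rowMax").show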
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename|              rowMax|
+-----+----+----+----+--------+--------------------+
|    1|   a|  bb|  cc|   file1|[5, filename, file1]|
|    2|   d|  ee| fff|   file2|[5, filename, file2]|
|    3|   g|hhhh|  ii|   file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+
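
The field-by-field ordering this answer relies on can be checked on a tiny literal array. This is a minimal sketch with made-up values, using `sort_array(..., asc = false)` as a shorter way to write `reverse(sort_array(...))`. The names `demo` and `sorted` are mine.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{array, lit, sort_array, struct}

// Assumption: a local session for the sketch; in a notebook `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("struct-ordering").getOrCreate()

// Three (len, name) structs; two of them tie on len
val demo = spark.range(1).select(
  sort_array(
    array(
      struct(lit(2).as("len"), lit("b").as("name")),
      struct(lit(5).as("len"), lit("a").as("name")),
      struct(lit(2).as("len"), lit("a").as("name"))
    ),
    asc = false // descending: largest len first, ties broken by name
  ).as("sorted")
)

val sorted = demo.first().getSeq[Row](0).map(r => (r.getInt(0), r.getString(1)))
println(sorted)
// Expected order: (5,a), (2,b), (2,a)
```

The length decides the order first, and the name only matters for the two entries that tie at 2, which is why moving the length out of the first position would change the result.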

A uj5u.com user replied:

Pushing this a little further gives a better result.

import org.apache.spark.sql.functions.{array, col, explode, expr, length, lit, struct}

df.select(
 col("*"), 
 array( // make array of columns name/value/length
  (for{ col_name <- df.columns  } yield 
   struct(
    length(col(col_name)).as("length"),
    lit(col_name).as("col"),
    col(col_name).cast("String").as("col_value")
   )  
  ).toSeq:_* ).alias("rowInfo") 
 )
 .select(
  col("rowId"),
  explode( // explode array into rows
   expr("filter(rowInfo, x -> x.length >= 3)") // filter the array for the length you are interested in
  ).as("rowInfo") 
 )
 .select(
  col("rowId"),
  col("rowInfo.*") // turn struct fields into columns
 )
 .sort("length").show

+-----+------+--------+---------+
|rowId|length|     col|col_value|
+-----+------+--------+---------+
|    2|     3|    col3|      fff|
|    3|     4|    col2|     hhhh|
|    3|     5|filename|    file3|
|    1|     5|filename|    file1|
|    2|     5|filename|    file2|
+-----+------+--------+---------+
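
In the output above, filename itself is reported as a long column. If that is not wanted, a small variation might leave rowId and filename out of the inspected columns and carry filename through to the result, which is closer to the table asked for in the question. This is a sketch under that assumption. The names `dataCols` and `longCells` are mine, and the SQL `filter` higher-order function needs Spark 2.4 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, expr, length, lit, struct}

// Assumption: a local session for the sketch; in a notebook `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("long-cells").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "a", "bb", "cc", "file1"),
  (2, "d", "ee", "fff", "file2"),
  (3, "g", "hhhh", "ii", "file3")
).toDF("rowId", "col1", "col2", "col3", "filename")

// Only inspect the data columns, not the key or the filename
val dataCols = df.columns.filterNot(Set("rowId", "filename"))

// One (length, columnName) struct per inspected column
val rowInfo = array(dataCols.map { c =>
  struct(length(col(c)).as("length"), lit(c).as("columnName"))
}: _*)

val longCells = df
  .withColumn("rowInfo", rowInfo)
  .select(
    col("rowId"),
    col("filename"),
    // keep only entries with length >= 3, then turn each into its own row
    explode(expr("filter(rowInfo, x -> x.length >= 3)")).as("info")
  )
  .select(col("filename"), col("info.columnName"), col("info.length"), col("rowId"))
  .orderBy("rowId")

longCells.show()
// For the test data this should give (file2, col3, 3) and (file3, col2, 4).
```

Rows with no long values produce an empty array, and `explode` drops them, so the result contains only offending cells.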

Tags: scala apache-spark apache-spark-sql databricks azure-databricks