How do I create a new DataFrame based on an existing DataFrame?

I have a CSV file, dbname1.table1.csv:

| target             | source        | source_table                               | relation_type |
|--------------------|---------------|--------------------------------------------|---------------|
| avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | direct        |
| avg_ensure_sum_12m | protocol_dttm | custom_cib_ml_stg.p_overall_part_tend_cust | direct        |
| avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | indirect      |

The same table in CSV format:

target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect

I then read it into a DataFrame:

 val dfDL = spark.read.option("delimiter", ",")
                     .option("header", true)
                     .csv(file.getPath.toUri.getPath)

Now I need to create a new DataFrame based on dfDL.

The new DataFrame should have the following structure:

case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)

The field values of the new DataFrame are derived from the CSV file:

pseudocode:
schema_from = source_table.split("\\.")(0) // Example: custom_cib_ml_stg
table_from  = source_table.split("\\.")(1) // Example: p_overall_part_tend_cust
column_from = source                       // Example: inn_num
link_type   = relation_type                // Example: direct
schema_to   = "dbname1.table1.csv".split("\\.")(0) // Example: dbname1
table_to    = "dbname1.table1.csv".split("\\.")(1) // Example: table1
column_to   = target                               // Example: avg_ensure_sum_12m
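
One detail in the mapping above is easy to get wrong: `String.split` takes a regular expression, so an unescaped dot matches every character. The following is a minimal sketch in plain Scala (no Spark needed); the object name `SplitDemo` is only illustrative.

```scala
object SplitDemo {
  private val fileName = "dbname1.table1.csv"

  // "." is a regex that matches any character, so every character acts as a
  // delimiter; all pieces are empty and get dropped, leaving an empty array.
  val unescaped: Array[String] = fileName.split(".")

  // Escaping the dot splits on the literal character.
  val escaped: Array[String] = fileName.split("\\.")

  // Scala's Char overload of split also treats the dot literally.
  val byChar: Array[String] = fileName.split('.')
}
```

Indexing into `unescaped` with `(0)` would throw an `ArrayIndexOutOfBoundsException`, while `escaped(0)` and `escaped(1)` give the schema and table names.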

I need to create this new DataFrame, and I can't work it out on my own.

P.S. I need this DataFrame so that I can later create a JSON file from it. Example JSON:

[{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"inn_num",
"link_type":"direct",
"schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"
},
{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"protocol_dttm",
"link_type":"direct",
"schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"
}]

I don't like my current implementation:

def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {

    val arrTableName        = file.getPath.getName.split("\\.")
    val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))

    val dfDL = spark.read.option("delimiter", ",")
                         .option("header", true)
                         .csv(file.getPath.toUri.getPath)

    //val sourceTable = dfDL.select("source_table").collect().map(value => value.toString().split("."))

    dfDL.collect.map(row => DataLink(row.getString(2).split("\\.")(0),
                                     row.getString(2).split("\\.")(1),
                                     row.getString(1),
                                     row.getString(3),
                                     schemaTo,
                                     tableTo,
                                     row.getString(0)))
  }

  def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
    dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)

}

Answer from a uj5u.com user:

You can work with a Dataset directly.

import spark.implicits._

case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)

val filename = "dbname1.table1.csv"
val df = spark.read.option("header","true").csv("test.csv")
df.show(false)
+------------------+-------------+------------------------------------------+-------------+
|target            |source       |source_table                              |relation_type|
+------------------+-------------+------------------------------------------+-------------+
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|indirect     |
+------------------+-------------+------------------------------------------+-------------+

df.createOrReplaceTempView("table")

val df2 = spark.sql(s"""
select split(source_table, '[.]')[0] as schema_from
     , split(source_table, '[.]')[1] as table_from
     , source                        as column_from
     , relation_type                 as link_type
     , split('${filename}', '[.]')[0] as schema_to
     , split('${filename}', '[.]')[1] as table_to
     , target                        as column_to
  from table
""").as[DataLink]

df2.show()

+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|      schema_from|          table_from|  column_from|link_type|schema_to|table_to|         column_to|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|custom_cib_ml_stg|p_overall_part_te...|      inn_num|   direct|  dbname1|  table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|protocol_dttm|   direct|  dbname1|  table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|      inn_num| indirect|  dbname1|  table1|avg_ensure_sum_12m|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
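
Since the question mentions producing a JSON file later, here is a rough sketch of two ways that step might look once the Dataset exists. The object name `DataLinkJson`, its method names, and the `outputDir` parameter are assumptions made for illustration; only `write.json`, `toJSON`, and `collect` are Spark's own calls.

```scala
import org.apache.spark.sql.Dataset

object DataLinkJson {
  // Writes one JSON object per line (JSON Lines) as part files under outputDir.
  // Nothing is pulled back to the driver.
  def writeJsonLines(links: Dataset[_], outputDir: String): Unit =
    links.write.mode("overwrite").json(outputDir)

  // Builds a single JSON array string on the driver.
  // This collects every row, so it is only sensible for small results.
  def toJsonArray(links: Dataset[_]): String =
    links.toJSON.collect().mkString("[", ",", "]")
}
```

Note that `writeJsonLines` does not produce the bracketed array shown in the question; if that exact shape is required and the data is small, something like `DataLinkJson.toJsonArray(df2)` may be closer to what is wanted.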

Answer from a uj5u.com user:

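You definitely don't want to collect; that defeats the purpose of using Spark here. As always with Spark, you have plenty of options. You could use RDDs, but I don't see a need to switch between APIs here. You simply want to apply custom logic to some columns and end up with a DataFrame containing the resulting columns.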

First, define the function you want to apply as a UDF:

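def convert(target: String, source: String, source_table: String, relation_type: String): DataLink =
  DataLink(source_table.split("\\.")(0),         // schema_from
           source_table.split("\\.")(1),         // table_from
           source,                               // column_from
           relation_type,                        // link_type
           "dbname1.table1.csv".split("\\.")(0), // schema_to
           "dbname1.table1.csv".split("\\.")(1), // table_to
           target)                               // column_to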

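Then apply this function to all the relevant columns (make sure to wrap it in udf so that it becomes a Spark function rather than a plain Scala function) and select the result: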

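val convertUdf = udf(convert _) // needs: import org.apache.spark.sql.functions.udf
df.select(convertUdf($"target", $"source", $"source_table", $"relation_type").as("link")).select("link.*")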
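
For comparison, the same mapping can probably be expressed without a UDF, using only Spark's built-in column functions, which keeps the logic visible to the query optimizer. This is a sketch under assumptions: the object `DataLinkColumns` and the method `toDataLinks` are hypothetical names, `DataLink` is repeated only so the block is self-contained, and the file name is assumed to contain at least two dot-separated parts.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, lit, split}

case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)

object DataLinkColumns {
  // fileName is expected to look like "dbname1.table1.csv";
  // fewer than two dot-separated parts would fail with a MatchError.
  def toDataLinks(df: DataFrame, fileName: String): Dataset[DataLink] = {
    val spark = df.sparkSession
    import spark.implicits._ // provides the encoder for DataLink

    val Array(schemaTo, tableTo) = fileName.split("\\.").take(2)
    // split() takes a regex, so the dot is escaped here as well.
    val sourceParts = split(col("source_table"), "\\.")

    df.select(
        sourceParts.getItem(0).as("schema_from"),
        sourceParts.getItem(1).as("table_from"),
        col("source").as("column_from"),
        col("relation_type").as("link_type"),
        lit(schemaTo).as("schema_to"),
        lit(tableTo).as("table_to"),
        col("target").as("column_to"))
      .as[DataLink]
  }
}
```

With the DataFrame from the question this would be called as `DataLinkColumns.toDataLinks(dfDL, file.getPath.getName)`, and no `collect` is needed until the final result is actually consumed.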
