I have a CSV file, dbname1.table1.csv:
| target             | source        | source_table                               | relation_type |
|--------------------|---------------|--------------------------------------------|---------------|
| avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | direct        |
| avg_ensure_sum_12m | protocol_dttm | custom_cib_ml_stg.p_overall_part_tend_cust | direct        |
| avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | indirect      |
The same table in raw CSV form:
target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect
I then create a DataFrame by reading it:
val dfDL = spark.read.option("delimiter", ",")
.option("header", true)
.csv(file.getPath.toUri.getPath)
Now I need to create a new DataFrame based on dfDL.
The new DataFrame should have the following structure:
case class DataLink(schema_from: String,
table_from: String,
column_from: String,
link_type: String,
schema_to: String,
table_to: String,
column_to: String)
The field values of the new DataFrame are derived from the CSV file and its name:
Pseudocode:
schema_from = source_table.split(".")(0) // Example: custom_cib_ml_stg
table_from = source_table.split(".")(1) // Example: p_overall_part_tend_cust
column_from = source // Example: inn_num
link_type = relation_type // Example: direct
schema_to = "dbname1.table1.csv".split(".")(0) // Example: dbname1
table_to = "dbname1.table1.csv".split(".")(1) // Example: table1
column_to = target // Example: avg_ensure_sum_12m
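One pitfall in the pseudocode above: on the JVM, `String.split` takes a regular expression, so `split(".")` treats every character as a separator and returns an empty array. The following is a minimal sketch of the parsing step in plain Scala, assuming every name has the form `schema.table` (optionally followed by more dot-separated parts). The `SplitDemo` object and the `splitQualified` helper are hypothetical names used only for illustration.

```scala
object SplitDemo {
  // Hypothetical helper: splits "schema.table[.more]" on the literal dot.
  // Returns None when the input has fewer than two dot-separated parts.
  def splitQualified(name: String): Option[(String, String)] =
    name.split("\\.") match {
      case Array(schema, table, _*) => Some((schema, table))
      case _                        => None
    }

  def main(args: Array[String]): Unit = {
    // An unescaped "." is a regex wildcard: every character is a separator,
    // only empty strings remain, and split drops them all.
    println("dbname1.table1.csv".split(".").length) // 0

    // Escaping the dot gives the intended parts.
    println(splitQualified("custom_cib_ml_stg.p_overall_part_tend_cust"))
    println(splitQualified("dbname1.table1.csv"))
  }
}
```

Returning an `Option` instead of indexing with `(0)` and `(1)` avoids an `ArrayIndexOutOfBoundsException` on malformed names; whether that is worth the extra handling depends on how clean the input files are.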
I need to create this new DataFrame, but I can't manage it on my own.
P.S. I need this DataFrame so that I can later generate a JSON file from it. Sample JSON:
[{"schema_from":"custom_cib_ml_stg",
  "table_from":"p_overall_part_tend_cust",
  "column_from":"inn_num",
  "link_type":"direct",
  "schema_to":"dbname1",
  "table_to":"table1",
  "column_to":"avg_ensure_sum_12m"
 },
 {"schema_from":"custom_cib_ml_stg",
  "table_from":"p_overall_part_tend_cust",
  "column_from":"protocol_dttm",
  "link_type":"direct",
  "schema_to":"dbname1",
  "table_to":"table1",
  "column_to":"avg_ensure_sum_12m"
 }]
I don't like my current implementation:
def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {
  // The file name has the form <schema>.<table>.csv
  val arrTableName = file.getPath.getName.split("\\.")
  val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))
  val dfDL = spark.read.option("delimiter", ",")
    .option("header", true)
    .csv(file.getPath.toUri.getPath)
  //val sourceTable = dfDL.select("source_table").collect().map(value => value.toString().split("."))
  // collect pulls every row to the driver before the mapping runs
  dfDL.collect.map(row => DataLink(row.getString(2).split("\\.")(0),
    row.getString(2).split("\\.")(1),
    row.getString(1),
    row.getString(3),
    schemaTo,
    tableTo,
    row.getString(0)))
}
def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
  dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)
}
Answer from a uj5u.com user:
You can work with a Dataset directly.
import spark.implicits._
case class DataLink(schema_from: String,
table_from: String,
column_from: String,
link_type: String,
schema_to: String,
table_to: String,
column_to: String)
val filename = "dbname1.table1.csv"
val df = spark.read.option("header","true").csv("test.csv")
df.show(false)
+------------------+-------------+------------------------------------------+-------------+
|target            |source       |source_table                              |relation_type|
+------------------+-------------+------------------------------------------+-------------+
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|protocol_dttm|custom_cib_ml_stg.p_overall_part_tend_cust|direct       |
|avg_ensure_sum_12m|inn_num      |custom_cib_ml_stg.p_overall_part_tend_cust|indirect     |
+------------------+-------------+------------------------------------------+-------------+
df.createOrReplaceTempView("table")
val df2 = spark.sql(s"""
select split(source_table, '[.]')[0] as schema_from
, split(source_table, '[.]')[1] as table_from
, source as column_from
, relation_type as link_type
, split('${filename}', '[.]')[0] as schema_to
, split('${filename}', '[.]')[1] as table_to
, target as column_to
from table
""").as[DataLink]
df2.show()
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|      schema_from|          table_from|  column_from|link_type|schema_to|table_to|         column_to|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
|custom_cib_ml_stg|p_overall_part_te...|      inn_num|   direct|  dbname1|  table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|protocol_dttm|   direct|  dbname1|  table1|avg_ensure_sum_12m|
|custom_cib_ml_stg|p_overall_part_te...|      inn_num| indirect|  dbname1|  table1|avg_ensure_sum_12m|
+-----------------+--------------------+-------------+---------+---------+--------+------------------+
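The same projection can also be written with the DataFrame column API instead of an SQL string, which avoids interpolating the file name into the query. This is only a sketch under assumptions: the `DataLinkColumns` object and `toLinks` function are hypothetical names, the input is built in memory rather than read from a file, and the file name is assumed to have at least two dot-separated parts.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, lit, split}

case class DataLink(schema_from: String, table_from: String, column_from: String,
                    link_type: String, schema_to: String, table_to: String,
                    column_to: String)

object DataLinkColumns {
  // Hypothetical helper: derives DataLink rows from the CSV columns and file name.
  def toLinks(df: DataFrame, fileName: String): Dataset[DataLink] = {
    // Throws a MatchError if the file name has fewer than two parts.
    val Array(schemaTo, tableTo, _*) = fileName.split("\\.")
    // Spark's split also takes a regex, so the dot is escaped here too.
    val parts = split(col("source_table"), "\\.")
    df.select(
      parts.getItem(0).as("schema_from"),
      parts.getItem(1).as("table_from"),
      col("source").as("column_from"),
      col("relation_type").as("link_type"),
      lit(schemaTo).as("schema_to"),
      lit(tableTo).as("table_to"),
      col("target").as("column_to")
    ).as(Encoders.product[DataLink])
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("datalink-sketch").getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the CSV file from the question.
    val df = Seq(
      ("avg_ensure_sum_12m", "inn_num", "custom_cib_ml_stg.p_overall_part_tend_cust", "direct"),
      ("avg_ensure_sum_12m", "protocol_dttm", "custom_cib_ml_stg.p_overall_part_tend_cust", "direct"),
      ("avg_ensure_sum_12m", "inn_num", "custom_cib_ml_stg.p_overall_part_tend_cust", "indirect")
    ).toDF("target", "source", "source_table", "relation_type")

    val links = toLinks(df, "dbname1.table1.csv")

    // toJSON yields one JSON string per row; joining them builds a single array.
    println(links.toJSON.collect().mkString("[", ",", "]"))
    spark.stop()
  }
}
```

Collecting the JSON strings is only reasonable when the result is small, as with lineage metadata like this. For large outputs, `links.write.json(path)` writes one JSON object per line without moving the data to the driver.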
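Answer from a uj5u.com user:
You definitely don't want to collect; that defeats the purpose of using Spark here. As always with Spark, you have plenty of options. You could use RDDs, but I don't see any need to switch between APIs here. You just want to apply custom logic to certain columns and end up with a DataFrame containing the resulting columns.
First, define the function you want to apply as a UDF: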
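def convert(target: String, source: String, source_table: String, relation_type: String): DataLink =
  DataLink(source_table.split("\\.")(0),
    source_table.split("\\.")(1),
    source,
    relation_type,
    // split takes a regex, so the dot must be escaped
    "dbname1.table1.csv".split("\\.")(0),
    "dbname1.table1.csv".split("\\.")(1),
    target)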
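Then apply this function to all the relevant columns (make sure to wrap it in udf so that it becomes a Spark function rather than a plain Scala function) and select the result:
import org.apache.spark.sql.functions.udf
df.select(udf(convert _)($"target", $"source", $"source_table", $"relation_type"))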
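A UDF that returns a case class produces a single struct column, so one more step is needed to get the seven flat columns the question asks for. The sketch below shows one way to do that, assuming no `source_table` value is null and each contains a dot. The `DataLinkUdf` object, the `toLinks` function, and the `link` alias are hypothetical names; `DataLink` is the case class from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class DataLink(schema_from: String, table_from: String, column_from: String,
                    link_type: String, schema_to: String, table_to: String,
                    column_to: String)

object DataLinkUdf {
  // Hypothetical helper: applies the conversion as a UDF, then flattens the struct.
  def toLinks(df: DataFrame, schemaTo: String, tableTo: String): DataFrame = {
    // schemaTo and tableTo are captured by the closure and shipped to executors.
    val convert = udf { (target: String, source: String, sourceTable: String, relationType: String) =>
      val parts = sourceTable.split("\\.") // assumes "schema.table"
      DataLink(parts(0), parts(1), source, relationType, schemaTo, tableTo, target)
    }
    df.select(convert(col("target"), col("source"), col("source_table"), col("relation_type")).as("link"))
      .select("link.*") // expand the struct into one column per DataLink field
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("datalink-udf-sketch").getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the CSV file from the question.
    val df = Seq(
      ("avg_ensure_sum_12m", "inn_num", "custom_cib_ml_stg.p_overall_part_tend_cust", "direct"),
      ("avg_ensure_sum_12m", "protocol_dttm", "custom_cib_ml_stg.p_overall_part_tend_cust", "direct")
    ).toDF("target", "source", "source_table", "relation_type")

    toLinks(df, "dbname1", "table1").show(false)
    spark.stop()
  }
}
```

Spark cannot optimize through a UDF, so for logic this simple the built-in `split` function is usually the better fit. A UDF becomes more attractive when the conversion logic grows beyond what column expressions express comfortably.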
