Reading a CSV into a DataFrame when records have more values than the header has columns

How can I read a CSV file with the following structure in Spark:

id,name,address
1,"ashu","building","street","area","city","state","pin"

When I use the reader:

val df = spark.read.option("header",true).csv("input/input1.csv")

I only get the record up to the third value in the CSV.

+---+----+--------+
| id|name| address|
+---+----+--------+
|  1|ashu|building|
+---+----+--------+

How can I ask Spark to read every value from the third one through the last one into a single DataFrame column, address, like this:

+---+----+-----------------------------------------------+
| id|name| address                                       |
+---+----+-----------------------------------------------+
|  1|ashu|"building","street","area","city","state","pin"|
+---+----+-----------------------------------------------+

Reply from a uj5u.com user:

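I am keeping my answer within your requirement of using CSV. This is the least painful way to do what you want.

Modify your CSV file so that it uses "|" instead of "," as the field separator. This lets you keep "," inside a column.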

id|name|address
1|"ashu"|"building","street","area","city","state","pin"

Modify your code:

val df = spark.read
      .option("header", true)
      .option("delimiter", "|") // the delimiter must be passed as a String, not a Char
      .csv("input/input1.csv")
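If the file cannot be edited by hand, the conversion to "|" could be scripted. The following is only a minimal sketch in plain Scala, assuming that neither of the first two fields contains a comma; `PipeDelimiter` and `toPipeDelimited` are hypothetical names introduced here for illustration. It relies on `String.split` with a limit, so only the first two commas of each line are replaced.

```scala
object PipeDelimiter {
  // Split into at most `fieldCount` pieces, so every comma after the
  // (fieldCount - 1)th one stays inside the last piece, then rejoin with '|'.
  // Assumption: the leading fields themselves contain no commas.
  def toPipeDelimited(line: String, fieldCount: Int = 3): String =
    line.split(",", fieldCount).mkString("|")
}
```

Applied to the header line and to each data line, this yields a file in the shape shown above, which can then be read with the "|" delimiter.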

Reply from a uj5u.com user:

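If you can fix the input file to use a different delimiter, you should do that.

If that is not possible, you can still read the file without the header and specify a custom schema. Then concatenate the 6 address columns to get the desired DataFrame: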

import org.apache.spark.sql.functions.{col, concat_ws}
import org.apache.spark.sql.types._

val schema = StructType(
  Array(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("address1", StringType, true),
    StructField("address2", StringType, true),
    StructField("address3", StringType, true),
    StructField("address4", StringType, true),
    StructField("address5", StringType, true),
    StructField("address6", StringType, true)
  )
)

val input = spark.read.schema(schema).csv("input/input1.csv")

val df = input.filter("name != 'name'").withColumn(
  "address",
  concat_ws(", ", (1 to 6).map(n => col(s"address$n")):_*)
).select("id", "name", "address")

df.show(false)

//+---+----+----------------------------------------+
//|id |name|address                                 |
//+---+----+----------------------------------------+
//|1  |ashu|building, street, area, city, state, pin|
//+---+----+----------------------------------------+
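The schema above fixes the number of address parts at six. If that number varies from record to record, one alternative is to parse each line as plain text. The following is a rough sketch only, not part of the original answer: `Record` and `RaggedCsv` are hypothetical names, and it assumes that no field contains a comma inside its quotes.

```scala
import scala.util.Try

// Hypothetical record type for illustration.
final case class Record(id: Int, name: String, address: String)

object RaggedCsv {
  // Strip surrounding whitespace and one pair of enclosing double quotes.
  private def unquote(s: String): String =
    s.trim.stripPrefix("\"").stripSuffix("\"")

  // Returns None for lines with fewer than three fields or with a
  // non-numeric id, which also skips the header line.
  def parseLine(line: String): Option[Record] = {
    val parts = line.split(",", -1).map(unquote)
    if (parts.length < 3) None
    else Try(parts(0).toInt).toOption.map { id =>
      // Everything from the third field onward becomes the address.
      Record(id, parts(1), parts.drop(2).mkString(", "))
    }
  }
}
```

In Spark, a function like this could be applied to each line obtained from `spark.read.textFile`, at the cost of giving up the CSV reader's own quote handling.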

