Spark ORC reader reads a string as a decimal value

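I am reading an ORC file that contains the following data:

| C1 | C2 |
| 1 | 1954E7 |

Column C1 should be an int and C2 should be a string, but Spark interprets C2 as a decimal. I tried the following code to work around it: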

spark.read.option("inferSchema","false").option("header", "true").orc("path to file")

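Even with schema inference turned off, the Spark ORC reader still reads the data using a schema. Is there a way to force Spark not to read the schema, and then apply my own custom schema after reading?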

A uj5u.com user replied:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumes `spark` is an existing SparkSession (for example, in spark-shell)
import spark.implicits._

When C2 is a string:

val pathORC =
  "<path>/source.orc"

case class O(C1: Int, C2: String)
val source = Seq(O(1, "1954E7")).toDF()

source.printSchema()
//    root
//    |-- C1: integer (nullable = false)
//    |-- C2: string (nullable = true)

source.show(false)
//    +---+------+
//    |C1 |C2    |
//    +---+------+
//    |1  |1954E7|
//    +---+------+

source.write.mode("overwrite").orc(pathORC)
val res = spark.read.orc(pathORC)
res.printSchema()
//    root
//    |-- C1: integer (nullable = true)
//    |-- C2: string (nullable = true)

res.show(false)
//    +---+------+
//    |C1 |C2    |
//    +---+------+
//    |1  |1954E7|
//    +---+------+
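
The round trip above keeps C2 as a string because an ORC file stores its schema alongside the data; `inferSchema` and `header` are CSV reader options and do not decide the column types for the ORC reader. As a minimal sketch, assuming a local SparkSession with native ORC support and a hypothetical temporary path, a schema can also be passed explicitly with `DataFrameReader.schema`. This is only shown for a file whose stored types already match the requested ones. If the file really stores C2 as a decimal or double, how a mismatched schema is handled may depend on the Spark version, and casting the column after the read is probably the safer route.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object OrcExplicitSchemaSketch {

  // The schema we expect the file to have (matches what is written below).
  val expected: StructType = StructType(Seq(
    StructField("C1", IntegerType, nullable = true),
    StructField("C2", StringType, nullable = true)
  ))

  // Writes one sample row as ORC, then reads it back with an explicit schema.
  def writeAndRead(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._
    Seq((1, "1954E7")).toDF("C1", "C2").write.mode("overwrite").orc(path)
    spark.read.schema(expected).orc(path)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("orc-explicit-schema-sketch")
      .getOrCreate()

    // Hypothetical scratch location, not the path from the question.
    val path = Files.createTempDirectory("orc-sketch").resolve("source.orc").toString

    val df = writeAndRead(spark, path)
    df.printSchema()
    df.show(false)

    spark.stop()
  }
}
```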

When C2 is a double:

val pathORC1 =
  "<path>/source1.orc"

val source1 = Seq((1, 1954e7)).toDF("C1", "C2")
source1.printSchema()
//    root
//    |-- C1: integer (nullable = false)
//    |-- C2: double (nullable = false)

source1.show(false)
//    +---+--------+
//    |C1 |C2      |
//    +---+--------+
//    |1  |1.954E10|
//    +---+--------+

source1.write.mode("overwrite").orc(pathORC1)
val res1 = spark.read.orc(pathORC1)
res1.printSchema()
//    root
//    |-- C1: integer (nullable = true)
//    |-- C2: double (nullable = true)

res1.show(false)
//    +---+--------+
//    |C1 |C2      |
//    +---+--------+
//    |1  |1.954E10|
//    +---+--------+

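// Render the double as text and drop the decimal point: 1.954E10 -> "1954E10".
// Note that this does not restore the original text "1954E7".
val dToStr = udf((v: Double) => v.toString.replace(".", ""))
val res2 = res1
  .withColumn("C2", dToStr(col("C2")))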

res2.printSchema()
//    root
//    |-- C1: integer (nullable = true)
//    |-- C2: string (nullable = true)

res2.show(false)
//    +---+-------+
//    |C1 |C2     |
//    +---+-------+
//    |1  |1954E10|
//    +---+-------+
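
The UDF above works on the text form of the double, which is why the result is 1954E10 rather than the original 1954E7: once the value has been stored as a number, its original spelling is not kept. The plain Scala sketch below needs no Spark. `toPlain` is a hypothetical helper, not part of the original answer. It shows the same string manipulation next to a non-scientific rendering via `java.math.BigDecimal`.

```scala
object DoubleToStringSketch {

  // Same transformation as the dToStr UDF, written as a plain function.
  def dToStr(v: Double): String = v.toString.replace(".", "")

  // Hypothetical alternative: render the value without scientific notation.
  def toPlain(v: Double): String =
    java.math.BigDecimal.valueOf(v).toPlainString

  def main(args: Array[String]): Unit = {
    val v = 1954e7
    println(v.toString) // 1.954E10
    println(dToStr(v))  // 1954E10
    println(toPlain(v)) // 19540000000
  }
}
```

If any string representation of the column is enough, Spark's built-in `col("C2").cast("string")` may be sufficient and avoids a UDF.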


Tags: scala, apache-spark, apache-spark-sql
