I have the following DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The last row carries a non-numeric id ("Rose "), matching the output shown below.
val simpleData = Seq(
  Row("James ", "", "Smith", "36636", "M", 3000),
  Row("Michael ", "Rose", "", "40288", "M", 4000),
  Row("Robert ", "", "Williams", "42114", "M", 4000),
  Row("Maria ", "Anne", "Jones", "39192", "F", 4000),
  Row("Jen", "Mary", "Brown", "Rose ", "F", -1)
)
val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData), simpleSchema)
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|Rose |     F|    -1|
+---------+----------+--------+-----+------+------+
I am running the sample code below; I want to cast the string column id to an integer.
df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")
+-----+
|   id|
+-----+
|36636|
|40288|
|42114|
|39192|
| null|
+-----+
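All the integer values are converted correctly here, but "Rose" is converted to null.
How can I make Spark throw an exception whenever there is a bad record? Is there a Spark configuration setting for this?
Also, if the query contains multiple casts, how can I get the exact name of the column where the problem occurred?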
A uj5u.com user replied:
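Since Spark 3.0, thanks to the fix in ticket SPARK-30292, setting the spark.sql.ansi.enabled config to true makes Spark throw an exception when you try to cast an invalid string to a number: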
spark.conf.set("spark.sql.ansi.enabled", "true")
df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")
Once the query is actually executed (for example by calling df2.show()), this throws a NumberFormatException. For more details, see https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast.
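As a rough sketch of how this can be observed and handled: Spark evaluates lazily, so the failure only surfaces when an action runs, and the NumberFormatException may arrive wrapped in another exception. The snippet below assumes a local SparkSession and a one-column stand-in for the EMP view; the names AnsiCastCheck and rootCause are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.{Failure, Success, Try}

object AnsiCastCheck {
  // Hypothetical helper: walk the cause chain down to the innermost exception.
  @annotation.tailrec
  def rootCause(t: Throwable): Throwable =
    if (t.getCause == null || t.getCause == t) t else rootCause(t.getCause)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch self-contained.
    val spark = SparkSession.builder().master("local[1]").appName("ansi-cast").getOrCreate()
    import spark.implicits._

    spark.conf.set("spark.sql.ansi.enabled", "true")
    Seq("36636", "40288", "Rose").toDF("id").createOrReplaceTempView("EMP")

    // Building the query does not run the cast yet.
    val df2 = spark.sql("select cast(id as INT) from EMP")

    // The action (collect) is where an invalid value makes the query fail.
    Try(df2.collect()) match {
      case Success(rows) => rows.foreach(println)
      case Failure(e) =>
        val cause = rootCause(e)
        println(s"Cast failed: ${cause.getClass.getName}: ${cause.getMessage}")
    }
    spark.stop()
  }
}
```

Note that ANSI mode changes other SQL behaviour as well (arithmetic overflow, for instance), so it is worth checking the linked page before enabling it for a whole job.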
A uj5u.com user replied:
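In non-ANSI mode, Spark does not throw when a cast fails; it returns null instead.
As a custom way to catch these errors, you can write a UDF that throws whenever the cast would produce null. However, this will reduce the performance of your script, because Spark cannot optimize UDF execution.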
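A minimal sketch of such a UDF follows, under a few assumptions: the target type is INT, null inputs should stay null, and surrounding whitespace is trimmed before parsing. The names StrictCast, parseOrFail and strictInt are hypothetical. Because the column name is passed in when the UDF is built, the error message can say which column failed, which may also help when a query contains several casts.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import scala.util.{Failure, Success, Try}

object StrictCast {
  // Parse a string as Int; null stays null (None), anything unparsable throws
  // an exception whose message names the column. Hypothetical helper.
  def parseOrFail(columnName: String)(value: String): Option[Int] =
    Option(value).map { v =>
      try v.trim.toInt
      catch {
        case _: NumberFormatException =>
          throw new IllegalArgumentException(
            s"Column '$columnName': cannot cast '$v' to INT")
      }
    }

  // Build one UDF per column so each carries its own column name.
  def strictInt(columnName: String): UserDefinedFunction =
    udf((value: String) => parseOrFail(columnName)(value))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and a two-column stand-in for the question's data.
    val spark = SparkSession.builder().master("local[1]").appName("strict-cast").getOrCreate()
    import spark.implicits._

    val df = Seq(("James ", "36636"), ("Jen", "Rose")).toDF("firstname", "id")
    val casted = df.withColumn("id", strictInt("id")(col("id")))

    Try(casted.collect()) match {
      case Success(rows) => rows.foreach(println)
      case Failure(e) =>
        // Spark may wrap the UDF failure, so print every message in the cause chain.
        Iterator.iterate(e)(_.getCause).takeWhile(_ != null)
          .foreach(t => println(t.getMessage))
    }
    spark.stop()
  }
}
```

For several columns, the same builder can be applied once per column, for example strictInt("salary")(col("salary")), at the performance cost described above.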
