I am reading a CSV file with the code snippet below:
df_pyspark = spark.read.csv("sample_data.csv")
df_pyspark
When I try to print the DataFrame, the output looks like this:
DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]
Every column is reported as "string", even though the columns hold different kinds of data, as shown below:
df_pyspark.show()
|_c0|       _c1|      _c2|                 _c3|        _c4|       _c5|
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name|               email|     gender|     phone|
|  1|    Bidget| Mirfield|bmirfield0@scient...|     Female|5628618353|
|  2|   Gonzalo|    Vango|   [email protected]|       Male|9556535457|
|  3|      Rock| Pampling|rpampling2@guardi...|   Bigender|4472741337|
|  4|   Dorella|  Edelman|dedelman3@histats...|     Female|4303062344|
|  5|     Faber|  Thwaite|fthwaite4@google....|Genderqueer|1348658809|
|  6|     Debee| Philcott|dphilcott5@cafepr...|     Female|7906881842|
How can I print the actual data type of each column?
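Answer from a uj5u.com user:
Pass inferSchema=True when reading the CSV file so that Spark infers each column's data type from its values. Adding header=True makes Spark treat the first row as column names rather than as data; without it, the header text would be read as an ordinary row.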
df_pyspark = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name|               email|     gender|     phone|
+---+----------+---------+--------------------+-----------+----------+
|  1|    Bidget| Mirfield|bmirfield0@scient...|     Female|5628618353|
|  2|   Gonzalo|    Vango|   [email protected]|       Male|9556535457|
|  3|      Rock| Pampling|rpampling2@guardi...|   Bigender|4472741337|
|  4|   Dorella|  Edelman|dedelman3@histats...|     Female|4303062344|
|  5|     Faber|  Thwaite|fthwaite4@google....|Genderqueer|1348658809|
+---+----------+---------+--------------------+-----------+----------+
only showing top 5 rows
df_pyspark.printSchema()
root
|-- id: integer (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- email: string (nullable = true)
|-- gender: string (nullable = true)
|-- phone: long (nullable = true)
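For intuition about why id comes back as integer while phone comes back as long, here is a simplified pure-Python sketch of value-based type inference. It is not Spark's implementation and covers only a few types; `infer_type`, `infer_csv_schema`, and `SAMPLE` are hypothetical names introduced for illustration. The sample rows are taken from the output above, with the email column left out because its values are truncated there.

```python
import csv
import io

# 32-bit signed range: values outside it need a 64-bit "long".
INT_MIN, INT_MAX = -2**31, 2**31 - 1

# Widening order used by this sketch (a small subset of real CSV types).
ORDER = ["integer", "long", "double", "string"]


def infer_type(value):
    """Return the narrowest type name in ORDER that can hold one field."""
    try:
        n = int(value)
        return "integer" if INT_MIN <= n <= INT_MAX else "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def infer_csv_schema(text):
    """Use the first row as the header and infer one type per column."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    types = [None] * len(header)
    for row in rows:
        for i, field in enumerate(row):
            if field == "":
                continue  # assumption: empty fields do not affect the type
            seen = infer_type(field)
            # Keep the widest type observed so far for this column.
            if types[i] is None:
                types[i] = seen
            else:
                types[i] = max(types[i], seen, key=ORDER.index)
    # Columns with no usable values fall back to string in this sketch.
    return [(name, t or "string") for name, t in zip(header, types)]


SAMPLE = """id,first_name,last_name,gender,phone
1,Bidget,Mirfield,Female,5628618353
3,Rock,Pampling,Bigender,4472741337
"""

print(infer_csv_schema(SAMPLE))
```

The phone values exceed 2147483647, the largest 32-bit integer, so the sketch widens that column to long, which matches the printSchema() output above. Inference of this kind has to scan the data, which is why it is opt-in; without it every column is simply left as string.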
