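I have to read a CSV file and validate the DataFrame's column names and column count. The minimum number of columns is 3, and they must be named 'id', 'name' and 'phone'. Having more columns is fine, but there must always be at least these 3 columns with those exact names. Otherwise, the program should fail.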
For example:

Correct:

 ----- ----- -----
|   id| name|phone|
 ----- ----- -----
|3940A|jhon |1345 |
|2BB56|mike | 492 |
|3(401|jose |2938 |
 ----- ----- -----

 ----- ----- ----- -----
|   id| name|phone|unit |
 ----- ----- ----- -----
|3940A|jhon |1345 | 222 |
|2BB56|mike | 492 | 333 |
|3(401|jose |2938 | 444 |
 ----- ----- ----- -----

Incorrect:

 ----- ----- -----
|  sku| nomb|phone|
 ----- ----- -----
|3940A|jhon |1345 |
|2BB56|mike | 492 |
|3(401|jose |2938 |
 ----- ----- -----

 ----- -----
|  sku| name|
 ----- -----
|3940A|jhon |
|2BB56|mike |
|3(401|jose |
 ----- -----
A uj5u.com user replied:

A simple Python if-else statement should do the job:

# df is the DataFrame that was read from the CSV file
mandatory_cols = ["id", "name", "phone"]
if all(c in df.columns for c in mandatory_cols):
    pass  # your logic
else:
    raise ValueError("missing columns!")
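
If the error should also say which columns are missing, the same check can be wrapped in a small function. The following is a minimal sketch, not part of the original answer: `validate_columns` is a hypothetical helper name, and it takes a plain list of column names so that it can be called with `df.columns`.

```python
def validate_columns(columns, required=("id", "name", "phone")):
    """Raise ValueError if any required column name is absent.

    `validate_columns` is a hypothetical helper, not a PySpark function.
    Matching is exact and case-sensitive, as the question requires.
    """
    present = set(columns)
    # Keep the order of `required` so the message is predictable.
    missing = [c for c in required if c not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")


# Extra columns are allowed; only the required names are checked.
validate_columns(["id", "name", "phone", "unit"])  # passes silently

try:
    validate_columns(["sku", "name"])
except ValueError as err:
    print(err)  # missing columns: ['id', 'phone']
```

With a DataFrame this would be called as `validate_columns(df.columns)`; an uncaught `ValueError` stops the program, which is the failure behaviour the question asks for.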
A uj5u.com user replied:

Here is an example of how to check whether the columns exist in a DataFrame:
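from pyspark.sql import Row, SparkSession

# Create (or reuse) a Spark session; the original snippet assumed `spark` already existed
spark = SparkSession.builder.getOrCreate()

def check_columns_exist(cols):
    # Report whether all three required column names are present
    if 'id' in cols and 'name' in cols and 'phone' in cols:
        print("All required columns are present")
    else:
        print("Does not have all the required columns")

# DataFrame with all required columns plus an extra 'unit' column
data = [Row(id="3940A", name="john", phone="1345", unit=222),
        Row(id="2BB56", name="mike", phone="492", unit=333)]
df = spark.createDataFrame(data)
check_columns_exist(df.columns)

# DataFrame that is missing the 'phone' column
data1 = [Row(id="3940A", name="john", unit=222),
         Row(id="2BB56", name="mike", unit=333)]
df1 = spark.createDataFrame(data1)
check_columns_exist(df1.columns)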
Result:
All required columns are present
Does not have all the required columns
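
Note that this function only prints a message, while the question asks for the program to fail. One way to fail early is to inspect the CSV header before the file is loaded into a DataFrame. The sketch below uses only Python's standard `csv` module; `check_csv_header` and the temporary sample file are hypothetical, and it assumes the first row of the file is a comma-separated header.

```python
import csv
import os
import tempfile

REQUIRED = {"id", "name", "phone"}


def check_csv_header(path, required=REQUIRED):
    """Return the header row, or raise ValueError if required names are absent.

    Hypothetical helper; assumes the first row of the file is the header.
    """
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # empty file -> empty header
    missing = sorted(set(required) - {h.strip() for h in header})
    if missing:
        raise ValueError(f"missing columns {missing}")
    return header


# Demo with a throwaway file that lacks 'id' and 'phone'.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write("sku,name\n3940A,jhon\n")

try:
    check_csv_header(tmp.name)
except ValueError as err:
    print("validation failed:", err)
finally:
    os.remove(tmp.name)
```

In a real job the `ValueError` would simply be left uncaught so that the program stops before any further processing.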
