How to read multiple CSV files and merge them into a single DataFrame in PySpark

I have 4 CSV files with different columns. Some of the CSVs also share column names. The details of the CSVs are:

capstone_customers.csv: [customer_id, customer_type, repeat_customer]

capstone_invoices.csv: [invoice_id, product_id, customer_id, days_until_shipped, product_line, total]

capstone_recent_customers.csv: [customer_id, customer_type]

capstone_recent_invoices.csv: [invoice_id, product_id, customer_id, days_until_shipped, product_line, total]

My code is:

df1 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_customers.csv")
df2 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_invoices.csv")
df3 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_customers.csv")
df4 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_invoices.csv")

from functools import reduce

def unite_dfs(df1, df2):
    return df2.union(df1)

list_of_dfs = [df1, df2, df3, df4]
united_df = reduce(unite_dfs, list_of_dfs)

But I get this error:

Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 3 columns;;\n'Union\n:- Relation[invoice_id#234,product_id#235,customer_id#236,days_until_shipped#237,product_line#238,total#239] csv\n+- Relation[customer_id#218,customer_type#219,repeat_customer#220] csv\n

How can I merge these into a single DataFrame and remove the duplicate column names using PySpark?

Reply from a uj5u.com user:

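To read multiple files in Spark, you can list all the files you want and read them in a single call; you don't have to read them one at a time.

Here is a code example you can use: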

path = ['file1.csv', 'file2.csv']
 
df = spark.read.options(header=True).csv(path)
df.show()
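
Reading a list of paths in one call behaves most predictably when every file has the same header, which is not true of all four files in the question. As a minimal sketch, assuming the files sit on the local filesystem and can be opened with Python's `csv` module, the hypothetical helper `group_paths_by_header` (not a Spark API) groups paths by their header row so that each group can be read separately:

```python
import csv
from collections import defaultdict


def group_paths_by_header(paths):
    """Group CSV paths by their header row (hypothetical helper, not a Spark API)."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, newline="") as handle:
            # Read only the first row; strip stray spaces around column names.
            header = tuple(name.strip() for name in next(csv.reader(handle), []))
        groups[header].append(path)
    return dict(groups)
```

Each list of paths in the result could then be passed to one `spark.read.options(header=True).csv(...)` call, assuming `spark` is an existing SparkSession.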

Reply from a uj5u.com user:

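Instead of reading the files one by one, you can pass a list of files, or a directory path, to the reader. Don't forget the mergeSchema option (note that Spark documents mergeSchema for the Parquet and ORC sources, so check that it has the effect you expect on CSV input):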

files = [
   "capstone_customers.csv",
   "capstone_invoices.csv",
   "capstone_recent_customers.csv",
   "capstone_recent_invoices.csv"
]
df = spark.read.options(inferSchema='True',header='True',delimiter=',', mergeSchema='True').csv(files)

# or
df = spark.read.options(inferSchema='True',header='True',delimiter=',',mergeSchema='True').csv('/path/to/files/')
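
The error in the question comes from `union`, which matches columns by position and requires both sides to have the same number of columns. As a hedged alternative sketch, assuming Spark 3.1 or later (where `DataFrame.unionByName` accepts `allowMissingColumns`), the DataFrames can be stacked by column name instead, with columns that a file lacks filled with nulls. The helper name `union_by_name_all` is hypothetical:

```python
from functools import reduce


def union_by_name_all(dfs):
    """Stack DataFrames by column name, filling missing columns with nulls.

    Assumes Spark 3.1+; `union_by_name_all` is an illustrative helper name.
    """
    return reduce(
        lambda left, right: left.unionByName(right, allowMissingColumns=True),
        dfs,
    )


# Usage, assuming df1..df4 were read as in the question:
# united_df = union_by_name_all([df1, df2, df3, df4])
```

Each column name appears only once in the result. Since the customer and invoice files describe different entities, joining them on `customer_id` may suit the data better than stacking rows; which one fits depends on the intended analysis.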
