How to load complex data with PySpark

I have a CSV dataset, shown below in text form:

Timestamp,How old are you?,What industry do you work in?,Job title,What is your annual salary?,Please indicate the currency,Where are you located? (City/state/country),How many years of post-college professional work experience do you have?,"If your job title needs additional context, please clarify here:","If ""Other,"" please indicate the currency here: "
4/24/2019 11:43:21,35-44,Government,Talent Management Asst. Director,75000,USD,"Nashville, TN",11 - 20 years,,
4/24/2019 11:43:26,25-34,Environmental nonprofit,Operations Director,"65,000",USD,"Madison, Wi",8 - 10 years,,
4/24/2019 11:43:27,18-24,Market Research,Market Research Assistant,"36,330",USD,"Las Vegas, NV",2 - 4 years,,
4/24/2019 11:43:27,25-34,Biotechnology,Senior Scientist,34600,GBP,"Cardiff, UK",5-7 years,,
4/24/2019 11:43:29,25-34,Healthcare,Social worker (embedded in primary care),55000,USD,"Southeast Michigan, USA",5-7 years,,
4/24/2019 11:43:29,25-34,Information Management,Associate Consultant,"45,000",USD,"Seattle, WA",8 - 10 years,,
4/24/2019 11:43:30,25-34,Nonprofit ,Development Manager ,"51,000",USD,"Dallas, Texas, United States",2 - 4 years,"I manage our fundraising department, primarily overseeing our direct mail, planned giving, and grant writing programs. ",
4/24/2019 11:43:30,25-34,Higher Education,Student Records Coordinator,"54,371",USD,Philadelphia,8 - 10 years,equivalent to Assistant Registrar,  
4/25/2019 8:35:51,25-34,Marketing,Associate Product Manager,"43,000",USD,"Cincinnati, OH, USA",5-7 years,"I started as the Marketing Coordinator, and was given the ""Associate Product Manager"" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.",

I tried the following code to load the data:

df = spark.read.option("header", "true").option("multiline", "true").option(
    "delimiter", ",").csv("path")

The output is not as expected: the last record is split across the wrong columns.

The value of the last column, "If ""Other,"" please indicate the currency here: ", should be null. The entire string should instead be contained in the preceding column, "If your job title needs additional context, please clarify here:".

I also tried .option('quote','/"').option('escape','/"'), but that did not work either.

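However, when I load this file with Pandas, it loads correctly. I am surprised that Pandas can work out where each new column starts. I could define a String schema for all the columns and load the Pandas result back into a Spark DataFrame, but because I am on an older Spark version that would not run in a distributed manner, so I am looking for a way for Spark to handle this efficiently.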

Any help is much appreciated.
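On why Pandas reads the file correctly: in the common CSV convention, a doubled quote ("") inside a quoted field stands for one literal quote, and Pandas follows that convention by default. Python's built-in csv module does the same, so it can serve as a quick, Spark-free check of what the columns should be. The sketch below is only an illustration; it parses the header and the last record from the sample data, and the name SAMPLE is introduced here for the example.

```python
import csv
import io

# Header and last record copied from the sample data in the question.
SAMPLE = '''Timestamp,How old are you?,What industry do you work in?,Job title,What is your annual salary?,Please indicate the currency,Where are you located? (City/state/country),How many years of post-college professional work experience do you have?,"If your job title needs additional context, please clarify here:","If ""Other,"" please indicate the currency here: "
4/25/2019 8:35:51,25-34,Marketing,Associate Product Manager,"43,000",USD,"Cincinnati, OH, USA",5-7 years,"I started as the Marketing Coordinator, and was given the ""Associate Product Manager"" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.",
'''

# csv.reader defaults to doublequote=True, so "" inside a quoted
# field is read as a single literal double quote.
header, record = list(csv.reader(io.StringIO(SAMPLE)))

print(len(header), len(record))  # both rows should have the same number of fields
print(repr(header[-1]))          # the last column name, with the doubled quotes collapsed
print(repr(record[-1]))          # the last value of the record (an empty string)
```

If both rows report the same number of fields, the file itself is well formed and the mis-split comes from how the reader is configured to treat quotes.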

Answer from a uj5u.com user:

The main problem is the consecutive double quotes in the CSV file. You have to escape the extra double quotes in the CSV file, as shown here:

4/24/2019 11:43:30,25-34,Higher Education,Student Records Coordinator,"54,371",USD,Philadelphia,8 - 10 years,equivalent to Assistant Registrar,  
4/25/2019 8:35:51,25-34,Marketing,Associate Product Manager,"43,000",USD,"Cincinnati, OH, USA",5-7 years,"I started as the Marketing Coordinator, and was given the \"\"Associate Product Manager\"\" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.",

After that change, it produces the expected result:

df2 = spark.read.option("header",True).csv("sample1.csv")

df2.show(10,truncate=False)

******** Output ********

|4/25/2019 8:35:51 |25-34           |Marketing                    |Associate Product Manager               |43,000                     |USD                         |Cincinnati, OH, USA                        |5-7 years                                                               |I started as the Marketing Coordinator, and was given the ""Associate Product Manager"" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.|null       |null                                   |

Alternatively, you can use the code below, which sets the escape character to a double quote:

df2 = spark.read.option("header",True).option("multiline","true").option("escape","\"").csv("sample1.csv")
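The reader call above can also be wrapped in a small helper so the options live in one place. This is a minimal sketch rather than a definitive implementation: the names CSV_OPTIONS and load_survey are hypothetical, and the function assumes an existing SparkSession is passed in. Setting both quote and escape to a double quote asks Spark's CSV reader to treat a doubled quote inside a quoted field as a literal quote.

```python
# Hypothetical helper; the keys are Spark CSV reader option names.
CSV_OPTIONS = {
    "header": "true",     # first line holds the column names
    "multiLine": "true",  # allow quoted fields to span several lines
    "quote": '"',         # fields may be wrapped in double quotes
    "escape": '"',        # a quote inside a quoted field is escaped by doubling it
}


def load_survey(spark, path):
    """Read the CSV at `path` with an existing SparkSession `spark`."""
    return spark.read.options(**CSV_OPTIONS).csv(path)


# Usage, assuming `spark` is an active SparkSession:
# df = load_survey(spark, "sample1.csv")
# df.show(10, truncate=False)
```

Keeping the options in a dictionary makes it easy to reuse the same settings for other files exported in the same format.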


Tags: python pandas apache-spark pyspark
