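I have a CSV file in which one record contains newline characters inside a column (COMMENTS). When I read the file with pyspark, that record is split across multiple rows (3). Setting the multiLine option does not help. Here is my code: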
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProviderAnalysis").master("local[*]").getOrCreate()
provider_df = (
    spark.read.csv("./sample_data/pp.csv", header=True, inferSchema=True, multiLine=True)
)
Here is the record from the CSV file that contains the newline characters:
"A","B","C","D","COMMENTS","E","F","G","H","I","J","K","L","M","N","O","Q","R","S","T","U","V","X","Y","Z","AA","AB","AC","AD","AE","AF","AG","AH"
1,"S","S","R","Pxxxx xxx xxxx. xxxxx xxx ""xxxxx xxx xxxx."" xx xxx xxx xx xxx xxxxx xxxx xx 10/27/24.
xxx xxxxx xxxxx xxxx xxxxx xx 6/30/29 -yyy
10/26/2018 fffff ffffff ff: fffffff-ff","fff",,"","fff","ff","","f","","1","1","","",,"1","","","","","","","","","f","",5,"ffff","",""
If I open the file in LibreOffice Calc, it is displayed as a single record, but pyspark reads it as 3 rows.
Has anyone run into this problem, or can anyone help me solve it? Thanks.
Answer from a uj5u.com user:
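Try adding the escape option. The double-quoted column (COMMENTS) contains double quotes of its own, written as doubled quotes (""), so those inner quotes need to be treated as escaped. Spark's CSV reader uses a backslash as its default escape character, so it does not recognize "" as an escaped quote unless escape is set to '"'.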
provider_df = spark.read.csv("./sample_data/pp.csv",
                             header=True,
                             inferSchema=True,
                             multiLine=True,
                             escape='"')  # <-- added
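
To confirm outside Spark that a record like this is one logical row, a small check with Python's standard csv module may help. This is only a sketch: the sample string below is a shortened, made-up stand-in for the record in the question (three columns instead of the full header), and it assumes the file follows the usual CSV convention of doubling a quote inside a quoted field.

```python
import csv
import io

# Hypothetical, shortened stand-in for the record in the question:
# a quoted COMMENTS field with doubled quotes ("") and two embedded newlines.
sample = (
    '"A","COMMENTS","E"\n'
    '1,"first line ""quoted text"" end of line one\n'
    'second line\n'
    'third line","fff"\n'
)

# newline="" leaves embedded newlines untouched, as the csv docs recommend.
# doublequote=True treats "" inside a quoted field as one literal quote,
# which is the same convention escape='"' asks Spark to apply.
reader = csv.reader(io.StringIO(sample, newline=""), quotechar='"', doublequote=True)
rows = list(reader)

header, record = rows[0], rows[1]
print(len(rows))         # header plus the data records found
print(repr(record[1]))   # COMMENTS value, newlines and inner quotes included
```

If this check reports a single data record while Spark still reports three, the file itself is well formed and the difference lies in the reader options, which is what the escape setting above addresses.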
