I have JSON in the following format:
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}
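Its type is pyspark.sql.dataframe.DataFrame.
How can I split this JSON file into multiple JSON files, one per fruit, and save them in a directory named after the year, using PySpark? For example:
Directory: path.../2020/<all split json files>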
Apple.json
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"4", "fruit": "Apple","cost": "400" }
Kiwi.json
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"6", "fruit": "Kiwi", "cost": "600"}
Mango.json
{"year":"2020", "id":"5", "fruit": "Mango", "cost": "500"}
Cherry.json
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
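Also, if I encounter a different year, how can I write the files in the same way to path.../2021/<all split json files>?
Initially I tried to find all the unique fruits and build a list from them. Then I tried to create multiple DataFrames, push the JSON values into them, and convert each DataFrame to JSON. But I found this inefficient. I also checked this link, but the problem there is that it creates key-value pairs as a dict, which is slightly different.
Then I also learned about the PySpark groupBy method. It seems to make sense, since I could groupBy() the fruit value and then split the JSON file, but I feel like I am missing something.
Answer from a uj5u.com user:
Take the following JSON as an example: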
{"year":"2020", "id":"1", "fruit":"Apple","cost": "100" }
{"year":"2020", "id":"2", "fruit":"Kiwi", "cost": "200"}
{"year":"2020", "id":"3", "fruit":"Cherry", "cost": "300"}
{"year":"2021", "id":"10", "fruit": "Pear","cost": "1000" }
{"year":"2021", "id":"11", "fruit": "Mango", "cost": "1100"}
{"year":"2021", "id":"12", "fruit": "Banana", "cost": "1200"}
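You can partition the data by year and fruit using partitionBy. Note that I created copies of the year and fruit columns, because the columns you partition on are dropped from the records when the data is written to disk.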
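# Read the line-delimited JSON file into a DataFrame (spark is an existing SparkSession)
df = spark.read.json("./ex.json")
# Intended as copies of the partition columns. Caveat: with Spark's default
# case-insensitive column resolution these calls replace "year"/"fruit" in place.
df = df.withColumn("Year", df["year"])
df = df.withColumn("Fruit", df["fruit"])
# Write one subdirectory per (Year, Fruit) pair under "result"
df.write.partitionBy("Year", "Fruit").json("result")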
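One caveat with this snippet: Spark resolves column names case-insensitively by default, so withColumn("Year", df["year"]) replaces the existing year column instead of adding a second one, and the written records then lose year and fruit. Below is a minimal sketch of a variant that keeps both fields inside each record by giving the partition columns clearly distinct names. The function name write_by_year_and_fruit and the column names year_dir and fruit_dir are hypothetical, chosen for illustration, and df is assumed to be a PySpark DataFrame with year and fruit columns.

```python
def write_by_year_and_fruit(df, out_dir):
    """Write df as JSON with one subdirectory per (year, fruit) pair.

    df is assumed to be a pyspark.sql.DataFrame with "year" and "fruit" columns.
    """
    # Hypothetical key names: they differ from the data columns by more than
    # letter case, so they are added as new columns rather than replacing them.
    with_keys = (
        df.withColumn("year_dir", df["year"])
          .withColumn("fruit_dir", df["fruit"])
    )
    # partitionBy moves the key columns into the directory names, so only
    # year_dir and fruit_dir are dropped from the written records.
    (
        with_keys.write
        .mode("overwrite")  # assumption: replacing an existing out_dir is acceptable
        .partitionBy("year_dir", "fruit_dir")
        .json(out_dir)
    )
```

Calling write_by_year_and_fruit(df, "result") would create directories prefixed with year_dir= and fruit_dir=. The directory listing that follows is for the original snippet, which uses the Year= and Fruit= prefixes.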
Writing with partitionBy("Year", "Fruit") produces a folder named RESULT with the following structure.
|-- RESULT
| |-- Year=2020
| | |-- Fruit=Apple
| | | |-- part0000-dcea0683...json
| | |-- Fruit=Cherry
| | | |-- part0000-dcea0683...json
| | |-- Fruit=Kiwi
| | | |-- part0000-dcea0683...json
| |-- Year=2021
| | |-- Fruit=Banana
| | | |-- part0000-dcea0683...json
| | |-- Fruit=Mango
| | | |-- part0000-dcea0683...json
| | |-- Fruit=Pear
| | | |-- part0000-dcea0683...json
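The layout above is not quite the path.../2020/Apple.json layout asked for in the question, because Spark chooses the part file names itself. A minimal post-processing sketch using only the Python standard library could gather each partition into a single file. It assumes the output sits on a local filesystem, that the directories use the Year=/Fruit= prefixes shown above, and that the year and fruit values contain no characters Spark would escape in directory names. The helper name collect_partitions is hypothetical.

```python
import shutil
from pathlib import Path


def collect_partitions(result_dir, target_dir):
    """Copy Year=<y>/Fruit=<f>/part*.json files to <target_dir>/<y>/<f>.json."""
    result_dir, target_dir = Path(result_dir), Path(target_dir)
    written = []
    for fruit_dir in sorted(result_dir.glob("Year=*/Fruit=*")):
        if not fruit_dir.is_dir():
            continue
        # Directory names look like "Year=2020" and "Fruit=Apple".
        year = fruit_dir.parent.name.split("=", 1)[1]
        fruit = fruit_dir.name.split("=", 1)[1]
        out_path = target_dir / year / f"{fruit}.json"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        # A partition may hold several part files; concatenate them in name order.
        # Checksum and marker files do not match the "part*.json" pattern.
        with out_path.open("wb") as out:
            for part in sorted(fruit_dir.glob("part*.json")):
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
        written.append(out_path)
    return written
```

For example, collect_partitions("result", "by_year") would create by_year/2020/Apple.json, by_year/2021/Pear.json, and so on, leaving the Spark output untouched. For data on HDFS or object storage the same idea would need that storage system's own file API instead of pathlib.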
