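- We have a DataFrame that we want to write to S3 in Parquet format using overwrite mode.
- Each time we write the DataFrame, it goes to a new folder. The code that writes to the S3 location is as follows: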
df.write
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
.option("maxRecordsPerFile", maxRecordsPerFile)
.mode("overwrite")
.format(format)
.save(output)
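What we observe is that we sometimes get a FileNotFoundException (full trace below). Can someone help me understand:
- When I am writing to a new S3 location (meaning nobody is reading from that location), why does the write throw the exception below?
- How can I fix it? I have seen several Stack Overflow posts about this exception, but they say it happens when you try to read while a write is in progress. That is not my case: I am not reading while the write happens.
- My Spark version is 2.3.2, on EMR 5.18.1; the code is written in Scala.
- I am using s3:// as the output folder path. Should I change it to s3n or s3a? Would that help?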
Caused by: java.io.FileNotFoundException: No such file or directory 's3://BUCKET/snapshots/FOLDER/_bid_9223370368440344985/part-00020-693dfbcb-74e9-45b0-b892-0b19fa92365c-c000.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:104)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Answer:
我終于能夠解決問題
df : DataFrame形成在同一檔案夾中,該s3檔案夾正在以overwrite模式寫入。所以在
overwrite; 源檔案夾正在被清除——這導致了錯誤
希望這對某人有幫助。
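
One way to catch this mistake before the job runs is to check that the folder the DataFrame was read from does not overlap the folder being overwritten. The following is a minimal sketch in plain Scala, not part of the original answer: the object name `OverwriteGuard` and its methods are hypothetical helpers, and it assumes both locations are simple URI-style paths such as `s3://BUCKET/prefix`.

```scala
import java.net.URI

// Hypothetical helper (not a Spark API): detects when an input path and an
// output path point at the same folder, or when one is nested in the other.
object OverwriteGuard {

  // Reduce "s3://bucket/a/b/" to "bucket/a/b/" so that prefix comparisons
  // work on whole path segments. The scheme (s3, s3a, s3n) is ignored on
  // purpose, since all of them can address the same bucket.
  def normalize(path: String): String = {
    val uri = new URI(path)
    val bucket = Option(uri.getAuthority).getOrElse("")
    val key = Option(uri.getPath).getOrElse("").stripSuffix("/")
    s"$bucket$key/"
  }

  // True when overwriting `output` would delete files under `input`,
  // or when `output` itself lives inside `input`.
  def overlaps(input: String, output: String): Boolean = {
    val src = normalize(input)
    val dst = normalize(output)
    src.startsWith(dst) || dst.startsWith(src)
  }
}
```

With a helper like this, a call such as `require(!OverwriteGuard.overlaps(input, output), "input and output folders overlap")` could be placed just before `df.write ... .save(output)`, where `input` is an assumed name for the path that df was read from. If the check fails, the fix described above applies: read from and write to different folders.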
