保存Parquet檔案時_temporary/0目錄上的FileNotFoundException-有解無憂

在 Azure HDInsight 群集上使用 Python，我們使用以下代碼將 Spark 資料幀作為 Parquet 檔案保存到 Azure Data Lake Storage Gen2：

df.write.parquet('abfs://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath, 'overwrite', compression='snappy')

這通常是有效的，但是當我們最近升級我們的集群以同時運行更多腳本時（大約 10 到 15 個），對于不同的一小部分腳本，我們始終會遇到以下例外：

Py4JJavaError：呼叫 o2232.parquet 時出錯。：java.io.FileNotFoundException：操作失敗：“指定的路徑不存在。”，404，PUT，https: //my_storage_account.dfs.core.windows.net/mypath/_temporary/0 ? resource = directory & timeout =90， PathNotFound, "指定的路徑不存在。"

我認為所有 Spark 作業和任務實際上都成功了，也是保存表的作業和任務，但是 Python 腳本以例外形式退出。

背景資料

我們使用的是 Spark 2.4.5.4.1.1.2。使用 Scala 版本 2.11.12、OpenJDK 64 位服務器 VM、1.8.0_265、Hadoop 3.1.2.4.1.1.2

堆疊跟蹤：

  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 843, in parquet
    df_to_save.write.parquet(blob_path, mode, compression='snappy')
    self._jwrite.parquet(path)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o2232.parquet.
: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, PUT, https://my_dwh_container@my_storage_account.dfs.core.windows.net/mypath/_temporary/0?resource=directory&timeout=90, PathNotFound, "The specified path does not exist. RequestId:1870ec49-e01f-0101-72f8-f260fe000000 Time:2021-12-17T03:42:35.8434071Z"
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1178)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.mkdirs(AzureBlobFileSystem.java:477)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:2288)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:382)
    at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:162)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:139)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

日志：

21/12/17 03:42:02 INFO DAGScheduler [Thread-11]: Job 2 finished: saveAsTable at NativeMethodAccessorImpl.java:0, took 1.120535 s
21/12/17 03:42:02 INFO FileFormatWriter [Thread-11]: Write Job 11fc45a5-d398-4f9a-8350-f928c3722886 committed.
21/12/17 03:42:02 INFO FileFormatWriter [Thread-11]: Finished processing stats for write job 11fc45a5-d398-4f9a-8350-f928c3722886.
(...)
21/12/17 03:42:05 INFO ParquetFileFormat [Thread-11]: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: File Output Committer Algorithm version is 2
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false, move _temporary folders into Trash: false
21/12/17 03:42:05 INFO SQLHadoopMapReduceCommitProtocol [Thread-11]: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: File Output Committer Algorithm version is 2
21/12/17 03:42:05 INFO FileOutputCommitter [Thread-11]: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false, move _temporary folders into Trash: false
21/12/17 03:42:05 INFO SQLHadoopMapReduceCommitProtocol [Thread-11]: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
21/12/17 03:42:28 ERROR ApplicationMaster [Driver]: User application exited with status 1
21/12/17 03:42:28 INFO ApplicationMaster [Driver]: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)

這個例外還有另一個版本，它確實發生在 Spark 任務中，然后失敗，但 Spark 會自動重新啟動失敗的任務，然后通常會成功。在某些情況下，AM 會報告應用程式失敗，但我不明白為什么，因為所有作業都成功了。

可能的原因

As seen in Spark _temporary creation reason I would expect that the _temporary directory will not be moved until all tasks are done. Looking at the stacktrace, it happens in AzureBlobFileSystem.mkdirs, which suggests to me that it's trying to create subdirectories somewhere under _temporary/0, but it cannot find the 0 directory. I'm not sure if the _temporary directory exists at that point.

Related questions

https://issues.apache.org/jira/browse/SPARK-2984 It does sound similar, but I don't see tasks being restarted because they take long, and this should have been fixed a long time ago anyway. I'm not completely sure if speculative execution is visible in the Spark UI though.
Saving dataframe to local file system results in empty results We are not saving to any local file system (even though the error message does say https, the stacktrace shows AzureBlobFileSystem.
Spark Dataframe Write to CSV creates _temporary directory file in Standalone Cluster Mode We are using HDFS and also file output committer 2
Multiple spark jobs appending parquet data to same base path with partitioning I don't think two jobs make use of the same directory here
https://community.datastax.com/questions/3082/while-writing-to-parquet-file-on-hdfs-throwing-fil.html I don't think this is a permissions issue, as most of the time it does work.
Extremely slow S3 write times from EMR/ Spark We don't have any problems regarding slow renaming, as far as I know (the files aren't very large anyway). I think it fails before renaming, so a zero-rename committer wouldn't help here?
https://support.huaweicloud.com/intl/en-us/trouble-mrs/mrs_03_0084.html Suggests to look in the namenode audit log of hdfs, but haven't yet found it.
https://github.com/apache/hadoop/blob/b7d2135f6f5cea7cf5d5fc5a2090fc5d8596969e/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L355 Since the stacktrace shows it fails at mkdirs, I'm guessing the _temporary itself doesn't exist, but I don't understand why mkdirs doesn't create it. But I don't think AzureBlobFileSystem is open source?
I did find some version of AzureBlobFileSystem.java but based on the stacktrace it would go to checkException with a PATH_ALREADY_EXISTS flag which doesn't make sense to me.

Possible options to try:

Pyspark 資料幀寫入拼花而不洗掉 /_temporary 檔案夾我們可以嘗試的是，首先保存到不同的 HDFS，然后復制最終檔案。我不確定它為什么會有幫助，因為我們已經在保存到 HDFS（嗯，它的擴展，ADFS）。
https://community.cloudera.com/t5/Support-Questions/How-to-change-Spark-temporary-directory-when-writing-data/td-p/237389我們可以嘗試自己使用追加和洗掉檔案。
更改 spark _temporary 目錄路徑使用我們自己的 FileOutputCommitter 對這個問題聽起來有點矯枉過正

uj5u.com熱心網友回復：

ABFS 是一個“真正的”檔案系統，因此不需要 S3A 零重命名提交者。確實，它們不會起作用。并且客戶端是完全開源的 - 查看hadoop-azure模塊。

ADLS gen2 存盤確實存在規模問題，但除非您嘗試提交 10,000 個檔案，或清理大量深層目錄樹，否則您不會遇到這些問題。如果您確實收到有關 Elliott 重命名單個檔案的錯誤訊息，并且您正在執行這種規模的作業 (a) 與 Microsoft 討論增加分配的容量，并且 (b) 選擇https://github.com/apache/hadoop/拉/2971

這不是。我猜想實際上您有多個作業寫入相同的輸出路徑，一個正在清理，而另一個正在設定。特別是 - 他們似乎都有一個“0”的作業 ID。因為使用了相同的作業 ID，所以只有當任務設定和任務清理混淆時，才有可能當一個作業提交時，它包含來自所有已成功提交的任務嘗試的作業 2 的輸出。

我相信這是 Spark 獨立部署的一個已知問題，盡管我找不到相關的 JIRA。SPARK-24552很接近，但應該已在您的版本中修復。SPARK-33402 在同一秒啟動的作業具有重復的 MapReduce JobID。那是關于僅來自系統當前時間的作業 ID，而不是 0。但是：您可以嘗試升級您的 Spark 版本以查看它是否消失。

我的建議

確保您的作業不會同時寫入同一個表。事情會變得一團糟。
獲取您滿意的最新版本 spark

轉載請註明出處，本文鏈接：https://www.uj5u.com/ruanti/389724.html

標籤：Python 阿帕奇火花 Hadoop 高密度文件 azure-blob-storage

上一篇：模擬的AngularnavigateByUrl仍在運行頁面重新加載

下一篇：如何在Scala中創建鑲木地板？