我真的需要一些幫助:
我們使用 Spark3.1.2 使用獨立集群。自從我們開始使用 s3a 目錄提交器以來,我們的 Spark 作業穩定性和性能顯著提高!
然而,最近幾天我們完全困惑于解決這個 s3a 目錄提交者問題,不知道你是否知道發生了什么?
由于 Java OOM(或者更確切地說是行程限制)錯誤,我們的 spark 作業失敗:
An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:803)
at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
at java.base/java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:714)
at org.apache.spark.rpc.netty.DedicatedMessageLoop.$anonfun$new$1(MessageLoop.scala:174)
at org.apache.spark.rpc.netty.DedicatedMessageLoop.$anonfun$new$1$adapted(MessageLoop.scala:173)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at org.apache.spark.rpc.netty.DedicatedMessageLoop.<init>(MessageLoop.scala:173)
at org.apache.spark.rpc.netty.Dispatcher.liftedTree1$1(Dispatcher.scala:75)
at org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:72)
at org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:136)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:231)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:394)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
Spark Thread Dump 在 Spark 驅動程式上顯示了超過 5000 個提交者執行緒!下面是一個例子:
Thread ID Thread Name Thread State Thread Locks
1047 s3-committer-pool-0 WAITING
1449 s3-committer-pool-0 WAITING
1468 s3-committer-pool-0 WAITING
1485 s3-committer-pool-0 WAITING
1505 s3-committer-pool-0 WAITING
1524 s3-committer-pool-0 WAITING
1529 s3-committer-pool-0 WAITING
1544 s3-committer-pool-0 WAITING
1549 s3-committer-pool-0 WAITING
1809 s3-committer-pool-0 WAITING
1972 s3-committer-pool-0 WAITING
1998 s3-committer-pool-0 WAITING
2022 s3-committer-pool-0 WAITING
2043 s3-committer-pool-0 WAITING
2416 s3-committer-pool-0 WAITING
2453 s3-committer-pool-0 WAITING
2470 s3-committer-pool-0 WAITING
2517 s3-committer-pool-0 WAITING
2534 s3-committer-pool-0 WAITING
2551 s3-committer-pool-0 WAITING
2580 s3-committer-pool-0 WAITING
2597 s3-committer-pool-0 WAITING
2614 s3-committer-pool-0 WAITING
2631 s3-committer-pool-0 WAITING
2726 s3-committer-pool-0 WAITING
2743 s3-committer-pool-0 WAITING
2763 s3-committer-pool-0 WAITING
2780 s3-committer-pool-0 WAITING
2819 s3-committer-pool-0 WAITING
2841 s3-committer-pool-0 WAITING
2858 s3-committer-pool-0 WAITING
2875 s3-committer-pool-0 WAITING
2925 s3-committer-pool-0 WAITING
2942 s3-committer-pool-0 WAITING
2963 s3-committer-pool-0 WAITING
2980 s3-committer-pool-0 WAITING
3020 s3-committer-pool-0 WAITING
3037 s3-committer-pool-0 WAITING
3055 s3-committer-pool-0 WAITING
3072 s3-committer-pool-0 WAITING
3127 s3-committer-pool-0 WAITING
3144 s3-committer-pool-0 WAITING
3163 s3-committer-pool-0 WAITING
3180 s3-committer-pool-0 WAITING
3222 s3-committer-pool-0 WAITING
3242 s3-committer-pool-0 WAITING
3259 s3-committer-pool-0 WAITING
3278 s3-committer-pool-0 WAITING
3418 s3-committer-pool-0 WAITING
3435 s3-committer-pool-0 WAITING
3452 s3-committer-pool-0 WAITING
3469 s3-committer-pool-0 WAITING
3486 s3-committer-pool-0 WAITING
3491 s3-committer-pool-0 WAITING
3501 s3-committer-pool-0 WAITING
3508 s3-committer-pool-0 WAITING
4029 s3-committer-pool-0 WAITING
4093 s3-committer-pool-0 WAITING
4658 s3-committer-pool-0 WAITING
4666 s3-committer-pool-0 WAITING
4907 s3-committer-pool-0 WAITING
5102 s3-committer-pool-0 WAITING
5119 s3-committer-pool-0 WAITING
5158 s3-committer-pool-0 WAITING
5175 s3-committer-pool-0 WAITING
5192 s3-committer-pool-0 WAITING
5209 s3-committer-pool-0 WAITING
5226 s3-committer-pool-0 WAITING
5395 s3-committer-pool-0 WAITING
5634 s3-committer-pool-0 WAITING
5651 s3-committer-pool-0 WAITING
5668 s3-committer-pool-0 WAITING
5685 s3-committer-pool-0 WAITING
5702 s3-committer-pool-0 WAITING
5722 s3-committer-pool-0 WAITING
5739 s3-committer-pool-0 WAITING
6144 s3-committer-pool-0 WAITING
6167 s3-committer-pool-0 WAITING
6289 s3-committer-pool-0 WAITING
6588 s3-committer-pool-0 WAITING
6628 s3-committer-pool-0 WAITING
6645 s3-committer-pool-0 WAITING
6662 s3-committer-pool-0 WAITING
6675 s3-committer-pool-0 WAITING
6692 s3-committer-pool-0 WAITING
6709 s3-committer-pool-0 WAITING
7049 s3-committer-pool-0 WAITING
這是考慮到我們的設定不允許超過100個執行緒……或者我們不明白什么……
這是我們的配置和設定:
fs.s3a.threads.max 100
fs.s3a.connection.maximum 1000
fs.s3a.committer.threads 16
fs.s3a.max.total.tasks 5
fs.s3a.committer.name directory
fs.s3a.fast.upload.buffer disk
io.file.buffer.size 1048576
mapreduce.outputcommitter.factory.scheme.s3a - org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
我們曾嘗試過不同版本的 spark Hadoop 云庫,但問題始終如一。
https://repository.cloudera.com/content/repositories/releases/org/apache/spark/spark-hadoop-cloud_2.11/2.4.0-cdh6.3.2/spark-hadoop-cloud_2.11-2.4.0-cdh6.3.2.jar
https://repository.cloudera.com/artifactory/libs-release-local/org/apache/spark/spark-hadoop-cloud_2.11/2.4.0.7.0.3.0-79/spark-hadoop-cloud_2.11-2.4.0.7.0.3.0-79.jar
https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.2.0/spark-hadoop-cloud_2.12-3.2.0.jar
https://repository.cloudera.com/artifactory/libs-release-local/org/apache/spark/spark-hadoop-cloud_2.12/3.1.2.7.2.12.0-291/spark-hadoop-cloud_2.12-3.1.2.7.2.12.0-291.jar
We'd really appreciate if you can point us in the right direction ??
Thank you for your time!
uj5u.com熱心網友回復:
這將是HADOOP-16570 S3A 提交者泄漏執行緒/引發大規模作業/任務提交的 OOM
向上移動到 hadoop-3.3.0 二進制檔案以進行修復。理想情況下到 3.3.1 來解決一些其他問題,尤其是來自 spark 的重復作業 ID。不確定修復的 CDH 版本可以追溯到多遠;如果你現在真的需要,我可以解決。當然不是 CDH6.x
uj5u.com熱心網友回復:
查看這篇關于 S3A 調優的文章。
具體來說,即使它歸類在陣列下,我也會看一下:
您可能需要仔細調整以降低耗盡記憶體的風險,尤其是當資料緩沖在記憶體中時。有許多引數可以調整:
檔案系統中可用于資料上傳或任何其他排隊檔案系統操作的執行緒總數。這是在 fs.s3a.threads.max 中設定的。
可以排隊等待執行、等待執行緒的運算元。這是在 fs.s3a.max.total.tasks 中設定的。
單個輸出流可以處于活動狀態的塊數(即由執行緒上傳或在檔案系統執行緒佇列中排隊)。這是在 fs.s3a.fast.upload.active.blocks 中設定的。
空閑執行緒在退出之前可以在執行緒池中停留的時間長度。這是在 fs.s3a.threads.keepalivetime 中設定的。
我認為您可能會發現減少執行緒數會消除記憶體壓力。
我還建議您調整fs.s3a.fast.upload.active.blocks它也可以減輕記憶體壓力。我認為減少執行緒數應該是你的第一步,因為 100 有點激進。您可能會受到帶寬限制,并且額外的執行緒除了消耗記憶體之外不可能做任何事情。
轉載請註明出處,本文鏈接:https://www.uj5u.com/yidong/368857.html
標籤:java apache-spark hadoop amazon-s3 pyspark
