Spark job hangs while running; asking the experts for help

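My cluster has 7 machines: 5 of them have 8 GB of memory each, and the other two are virtual machines.
After writing the program I submitted it with spark-submit and it ran to completion. When I re-ran it today, however, it hit a problem. The log shows: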
17/03/26 10:10:32 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 196.168.168.104:59612 (size: 119.0 B, free: 3.0 GB)
17/03/26 10:10:32 INFO BlockManagerInfo: Added broadcast_1730_piece0 in memory on 196.168.168.104:59612 (size: 119.0 B, free: 3.0 GB)
17/03/26 10:10:32 INFO BlockManagerInfo: Added broadcast_1732_piece0 in memory on 196.168.168.104:59612 (size: 119.0 B, free: 3.0 GB)
17/03/26 10:10:32 INFO BlockManagerInfo: Added broadcast_1733_piece0 in memory on 196.168.168.104:59612 (size: 119.0 B, free: 3.0 GB)
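It stays stuck at this point. I have tried many things and none of them fixed it, probably because I don't know the cause. Could someone take a look and offer some suggestions? Thanks.

uj5u.com user reply:

Looking through the logs, I found: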
17/03/25 22:52:32 INFO ExternalSorter: Thread 82 spilling in-memory map of 473.6 MB to disk (25 times so far)
17/03/25 22:52:37 INFO ExternalSorter: Thread 71 spilling in-memory map of 392.0 MB to disk (26 times so far)
17/03/25 22:52:52 INFO ExternalSorter: Thread 80 spilling in-memory map of 392.0 MB to disk (22 times so far)
17/03/25 22:53:07 INFO ExternalSorter: Thread 70 spilling in-memory map of 392.0 MB to disk (24 times so far)
17/03/25 22:53:38 INFO ExternalSorter: Thread 79 spilling in-memory map of 401.9 MB to disk (27 times so far)
17/03/25 22:53:49 INFO ExternalSorter: Thread 83 spilling in-memory map of 416.0 MB to disk (24 times so far)
17/03/25 22:53:53 INFO ExternalSorter: Thread 82 spilling in-memory map of 396.8 MB to disk (26 times so far)
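I don't know how to fix this.

uj5u.com user reply:

I ran into this problem too and was stuck at this step for a long time. Has the original poster solved it?

uj5u.com user reply:

Not yet. It is still unsolved and I'm quite puzzled; I need an expert to come to the rescue.

uj5u.com user reply:

The original poster hasn't provided enough information.
To start with: what is the Spark version, what does the application code look like, which task is it stuck on, and how is memory configured?
It looks like memory is insufficient, which causes frequent spills to disk.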

uj5u.com user reply:

Spark 2.1
Here is the spark-submit command:
/root/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
--class com.sirc.zwz.CSRJava.ChangeDataStruction.SCSR \
--num-executors 100 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 8 \
/root/jars/SparkCSR_JAVA-0.0.1-SNAPSHOT.jar
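The cluster has 7 machines: 1 master and 6 slaves. Four of the slaves have 8 GB of memory each and can give Spark at most 6 GB each; the other 2 are virtual machines with 2 GB each, and each provides 1 GB for computation.

Here is part of the log: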
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-017dbc57-6553-43ea-8a2d-3555fccd663d:NORMAL:196.168.168.103:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6eb004b2-b3dc-42df-b212-ffa2fd6b5572:NORMAL:196.168.168.27:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW]]}
2017-03-28 11:21:23,382 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000017_21/part-00017. BP-2089499914-196.168.168.100-1490492430641 blk_1073742807_1983{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6eb004b2-b3dc-42df-b212-ffa2fd6b5572:NORMAL:196.168.168.27:50010|RBW]]}
2017-03-28 11:21:23,459 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000029_33/part-00029. BP-2089499914-196.168.168.100-1490492430641 blk_1073742808_1984{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW], ReplicaUnderConstruction[[DISK]DS-017dbc57-6553-43ea-8a2d-3555fccd663d:NORMAL:196.168.168.103:50010|RBW]]}
2017-03-28 11:21:23,509 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000026_30/part-00026. BP-2089499914-196.168.168.100-1490492430641 blk_1073742809_1985{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-97e7de7f-fbcd-44bb-821d-4d245f1ce82c:NORMAL:196.168.168.101:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6eb004b2-b3dc-42df-b212-ffa2fd6b5572:NORMAL:196.168.168.27:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW]]}
2017-03-28 11:21:23,513 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000009_13/part-00009. BP-2089499914-196.168.168.100-1490492430641 blk_1073742810_1986{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-017dbc57-6553-43ea-8a2d-3555fccd663d:NORMAL:196.168.168.103:50010|RBW], ReplicaUnderConstruction[[DISK]DS-99ba79bc-da18-4d0d-9a2c-b7b367cbea66:NORMAL:196.168.168.29:50010|RBW]]}
2017-03-28 11:21:23,521 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170327232029_0002_m_000013_17/part-00013. BP-2089499914-196.168.168.100-1490492430641 blk_1073742811_1987{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6eb004b2-b3dc-42df-b212-ffa2fd6b5572:NORMAL:196.168.168.27:50010|RBW]]}
2017-03-28 11:21:24,090 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /SRResult/N25E118/_temporary/0/_temporary/attempt_20170328112029_0002_m_000019_23/part-00019. BP-2089499914-196.168.168.100-1490492430641 blk_1073742812_1988{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-017dbc57-6553-43ea-8a2d-3555fccd663d:NORMAL:196.168.168.103:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5785ace1-a611-479b-b360-79562081feb1:NORMAL:196.168.168.104:50010|RBW], ReplicaUnderConstruction[[DISK]DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da:NORMAL:196.168.168.102:50010|RBW]]}
2017-03-28 11:27:42,734 INFO BlockStateChange: BLOCK* processReport: from storage DS-411c0e4c-86c5-4203-94d8-d6d7a95df7da node DatanodeRegistration(196.168.168.102, datanodeUuid=5407fb12-70a4-48d2-ac27-813a7833434c, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-18972982-c034-4dd4-b10b-d6563325e4cb;nsid=220744474;c=0), blocks: 20, hasStaleStorages: false, processing time: 1 msecs
2017-03-28 12:15:58,137 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 196.168.168.100
2017-03-28 12:15:58,137 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2017-03-28 12:15:58,137 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 7487
2017-03-28 12:15:58,137 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 360 Total time for transactions(ms): 33 Number of transactions batched in Syncs: 17 Number of syncs: 147 SyncTimes(ms): 1310 712
2017-03-28 12:15:58,160 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 360 Total time for transactions(ms): 33 Number of transactions batched in Syncs: 17 Number of syncs: 148 SyncTimes(ms): 1328 716
2017-03-28 12:15:58,161 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop/hadoop-2.6.5/name1/current/edits_inprogress_0000000000000007487 -> /root/hadoop/hadoop-2.6.5/name1/current/edits_0000000000000007487-0000000000000007846
2017-03-28 12:15:58,161 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop/hadoop-2.6.5/name2/current/edits_inprogress_0000000000000007487 -> /root/hadoop/hadoop-2.6.5/name2/current/edits_0000000000000007487-0000000000000007846
2017-03-28 12:15:58,161 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 7847
2017-03-28 12:15:58,551 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 0.08s at 139.24 KB/s
2017-03-28 12:15:58,551 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000007846 size 12281 bytes.
2017-03-28 12:15:58,618 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 7486
2017-03-28 12:15:58,618 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/root/hadoop/hadoop-2.6.5/name1/current/fsimage_0000000000000007447, cpktTxId=0000000000000007447)
2017-03-28 12:15:58,619 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/root/hadoop/hadoop-2.6.5/name2/current/fsimage_0000000000000007447, cpktTxId=0000000000000007447)
2017-03-28 12:48:56,541 INFO BlockStateChange: BLOCK* processReport: from storage DS-99ba79bc-da18-4d0d-9a2c-b7b367cbea66 node DatanodeRegistration(196.168.168.29, datanodeUuid=9efd8c9e-162c-4c45-af71-bf33f49ad408, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-18972982-c034-4dd4-b10b-d6563325e4cb;nsid=220744474;c=0), blocks: 13, hasStaleStorages: false, processing time: 1 msecs
2017-03-28 13:15:58,890 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 196.168.168.100
2017-03-28 13:15:58,890 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2017-03-28 13:15:58,890 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 7847
2017-03-28 13:15:58,891 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 70 46
2017-03-28 13:15:58,948 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 3 SyncTimes(ms): 105 68
2017-03-28 13:15:58,949 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop/hadoop-2.6.5/name1/current/edits_inprogress_0000000000000007847 -> /root/hadoop/hadoop-2.6.5/name1/current/edits_0000000000000007847-0000000000000007848
2017-03-28 13:15:58,950 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop/hadoop-2.6.5/name2/current/edits_inprogress_0000000000000007847 -> /root/hadoop/hadoop-2.6.5/name2/current/edits_0000000000000007847-0000000000000007848
2017-03-28 13:15:58,951 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 7849
2017-03-28 13:15:59,856 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 0.20s at 55.28 KB/s
2017-03-28 13:15:59,856 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000007848 size 12281 bytes.
2017-03-28 13:16:00,041 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 7846
2017-03-28 13:16:00,041 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/root/hadoop/hadoop-2.6.5/name1/current/fsimage_0000000000000007486, cpktTxId=0000000000000007486)
2017-03-28 13:16:00,041 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/root/hadoop/hadoop-2.6.5/name2/current/fsimage_0000000000000007486, cpktTxId=0000000000000007486)
2017-03-28 13:43:11,290 INFO logs: Aliases are enabled

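uj5u.com user reply:

With a small amount of data the job finishes quickly, but with a large amount it hangs, possibly for several hours at this point. Memory should be sufficient, because the job has completed successfully several times before.

uj5u.com user reply:

I also noticed that when the program starts, CPU usage is fairly normal, staying around 1% to 30%. Once it reaches the point where it hangs, usage shoots up, peaking at 790%. It is very likely a memory problem. Any pointers would be appreciated.

uj5u.com user reply:

I think your virtual machines may be the cause. Although you set executor-memory to 6G, a virtual machine actually has only 1G available. When the computation needs a lot of memory, the virtual machines can do nothing but spill to disk over and over.
You can check the web UI on port 4040 (it may be 404X) to see exactly which task and which executor it is stuck on.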
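
To make the point about the virtual machines concrete, here is a minimal sketch that counts how many executors of a given size fit on each node. The node sizes are the ones quoted in the thread (four nodes offering 6 GB, two virtual machines offering 1 GB). The class and method names are hypothetical, and the calculation ignores per-executor overhead, so real capacity would be somewhat lower.

```java
import java.util.Arrays;

/** Hypothetical helper: how many executors of a given size fit on each node. */
final class ExecutorFit {

    /** Number of executors of executorMb that fit in nodeMb (overhead ignored). */
    static int executorsThatFit(int nodeMb, int executorMb) {
        if (executorMb <= 0) {
            throw new IllegalArgumentException("executorMb must be positive");
        }
        return nodeMb / executorMb; // integer division rounds down
    }

    /** Total number of executors that fit across all nodes. */
    static int totalExecutors(int[] nodesMb, int executorMb) {
        return Arrays.stream(nodesMb).map(n -> executorsThatFit(n, executorMb)).sum();
    }

    public static void main(String[] args) {
        // Memory offered to Spark per worker, as described in the thread.
        int[] nodesMb = {6144, 6144, 6144, 6144, 1024, 1024};
        int executorMb = 6144; // --executor-memory 6g

        for (int nodeMb : nodesMb) {
            System.out.printf("node with %d MB -> %d executor(s)%n",
                    nodeMb, executorsThatFit(nodeMb, executorMb));
        }
        System.out.println("total: " + totalExecutors(nodesMb, executorMb));
    }
}
```

Under these assumptions only the four physical nodes can host a 6 GB executor, so at most 4 executors fit, not the 100 requested.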

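uj5u.com user reply:

I tried shutting down the two virtual machines and running again, but hit the same problem: it hangs on one stage, the one that executes saveAsTextFile.

uj5u.com user reply:

1. The more executors one machine's memory is divided among, the less memory each executor gets, which leads to excessive spilling or even out-of-memory errors.
2. Try increasing this parameter: spark.shuffle.memoryFraction
* Parameter description: this parameter sets the fraction of executor memory that a task may use for aggregation during a shuffle, after it has fetched the output of the previous stage's tasks. The default is 0.2, meaning that by default only 20% of executor memory is available for this operation. If the aggregation uses more than that 20% limit, the excess data is spilled to disk files, which severely degrades performance.
* Tuning advice: if the Spark job persists few RDDs but shuffles a lot, lower the fraction of memory given to persistence and raise the fraction given to shuffle, so that the shuffle does not run out of memory and spill to disk when there is a lot of data. In addition, if the job runs slowly because of frequent GC, the tasks do not have enough memory for user code, and it is likewise advisable to lower this parameter.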
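
The description above matches Spark's legacy memory model. As far as I know, since Spark 1.6 `spark.shuffle.memoryFraction` and `spark.storage.memoryFraction` are only read when `spark.memory.useLegacyMode` is enabled. Otherwise execution and storage share one region sized by `spark.memory.fraction` (default 0.6 in Spark 2.x) after about 300 MB is reserved. Here is a rough sketch of that arithmetic, with a hypothetical class name and the heap size assumed to equal the `--executor-memory` value exactly (the real JVM reports slightly less):

```java
/** Hypothetical sketch of Spark's unified memory sizing (Spark 1.6+ model). */
final class UnifiedMemorySketch {

    static final long RESERVED_MB = 300; // memory Spark sets aside before splitting

    /** Memory shared by execution (shuffle, joins, sorts) and storage (cache). */
    static double unifiedMb(long heapMb, double memoryFraction) {
        return (heapMb - RESERVED_MB) * memoryFraction;
    }

    /** Part of the unified region in which cached blocks are immune to eviction. */
    static double storageMb(long heapMb, double memoryFraction, double storageFraction) {
        return unifiedMb(heapMb, memoryFraction) * storageFraction;
    }

    public static void main(String[] args) {
        long heapMb = 6 * 1024;        // --executor-memory 6g, assumed fully usable
        double memoryFraction = 0.6;   // spark.memory.fraction default in Spark 2.x
        double storageFraction = 0.5;  // spark.memory.storageFraction default

        System.out.printf("unified region: %.1f MB%n", unifiedMb(heapMb, memoryFraction));
        System.out.printf("protected storage: %.1f MB%n",
                storageMb(heapMb, memoryFraction, storageFraction));
    }
}
```

If 8 tasks run concurrently in one executor (`--executor-cores 8`), they share this region, so each task may get only a few hundred MB before it has to spill.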

uj5u.com user reply:

SparkConf sc = new SparkConf().setAppName("SparkCalculateSR").set("spark.storage.memoryFraction", "0.2")
.set("spark.default.parallelism", "20")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.shuffle.consolidateFiles", "true").set("spark.reducer.maxSizeInFlight", "100m")
.set("spark.shuffle.file.buffer", "100k").set("spark.shuffle.io.maxRetries", "10")
.set("spark.shuffle.io.retryWait", "10s");
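I set these parameters, and after adding memory the job runs to completion, but it is extremely slow, unbearably so. Please give me some more pointers.

uj5u.com user reply:

The parameters you set above improve shuffle stability, which is why the job succeeded. To give the shuffle more executor memory, adjust these two settings:
num-executors 100 -- decrease this
spark.shuffle.memoryFraction -- increase this
I don't know where exactly it is slow, so I can't give specific optimization advice. Are you using hash shuffle? The consolidateFiles parameter applies to hash shuffle; try switching to sort shuffle instead. When a job is slow, the shuffle is usually where the time goes.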
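
One thing worth checking on Spark 2.1 is whether the settings discussed here still take effect. To my knowledge the hash shuffle manager was removed in Spark 2.0, so `spark.shuffle.consolidateFiles` no longer does anything, and the two `memoryFraction` settings are ignored unless `spark.memory.useLegacyMode` is `true`. The sketch below flags such keys in a settings map. The class name and the list of keys are illustrative assumptions, not a complete inventory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical checker for settings that Spark 2.x is assumed to ignore by default. */
final class LegacyConfCheck {

    // Keys assumed to have no effect on Spark 2.x unless legacy mode applies.
    static final Set<String> LEGACY_KEYS = new HashSet<>(Arrays.asList(
            "spark.storage.memoryFraction",
            "spark.shuffle.memoryFraction",
            "spark.shuffle.consolidateFiles"));

    /** Returns the keys in conf that are likely ignored, in insertion order. */
    static List<String> ignoredKeys(Map<String, String> conf) {
        boolean legacyMode = Boolean.parseBoolean(
                conf.getOrDefault("spark.memory.useLegacyMode", "false"));
        List<String> ignored = new ArrayList<>();
        for (String key : conf.keySet()) {
            // The memoryFraction keys are honored again when legacy mode is on.
            boolean revivedByLegacyMode = legacyMode && key.endsWith(".memoryFraction");
            if (LEGACY_KEYS.contains(key) && !revivedByLegacyMode) {
                ignored.add(key);
            }
        }
        return ignored;
    }

    public static void main(String[] args) {
        // A subset of the settings posted in the thread.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.storage.memoryFraction", "0.2");
        conf.put("spark.default.parallelism", "20");
        conf.put("spark.shuffle.consolidateFiles", "true");
        conf.put("spark.reducer.maxSizeInFlight", "100m");

        System.out.println("likely ignored: " + ignoredKeys(conf));
    }
}
```

If that assumption holds, tuning `spark.memory.fraction` and reducing the number of concurrent tasks per executor would be the knobs to try on 2.1, not the legacy fractions.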

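uj5u.com user reply:

I also asked this question on Zhihu and provided the source code there. Please take a look:
https://www.zhihu.com/question/57772280?guide=1

uj5u.com user reply:

Why not post the screenshots on CSDN? Switching back and forth between two sites...
I looked at the UI screenshots. It doesn't seem related to shuffle, and there is no data skew. Could it simply be that the data volume is large and resources are insufficient?
You need to work out which stage it is stuck on before you can analyze which part of the code is inefficient.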

uj5u.com user reply:

--num-executors 100 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 8 \

100 executors, each with 6G of executor-memory and 8 CPU cores: how much memory and how many CPUs would that take?

uj5u.com user reply:

Quoting javahuoshan's reply (#15):

--num-executors 100 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 8 \

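100 executors, each with 6G of executor-memory and 8 CPU cores: how much memory and how many CPUs would that take?

That is 600 GB of memory and 800 cores. The cluster's resources are nowhere near enough.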
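
The arithmetic behind that figure can be sketched as follows. The flag values come from the spark-submit command in the thread, and the available memory from the cluster description (4 nodes offering 6 GB plus 2 virtual machines offering 1 GB); the class name is hypothetical.

```java
/** Hypothetical sketch: resources requested by spark-submit vs. what the cluster offers. */
final class ResourceDemand {

    /** Total executor memory requested, in GB. */
    static long requestedMemoryGb(int numExecutors, int executorMemoryGb) {
        return (long) numExecutors * executorMemoryGb;
    }

    /** Total executor cores requested. */
    static long requestedCores(int numExecutors, int executorCores) {
        return (long) numExecutors * executorCores;
    }

    public static void main(String[] args) {
        int numExecutors = 100;   // --num-executors 100
        int executorMemoryGb = 6; // --executor-memory 6g
        int executorCores = 8;    // --executor-cores 8

        // Memory offered to Spark, per the thread: 4 nodes x 6 GB + 2 VMs x 1 GB.
        long availableMemoryGb = 4 * 6 + 2 * 1;

        System.out.println("requested memory (GB): "
                + requestedMemoryGb(numExecutors, executorMemoryGb));
        System.out.println("requested cores: "
                + requestedCores(numExecutors, executorCores));
        System.out.println("available memory (GB): " + availableMemoryGb);
    }
}
```

The request comes to 600 GB and 800 cores against roughly 26 GB of memory offered to Spark, so the resource manager can grant only a small fraction of the executors asked for.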

uj5u.com user reply:

https://blog.csdn.net/lingbo229/article/details/80914283

