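I'm new to big data processing with Spark and have recently been running some data mining experiments. One of my comparison algorithms uses FP-Growth, the classic data mining algorithm from the Spark library. My cluster has three machines: 1 master (2 cores, 4 GB memory) and 2 workers (4 cores, 8 GB memory). The dataset I'm mining is T40I10D100K.txt, produced with IBM's data generator, 14.76 MB in size. Because few items recur often in this dataset, I set the minimum support to 1%. The number of partitions is 16. During mining I got a heap overflow error, which I assume comes from the recursive construction of conditional FP-Trees in the growth step. That raises a question: my two workers together have about 12 GB of usable memory, so how can they fail to process even 15 MB of data? Is it because the support is too low, or does mining frequent itemsets with FP-Growth inevitably consume this much memory?
I'd appreciate advice from anyone with experience. Thanks.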
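Input size alone says little about how much memory frequent itemset mining needs. As a rough illustration (not a diagnosis of this particular failure): in the IBM generator's naming, T40I10D100K means an average transaction length of 40, an average maximal pattern length of 10, and 100K transactions. Every non-empty subset of a frequent itemset is itself frequent, so a single frequent pattern of length k contributes 2^k - 1 itemsets to the result. The sketch below only evaluates that count; the object and function names are made up for illustration.

```scala
object ItemsetBlowup {
  // Number of non-empty subsets of a k-item set: 2^k - 1.
  // By downward closure, each of these subsets is frequent
  // whenever the k-item set itself is frequent.
  def nonEmptySubsets(k: Int): BigInt = {
    require(k >= 0, "k must be non-negative")
    (BigInt(1) << k) - 1
  }

  def main(args: Array[String]): Unit = {
    // Pattern lengths are illustrative; 10 matches the "I10" in the dataset name.
    for (k <- Seq(5, 10, 15, 20))
      println(s"pattern length $k -> ${nonEmptySubsets(k)} frequent itemsets")
  }
}
```

For k = 10 the count is 1023, so as the support threshold drops, the result set and the conditional trees can grow far beyond the 15 MB input.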
Reply from a uj5u.com user:
This is the source code that calls FP-Growth:

import java.io.{File, PrintWriter}

import scala.collection.mutable

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

object PFP {
  def main(args: Array[String]): Unit = {
    // Dataset file name -> minimum support thresholds to try
    val texts = mutable.Map(
      // "T25I10D10K.txt" -> List(0.005, 0.004, 0.003, 0.002),
      // "mushroom.txt" -> List(0.01))
      // "chess.txt" -> List(0.4))
      // "accidents.txt" -> List(0.1))
      // "T10I4D100K.txt" -> List(0.005, 0.004, 0.003, 0.002, 0.001),
      // "T40I10D100K.txt" -> List(0.01))
      "connect-4.txt" -> List(0.3))
      // "kddcup99.txt" -> List(0.0001, 0.00009, 0.00008, 0.00007, 0.00006))
      // "USCensus.txt" -> List(0.5))
      // "connect-4.txt" -> List(0.5))
    val conf = new SparkConf().setAppName("PFP_scala")
    val sc = new SparkContext(conf)
    texts.foreach { text =>
      // One timing log file per dataset
      val writer = new PrintWriter(new File("/root/app/scala2.10/PFP/" + text._1))
      val data = sc.textFile("/usr/local/eclipsews/" + text._1)
      val startTime = System.currentTimeMillis()
      // Each line is one transaction; items are separated by spaces
      val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
      // Note: textFile and map are lazy, so this measures only the time to
      // define the RDD, not the actual file read
      val ioTime = System.currentTimeMillis() - startTime
      text._2.foreach { support =>
        // Single repetition; widen the range to repeat each run
        for (i <- 0 to 0) {
          val time1 = System.currentTimeMillis()
          val fpg = new FPGrowth()
            .setMinSupport(support)
            .setNumPartitions(16)
          val model = fpg.run(transactions)
          // Run an external helper script (its contents are not shown in this post)
          val process = java.lang.Runtime.getRuntime.exec("/root/app/scala2.10/PFP/checkHDFS.sh")
          process.waitFor()
          // Saving the result is what triggers the actual mining job
          model.freqItemsets.saveAsTextFile("/usr/local/PFP")
          val endTime = System.currentTimeMillis()
          val mineTime = endTime - time1
          // hehe.foreach { itemset => println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq) }
          val time = mineTime + ioTime
          writer.write("database: " + text._1 + " support: " + support + " iotime: " + ioTime + " mineTime " + mineTime + " time: " + time + "\n")
        }
      }
      writer.close()
    }
    sc.stop()
  }
}
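One more thing that may be worth checking, offered as an assumption rather than a confirmed cause of the error above: Spark does not give each executor the whole machine's memory by default. spark.executor.memory defaults to 1g, so unless it is raised (in SparkConf, in spark-defaults.conf, or with spark-submit --executor-memory), most of a worker's 8 GB goes unused. A minimal sketch, where 6g is an arbitrary illustrative value and PFPConfSketch is a made-up object name:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PFPConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("PFP_scala")
      // Assumption: 6g leaves some headroom on an 8 GB worker; tune for your cluster
      .set("spark.executor.memory", "6g")
    val sc = new SparkContext(conf)
    // Print the setting the application actually requested
    println(sc.getConf.get("spark.executor.memory"))
    sc.stop()
  }
}
```

Driver memory is better passed with spark-submit --driver-memory, because in client mode the driver JVM has already started by the time SparkConf is read.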
