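An earlier article, "Big Data Development: Spark Join Principles in Detail", covered how joins work in Spark. This article looks at the source code to see how join is implemented on top of cogroup.
1. Analyze the following code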
import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinDemo {
  def main(args: Array[String]): Unit = {
    // getCanonicalName ends with "$" for a Scala object; .init drops that last character
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    val random = scala.util.Random
    // 49 users, each with a random key in 0..9
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // Join without any pre-partitioning
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    println(rdd3.dependencies)

    // Join after both sides are partitioned with the same HashPartitioner
    val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    println(rdd4.dependencies)

    sc.stop()
  }
}
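Consider the code above: what does it print? Is each join a wide dependency or a narrow dependency, and why?
2. Check the run in the Spark UI
As for how stage boundaries relate to wide and narrow dependencies: section 2.1.3, "How to distinguish wide and narrow dependencies", showed that a new stage corresponds to a wide dependency, so the stage graphs of rdd3 and rdd4 tell the two cases apart. (The println output alone does not: join returns the RDD produced by flatMapValues, so rdd3.dependencies and rdd4.dependencies only report that outermost one-to-one step.) For rdd3, the join introduces a new stage, so rdd3 is produced through a wide dependency. rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3))) has a different graph: after the partitionBy calls no further stage is created, so that join is a narrow dependency.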
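3. The join source code
The conclusion so far came from the UI graphs. Now let's see how the join source code produces it (based on Spark 2.4.5).
Start at the entry method. withScope can be thought of as a decorator: so that the Spark UI can show more information, every method that creates an RDD is wrapped in it, and RDDOperationScope records the history and relationships of the RDD operations.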
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Performs a hash join across the cluster.
 */
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
  join(other, defaultPartitioner(self, other))
}
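Next, look at defaultPartitioner. It chooses between an existing partitioner on the input RDDs and a new HashPartitioner with the default number of partitions, generally favoring the larger partition count, and returns the chosen partitioner.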
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  // Keep only the RDDs that already have a partitioner
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
  // Among those, pick the RDD with the most partitions
  val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
  } else {
    None
  }
  // If spark.default.parallelism is set, use it; otherwise use the largest
  // partition count among the input RDDs
  val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
    rdd.context.defaultParallelism
  } else {
    rdds.map(_.partitions.length).max
  }
  // If the existing max partitioner is an eligible one, or its partitions number is larger
  // than the default number of partitions, use the existing partitioner.
  // In other words: reuse the existing partitioner when it passes isEligiblePartitioner
  // or has more partitions than the default; otherwise build a new HashPartitioner
  // with the default number of partitions.
  if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
      defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
  } else {
    new HashPartitioner(defaultNumPartitions)
  }
}

// Eligible when its partition count is less than one order of magnitude
// below the largest partition count among the input RDDs
private def isEligiblePartitioner(
    hasMaxPartitioner: RDD[_],
    rdds: Seq[RDD[_]]): Boolean = {
  val maxPartitions = rdds.map(_.partitions.length).max
  log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
}
} // closes the enclosing object Partitioner
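To make the selection rule above easier to follow, here is a minimal sketch in plain Scala that needs no Spark on the classpath. It is a simplified model and not Spark's code: RddInfo, Choice, ReuseExisting, NewHash and choose are hypothetical names introduced only for illustration, and an RDD is reduced to its partition count plus a flag saying whether it has a partitioner.

```scala
object DefaultPartitionerSketch {
  // Hypothetical stand-in for an RDD: its partition count and whether it
  // already carries a partitioner.
  final case class RddInfo(numPartitions: Int, hasPartitioner: Boolean)

  // Hypothetical result type: reuse an existing partitioner, or build a new
  // HashPartitioner with the given number of partitions.
  sealed trait Choice
  final case class ReuseExisting(numPartitions: Int) extends Choice
  final case class NewHash(numPartitions: Int) extends Choice

  // defaultParallelism models spark.default.parallelism (None = not set).
  def choose(rdds: Seq[RddInfo], defaultParallelism: Option[Int]): Choice = {
    val maxPartitions = rdds.map(_.numPartitions).max
    val withPartitioner = rdds.filter(_.hasPartitioner)
    val hasMax: Option[RddInfo] =
      if (withPartitioner.nonEmpty) Some(withPartitioner.maxBy(_.numPartitions)) else None
    val defaultNum = defaultParallelism.getOrElse(maxPartitions)

    // Same test as isEligiblePartitioner: less than one order of magnitude
    // below the largest partition count.
    def eligible(r: RddInfo): Boolean =
      math.log10(maxPartitions) - math.log10(r.numPartitions) < 1

    hasMax match {
      case Some(r) if eligible(r) || defaultNum < r.numPartitions =>
        ReuseExisting(r.numPartitions)
      case _ =>
        NewHash(defaultNum)
    }
  }
}
```

Under this model, two inputs that are both partitioned into 3 partitions (the rdd4 case) give ReuseExisting(3), while two inputs without a partitioner (the rdd3 case) fall through to NewHash with the largest partition count.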
Then step into the overloaded join method. It calls cogroup, which creates new CoGroupedRDD[K](Seq(self, other), partitioner).
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  // partitioner is the one chosen by the comparison above; what matters most
  // is its number of partitions
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}

/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Performs a hash join across the cluster.
 */
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
  join(other, new HashPartitioner(numPartitions))
}
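The way join is built from cogroup followed by flatMapValues can be illustrated on ordinary Scala collections. This is only a sketch of the idea, assuming in-memory Seq inputs in place of RDDs; CogroupJoinSketch and its two methods are hypothetical names and not part of Spark.

```scala
object CogroupJoinSketch {
  // Groups both sides by key and pairs up the two value lists per key.
  // A key missing on one side gets an empty list for that side.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l: Map[K, Seq[V]] = left.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val r: Map[K, Seq[W]] = right.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2)) }
    (l.keySet ++ r.keySet).map { k =>
      (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W])))
    }.toMap
  }

  // Inner join: for each key, emit every (v, w) combination. Keys with an
  // empty side produce nothing, which is what makes this an inner join.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
}
```

With col2 from the demo, key 0 appears twice ("BJ" and "CD"), so every user with key 0 is paired with both cities, while users with key 9 disappear from the result because the right side has no such key.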
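Finally, look at CoGroupedRDD, which is where the choice between a wide and a narrow dependency is made. For each parent RDD (not only the left one), if its partitioner equals the partitioner chosen above, the dependency is narrow (OneToOneDependency); otherwise it is wide (ShuffleDependency).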
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
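The per-parent decision in getDependencies can be modelled without Spark. The sketch below is an illustration under stated assumptions: HashPart is a hypothetical stand-in for HashPartitioner (a case class, so equality is by numPartitions, which matches how HashPartitioner.equals compares), and Dep, OneToOne and Shuffle are made-up names for the two dependency kinds.

```scala
object CoGroupDependencySketch {
  // Hypothetical stand-in for HashPartitioner; equal when numPartitions match.
  final case class HashPart(numPartitions: Int)

  // Hypothetical labels for the two kinds of dependency.
  sealed trait Dep
  case object OneToOne extends Dep // narrow: no shuffle needed
  case object Shuffle extends Dep  // wide: parent must be repartitioned

  // parents holds each parent's partitioner (None if it has none);
  // part is the partitioner chosen for the cogroup.
  def dependencies(parents: Seq[Option[HashPart]], part: HashPart): Seq[Dep] =
    parents.map(p => if (p == Some(part)) OneToOne else Shuffle)
}
```

In the rdd3 case neither parent has a partitioner, so both entries come out as Shuffle. In the rdd4 case both parents carry a 3-partition hash partitioner that equals the chosen one, so both come out as OneToOne. A mixed input yields one of each, which shows the check is made per parent.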
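Summary: the number of partitions can be specified when joining. If the RDDs on both sides of the join already share the same partitioning scheme and partition count, the join does not shuffle and is a narrow dependency; otherwise it shuffles and is a wide dependency. The partitioner is what captures both the partitioning scheme and the partition count.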
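Why matching partitioners remove the shuffle can be seen in a small model: when both sides are split by the same hash rule and the same partition count, equal keys always land in the same partition index, so partition i of one side only needs partition i of the other. This sketch assumes in-memory collections; CoPartitionSketch and its methods are hypothetical names, and partitionOf only mirrors the idea of hashing the key modulo the partition count.

```scala
object CoPartitionSketch {
  // Key hash modulo the partition count, kept non-negative;
  // null keys go to partition 0.
  def partitionOf(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k =>
      val raw = k.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }

  // Splits a keyed collection into numPartitions buckets.
  def partitionBy[K, V](data: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] =
    Vector.tabulate(numPartitions) { i =>
      data.filter { case (k, _) => partitionOf(k, numPartitions) == i }
    }

  // Joins bucket i of the left side only with bucket i of the right side;
  // no record ever has to move between buckets.
  def partitionWiseJoin[K, V, W](
      left: Vector[Seq[(K, V)]],
      right: Vector[Seq[(K, W)]]): Seq[(K, (V, W))] =
    left.zip(right).flatMap { case (l, r) =>
      for ((k, v) <- l; (k2, w) <- r if k == k2) yield (k, (v, w))
    }
}
```

If the two sides were split with different partition counts, a key could sit in bucket 1 on one side and bucket 0 on the other, and the bucket-by-bucket join would miss it; that is the situation in which a shuffle is required.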
Wu Xie (Xiao San Ye), a rookie knocking around in backend, big data, and artificial intelligence.