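While reading about sort-merge join, I came across the claim that it is the most preferred join in Spark after broadcast join, but only if the join keys are sortable. My question is: when would a join key not be sortable? Any data type can be sorted. Can you help me understand a scenario in which a key might not be sortable?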
Answer from a uj5u.com user:
See https://www.waitingforcode.com/apache-spark-sql/sort-merge-join-spark-sql/read (an excellent site).
Not all types are sortable. CalendarIntervalType is one example.
Quote:
"for not sortable keys the sort merge join" should "not be used" in {
import sparkSession.implicits._
// Here we explicitly define the schema. Thanks to that we can show
// the case when sort-merge join won't be used, i.e. when the key is not sortable
// (there are other cases - when broadcast or shuffle joins can be chosen over sort-merge
// but it's not shown here).
// Globally, a "sortable" data type is:
// - NullType, one of AtomicType
// - StructType having all fields sortable
// - ArrayType typed to sortable field
// - User Defined DataType backed by a sortable field
// The method checking sortability is org.apache.spark.sql.catalyst.expressions.RowOrdering.isOrderable
// As you see, CalendarIntervalType is not included in any of above points,
// so even if the data structure is the same (id login for customers, id customer id amount for orders)
// with exactly the same number of rows, the sort-merge join won't be applied here.
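The rule listed in the quoted comments can be sketched as a small recursive check. This is only a toy model in plain Scala, assuming made-up types (`ToyInt`, `ToyStruct`, and so on) that stand in for Spark's data types; it is not the code behind `RowOrdering.isOrderable`, just an illustration of how one non-sortable field makes the whole key non-sortable.

```scala
// Toy model of the sortability rule from the quoted comments.
// All Toy* types are hypothetical stand-ins, NOT Spark's
// org.apache.spark.sql.types classes.
object SortabilitySketch {
  sealed trait ToyType
  case object ToyNull extends ToyType
  case object ToyInt extends ToyType               // stands in for an atomic type
  case object ToyString extends ToyType            // another atomic type
  case object ToyCalendarInterval extends ToyType  // matches none of the four rules
  final case class ToyArray(element: ToyType) extends ToyType
  final case class ToyStruct(fields: List[ToyType]) extends ToyType
  final case class ToyUdt(backing: ToyType) extends ToyType

  // Null and atomic types are orderable; containers and user-defined
  // types are orderable only if everything they wrap is orderable.
  def isOrderable(t: ToyType): Boolean = t match {
    case ToyNull | ToyInt | ToyString => true
    case ToyArray(element)            => isOrderable(element)
    case ToyStruct(fields)            => fields.forall(isOrderable)
    case ToyUdt(backing)              => isOrderable(backing)
    case ToyCalendarInterval          => false
  }
}
```

For example, `SortabilitySketch.isOrderable(ToyStruct(List(ToyInt, ToyCalendarInterval)))` is `false` in this model, because a single interval field is enough to disqualify the struct.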
This is an old post; as of v3, comparisons can be made. See https://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/types/CalendarIntervalType.html
But it still illustrates the point.
Also, what about non-equi joins?
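To make both requirements concrete, here is a minimal sketch of the sort-merge idea in plain Scala. It is not Spark's implementation, and `sortMergeJoin` is a hypothetical helper written only for illustration. The sort step needs an `Ordering` on the key, and the merge step pairs rows only when their keys compare as equal, which is why a non-equi condition does not fit this strategy.

```scala
// Minimal in-memory inner equi-join using the sort-merge idea.
// Hypothetical helper for illustration; not how Spark implements it.
object SortMergeJoinSketch {
  def sortMergeJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)])(
      implicit ord: Ordering[K]): Vector[(K, A, B)] = {
    // Sort phase: impossible without an Ordering on the key type.
    val l = left.sortBy(_._1).toVector
    val r = right.sortBy(_._1).toVector
    val out = Vector.newBuilder[(K, A, B)]
    var i = 0
    var j = 0
    // Merge phase: advance whichever side currently has the smaller key.
    while (i < l.length && j < r.length) {
      val c = ord.compare(l(i)._1, r(j)._1)
      if (c < 0) i += 1
      else if (c > 0) j += 1
      else {
        val key = l(i)._1
        // Find the end of the run of equal keys on the right side.
        var jEnd = j
        while (jEnd < r.length && ord.equiv(r(jEnd)._1, key)) jEnd += 1
        // Pair every left row in the run with every right row in the run.
        while (i < l.length && ord.equiv(l(i)._1, key)) {
          var k = j
          while (k < jEnd) {
            out += ((key, l(i)._2, r(k)._2))
            k += 1
          }
          i += 1
        }
        j = jEnd
      }
    }
    out.result()
  }
}
```

Calling `SortMergeJoinSketch.sortMergeJoin(Seq(2 -> "b", 1 -> "a"), Seq(1 -> "z", 3 -> "y"))` compiles because `Int` has an implicit `Ordering`; a key type without one would be rejected at compile time, which loosely mirrors Spark refusing sort-merge join for non-sortable keys.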
