I have a Delta table A, as shown below.
| point | cluster | points_in_cluster |
|---|---|---|
| 37 | 1 | [37,32] |
| 45 | 2 | [45,67,84] |
| 67 | 2 | [45,67,84] |
| 84 | 2 | [45,67,84] |
| 32 | 1 | [37,32] |
I also have a table B, as shown below.
| id | point |
|---|---|
| 101 | 37 |
| 102 | 67 |
| 103 | 84 |
I want a query like the one below. Clearly, `in` does not work against an array column here. What is the correct syntax?
select b.id, a.point
from A a, B b
where b.point in a.points_in_cluster
As a result, I should get a table like the following:
| id | point |
|---|---|
| 101 | 37 |
| 101 | 32 |
| 102 | 45 |
| 102 | 67 |
| 102 | 84 |
| 103 | 45 |
| 103 | 67 |
| 103 | 84 |
Answer from a uj5u.com user:
Based on your sample data, I would do an equi-join on the `point` column and then an `explode` on `points_in_cluster`:
from pyspark.sql import functions as F
# assuming A is df_A and B is df_B
df_A.join(
    df_B,
    on="point"
).select(
    "id",
    F.explode("points_in_cluster").alias("point")
)
Otherwise, you can use `array_contains`:
select b.id, a.point
from A a, B b
where array_contains(a.points_in_cluster, b.point)
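To check that both suggestions yield the table the question asks for, here is a minimal plain-Python sketch that mimics their row logic on the sample data, with no Spark involved. The lists of dicts stand in for tables A and B, and the helper names `join_then_explode` and `join_on_membership` are hypothetical, chosen only for this illustration.

```python
# Sample rows copied from the question's tables A and B.
table_a = [
    {"point": 37, "cluster": 1, "points_in_cluster": [37, 32]},
    {"point": 45, "cluster": 2, "points_in_cluster": [45, 67, 84]},
    {"point": 67, "cluster": 2, "points_in_cluster": [45, 67, 84]},
    {"point": 84, "cluster": 2, "points_in_cluster": [45, 67, 84]},
    {"point": 32, "cluster": 1, "points_in_cluster": [37, 32]},
]
table_b = [
    {"id": 101, "point": 37},
    {"id": 102, "point": 67},
    {"id": 103, "point": 84},
]


def join_then_explode(a_rows, b_rows):
    """Equi-join on point, then emit one row per element of points_in_cluster."""
    return [
        (b["id"], member)
        for a in a_rows
        for b in b_rows
        if a["point"] == b["point"]
        for member in a["points_in_cluster"]
    ]


def join_on_membership(a_rows, b_rows):
    """Keep each (B row, A row) pair where B's point is in A's array,
    which is the condition array_contains(a.points_in_cluster, b.point) expresses."""
    return [
        (b["id"], a["point"])
        for a in a_rows
        for b in b_rows
        if b["point"] in a["points_in_cluster"]
    ]


# Sort both results so they can be compared regardless of row order.
exploded = sorted(join_then_explode(table_a, table_b))
membership = sorted(join_on_membership(table_a, table_b))
```

The two results agree on this sample because every cluster member has its own row in A carrying the full member list. If B contained a point that appears inside some `points_in_cluster` array but has no row of its own in A, the equi-join variant would drop it while the membership variant would still match it.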
