我想加入兩個資料集,如下所示:
資料集 1:
PIN LOCATION
1234 Germany
2356 Poland
2894 England
3452 Bloomberg
資料集 2:
MAIL STARTLOCATION ENDLOCATION
[email protected] 1234 2894
[email protected] 2356 1234
[email protected] 3452 2894
輸出應該是:
MAIL STARTLOCATION ENDLOCATION LOCATION1 LOCATION2
[email protected] 1234 2894 Germany England
[email protected] 2356 1234 poland Germany
[email protected] 3452 2894 Bloomberg England
嘗試過:
condi = [((df1.PIN == df2.STARTLOCATION) | (df1.PIN == df2.ENDLOCATION))]
joindata = df1.join(df2, on = condi, how = 'outer').select('*')
但它給 NULL中LOCATION1和LOCATION2
uj5u.com熱心網友回復:
您必須將它加入兩次才能獲得所需的結果,因為您期望結果中有兩個額外的列。
df1.show()
---- ---------
| PIN| LOCATION|
---- ---------
|1234| Germany|
|2356| Poland|
|2894| England|
|3452|Bloomberg|
---- ---------
df2.show()
------------- ------------- -----------
| MAIL|STARTLOCATION|ENDLOCATION|
------------- ------------- -----------
| [email protected]| 1234| 2894|
| [email protected]| 2356| 1234|
|[email protected]| 3452| 2894|
------------- ------------- -----------
df2.join(df1, df2.STARTLOCATION==df1.PIN)\
.withColumnRenamed("LOCATION", "LOCATION1")\
.drop("PIN")\
.join(df1, df2.ENDLOCATION==df1.PIN)\
.withColumnRenamed("LOCATION", "LOCATION2")\
.drop("PIN")\
.show()
------------- ------------- ----------- --------- ---------
| MAIL|STARTLOCATION|ENDLOCATION|LOCATION1|LOCATION2|
------------- ------------- ----------- --------- ---------
|[email protected]| 3452| 2894|Bloomberg| England|
| [email protected]| 1234| 2894| Germany| England|
| [email protected]| 2356| 1234| Poland| Germany|
------------- ------------- ----------- --------- ---------
uj5u.com熱心網友回復:
如果你不介意使用sql陳述句來實作,你可以嘗試:
df1.createOrReplaceTempView('tmp1')
df2.createOrReplaceTempView('tmp2')
sql = """
select a.MAIL,a.STARTLOCATION,a.ENDLOCATION,b.LOCATION as LOCATION1,c.LOCATION as LOCATION2
from tmp2 a join tmp1 b on a.STARTLOCATION=b.PIN
join tmp1 c on a.ENDLOCATION=c.PIN
"""
df = spark.sql(sql)
df.show(truncate=False)
轉載請註明出處,本文鏈接:https://www.uj5u.com/caozuo/402985.html
標籤:
