加入pysprak資料以期望輸出-有解無憂

我想加入兩個資料集，如下所示：

資料集 1：

PIN           LOCATION
1234          Germany
2356          Poland
2894          England
3452          Bloomberg

資料集 2：

MAIL           STARTLOCATION      ENDLOCATION
[email protected]        1234             2894
[email protected]        2356             1234
[email protected]       3452             2894

輸出應該是：

MAIL           STARTLOCATION      ENDLOCATION  LOCATION1       LOCATION2
[email protected]        1234             2894       Germany         England
[email protected]        2356             1234       poland          Germany
[email protected]       3452             2894      Bloomberg        England

嘗試過：

condi = [((df1.PIN == df2.STARTLOCATION) | (df1.PIN == df2.ENDLOCATION))]

joindata = df1.join(df2, on = condi, how = 'outer').select('*')

但它給 NULL中LOCATION1和LOCATION2

uj5u.com熱心網友回復：

您必須將它加入兩次才能獲得所需的結果，因為您期望結果中有兩個額外的列。

df1.show()
 ---- --------- 
| PIN| LOCATION|
 ---- --------- 
|1234|  Germany|
|2356|   Poland|
|2894|  England|
|3452|Bloomberg|
 ---- --------- 

df2.show()
 ------------- ------------- ----------- 
|         MAIL|STARTLOCATION|ENDLOCATION|
 ------------- ------------- ----------- 
| [email protected]|         1234|       2894|
| [email protected]|         2356|       1234|
|[email protected]|         3452|       2894|
 ------------- ------------- ----------- 

df2.join(df1, df2.STARTLOCATION==df1.PIN)\
    .withColumnRenamed("LOCATION", "LOCATION1")\
    .drop("PIN")\
    .join(df1, df2.ENDLOCATION==df1.PIN)\
    .withColumnRenamed("LOCATION", "LOCATION2")\
    .drop("PIN")\
    .show()
 ------------- ------------- ----------- --------- --------- 
|         MAIL|STARTLOCATION|ENDLOCATION|LOCATION1|LOCATION2|
 ------------- ------------- ----------- --------- --------- 
|[email protected]|         3452|       2894|Bloomberg|  England|
| [email protected]|         1234|       2894|  Germany|  England|
| [email protected]|         2356|       1234|   Poland|  Germany|
 ------------- ------------- ----------- --------- ---------

uj5u.com熱心網友回復：

如果你不介意使用sql陳述句來實作，你可以嘗試：

df1.createOrReplaceTempView('tmp1')
df2.createOrReplaceTempView('tmp2')
sql = """
    select a.MAIL,a.STARTLOCATION,a.ENDLOCATION,b.LOCATION as LOCATION1,c.LOCATION as LOCATION2
    from tmp2 a join tmp1 b on a.STARTLOCATION=b.PIN
        join tmp1 c on a.ENDLOCATION=c.PIN
"""
df = spark.sql(sql)
df.show(truncate=False)

轉載請註明出處，本文鏈接：https://www.uj5u.com/caozuo/402985.html

標籤：

上一篇：將json回應轉換為python中的格式化表

下一篇：列舉：設定成員值等于他們的名字，不好的做法？