How to select columns from a list in PySpark

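Suppose we have two DataFrames, df1 and df2, where df1 has columns [a, b, c, p, q, r] and df2 has columns [d, e, f, a, b, c]. Suppose the common columns are stored in a list, common_cols = ['a', 'b', 'c'].

How can the two DataFrames be joined inside a SQL command using the common_cols list? The code below attempts to do this.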

common_cols = ['a', 'b', 'c']
filter_df = spark.sql("""
    select * from df1 inner join df2
    on df1.common_cols = df2.common_cols
""")

A uj5u.com user replied:

Demo setup

df1 = spark.createDataFrame([(1,2,3,4,5,6)],['a','b','c','p','q','r'])
df2 = spark.createDataFrame([(7,8,9,1,2,3)],['d','e','f','a','b','c'])
common_cols = ['a','b','c']

df1.show()

+---+---+---+---+---+---+
|  a|  b|  c|  p|  q|  r|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+


df2.show()

+---+---+---+---+---+---+
|  d|  e|  f|  a|  b|  c|
+---+---+---+---+---+---+
|  7|  8|  9|  1|  2|  3|
+---+---+---+---+---+---+

Solution, based on using (SQL syntax for join)

df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
common_cols_csv = ','.join(common_cols)

query = f'''\
select  * 
from    df1 inner join df2 using ({common_cols_csv})
'''

print(query)

select  * 
from    df1 inner join df2 using (a,b,c)

filter_df = spark.sql(query)

filter_df.show()

+---+---+---+---+---+---+---+---+---+
|  a|  b|  c|  p|  q|  r|  d|  e|  f|
+---+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|  9|
+---+---+---+---+---+---+---+---+---+
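
For comparison, the on form that the question was reaching for can also be generated from the list. The following is only a minimal sketch: build_on_clause is a hypothetical helper name (not part of PySpark), and the view names df1 and df2 are assumed to be registered as in the demo setup. It shows just the string building, so it runs without a Spark session.

```python
def build_on_clause(left, right, cols):
    """Return 'left.a = right.a and left.b = right.b ...' for the given columns."""
    if not cols:
        raise ValueError("at least one join column is required")
    # One equality per common column, combined with 'and'
    return " and ".join(f"{left}.{c} = {right}.{c}" for c in cols)


common_cols = ['a', 'b', 'c']
on_clause = build_on_clause('df1', 'df2', common_cols)
query = f"select * from df1 inner join df2 on {on_clause}"
print(query)
```

Passing this query to spark.sql should match the same row, but with on the result keeps a, b and c from both sides, whereas using returns each join column only once.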

A uj5u.com user replied:

You can do this with using instead of on. See the documentation.

common_cols = ['a', 'b', 'c']

spark.sql(
    f'''
    SELECT *
    FROM
    (SELECT 1 a, 2 b, 3 c, 10 val1)
    JOIN
    (SELECT 1 a, 2 b, 3 c, 20 val2)
    USING ({','.join(common_cols)})
    '''
).show()

+---+---+---+----+----+
|  a|  b|  c|val1|val2|
+---+---+---+----+----+
|  1|  2|  3|  10|  20|
+---+---+---+----+----+
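
Because the column names are interpolated directly into the SQL text, a name containing a space or other special character would break the query. One possible guard, offered here as an assumption rather than something taken from the answers, is to wrap each name in backticks (Spark SQL's identifier quoting) before joining them. quote_ident is a hypothetical helper, and the column name 'order id' is made up for illustration.

```python
def quote_ident(name):
    """Wrap a column name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


# 'order id' is an invented column name with a space, for illustration only
common_cols = ['a', 'b', 'order id']
using_list = ', '.join(quote_ident(c) for c in common_cols)
print(using_list)
```

The resulting string can be placed inside USING (...) in the same way as ','.join(common_cols).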

A uj5u.com user replied:

Adding to @David Markovitz's answer: to get the columns dynamically, you can do the following.

Input data

df1 = spark.createDataFrame([(1,2,3,4,5,6)],['a','b','c','p','q','r'])
df2 = spark.createDataFrame([(7,8,9,1,2,3)],['d','e','f','a','b','c'])

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

Find the common columns using set intersection

common_cols = set(df1.columns).intersection(set(df2.columns))
print(common_cols)

{'a', 'b', 'c'}

Create the query string:

query = '''
select  * 
from    df1 inner join df2 using ({common_cols})
'''.format(common_cols=', '.join(map(str, common_cols)))

print(query)

select  * 
from    df1 inner join df2 using (a, b, c)

Finally, run the query with spark.sql:

spark.sql(query).show()

+---+---+---+---+---+---+---+---+---+
|  a|  b|  c|  p|  q|  r|  d|  e|  f|
+---+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|  9|
+---+---+---+---+---+---+---+---+---+
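
A Python set has no guaranteed iteration order, so the order of names in the generated using list may differ from run to run. That does not change which rows match, but it may change the order of the join columns in the result. If a stable order matters, a list comprehension over the first DataFrame's columns is one option. In this sketch, plain lists stand in for df1.columns and df2.columns so that it runs without Spark; the variable names df1_columns, df2_columns and df2_lookup are illustrative.

```python
df1_columns = ['a', 'b', 'c', 'p', 'q', 'r']  # stand-in for df1.columns
df2_columns = ['d', 'e', 'f', 'a', 'b', 'c']  # stand-in for df2.columns

# Keep df1's column order; the set is only used for fast membership checks
df2_lookup = set(df2_columns)
common_cols = [c for c in df1_columns if c in df2_lookup]
print(common_cols)

query = '''
select  *
from    df1 inner join df2 using ({common_cols})
'''.format(common_cols=', '.join(common_cols))
print(query)
```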
