I have a DataFrame:
+---+---+---+------+
| id|foo|bar|rownum|
+---+---+---+------+
|  1|123|123|     1|
|  2|000|236|     1|
|  2|236|236|     2|
|  2|000|236|     3|
|  3|333|234|     1|
|  3|444|444|     2|
+---+---+---+------+
I want to add a column match that holds, for every row of an id, the rownum of the row where foo == bar, for example:
+---+---+---+------+-----+
| id|foo|bar|rownum|match|
+---+---+---+------+-----+
|  A|123|123|     1|    1|
|  B|000|236|     1|    2|
|  B|236|236|     2|    2|
|  B|000|236|     3|    2|
|  R|333|234|     1|    2|
|  R|444|444|     2|    2|
+---+---+---+------+-----+
I tried this:
df_grp2 = df_grp2.withColumn('match', F.when(F.col('foo') == F.col('bar'), F.col('rownum')))
Answer:
Try using a window function.
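from pyspark.sql import SparkSession, functions as F, Window as W

# Reuse the active session if there is one, otherwise create it
spark = SparkSession.builder.getOrCreate()

df_grp2 = spark.createDataFrame(
    [(1, '123', '123', 1),
     (2, '000', '236', 1),
     (2, '236', '236', 2),
     (2, '000', '236', 3),
     (3, '333', '234', 1),
     (3, '444', '444', 2)],
    ['id', 'foo', 'bar', 'rownum']
)

# when() yields rownum where foo == bar and null elsewhere;
# first(..., ignorenulls=True) over the id partition copies that value to every row of the id
df_grp2 = df_grp2.withColumn(
    'match',
    F.first(F.when(F.col('foo') == F.col('bar'), F.col('rownum')), ignorenulls=True).over(W.partitionBy('id'))
)
df_grp2.show()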
# +---+---+---+------+-----+
# | id|foo|bar|rownum|match|
# +---+---+---+------+-----+
# |  1|123|123|     1|    1|
# |  2|000|236|     1|    2|
# |  2|236|236|     2|    2|
# |  2|000|236|     3|    2|
# |  3|333|234|     1|    2|
# |  3|444|444|     2|    2|
# +---+---+---+------+-----+
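To make it clearer what the window expression above computes, here is a minimal plain-Python sketch of the same idea, with no Spark required. The helper name add_match is hypothetical and only for illustration. One assumption is labeled in the code: when several rows of an id match, the sketch takes the smallest matching rownum, whereas first() over a window without an orderBy does not guarantee which matching row it sees first.

```python
from collections import defaultdict

# Same sample rows as above: (id, foo, bar, rownum)
rows = [
    (1, '123', '123', 1),
    (2, '000', '236', 1),
    (2, '236', '236', 2),
    (2, '000', '236', 3),
    (3, '333', '234', 1),
    (3, '444', '444', 2),
]

def add_match(rows):
    """Hypothetical helper: append the matching rownum of each id group to every row."""
    # Step 1 (the F.when part): keep rownum only for rows where foo == bar.
    matches = defaultdict(list)
    for id_, foo, bar, rownum in rows:
        if foo == bar:
            matches[id_].append(rownum)
    # Step 2 (the window part): give every row of the group the same value.
    # Assumption: min() is used here for determinism; None if the group has no match.
    return [
        (id_, foo, bar, rownum, min(matches[id_]) if matches[id_] else None)
        for id_, foo, bar, rownum in rows
    ]

result = add_match(rows)
for row in result:
    print(row)
```

If the same deterministic behavior is wanted in PySpark, replacing F.first(..., ignorenulls=True) with F.min(...) over the same partition should be one way to get it, since min ignores nulls.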
Answer:
Use this (note that it is pandas syntax, not PySpark):
df['match'] = df.loc[df['foo'] == df['bar'], 'rownum']
However, rows where foo and bar do not match get NaN:
+---+---+---+------+-----+
| id|foo|bar|rownum|match|
+---+---+---+------+-----+
|  A|123|123|     1|    1|
|  B|000|236|     1|  NaN|
|  B|236|236|     2|    2|
|  B|000|236|     3|  NaN|
|  R|333|234|     1|  NaN|
|  R|444|444|     2|    2|
+---+---+---+------+-----+
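The NaN values in the table above differ from what the question asked for, which is the same match value on every row of an id. A possible way to get there in pandas is sketched below, assuming pandas is installed and that the data lives in a pandas DataFrame rather than a Spark one. The intermediate name matched is only for illustration.

```python
import pandas as pd

# Same sample data as in the question, as a pandas DataFrame
df = pd.DataFrame({
    'id': [1, 2, 2, 2, 3, 3],
    'foo': ['123', '000', '236', '000', '333', '444'],
    'bar': ['123', '236', '236', '236', '234', '444'],
    'rownum': [1, 1, 2, 3, 1, 2],
})

# Keep rownum where foo == bar, NaN elsewhere
matched = df['rownum'].where(df['foo'] == df['bar'])

# Broadcast the first non-null value of each id group to all rows of that group
df['match'] = matched.groupby(df['id']).transform('first')
print(df)
```

Because NaN forces the intermediate column to float, match comes out as a float column here. Casting it back to an integer type is left out of the sketch.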
