I have the following PySpark dataframe:
import pandas as pd
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
df = pd.DataFrame(dict(dcode=a, zone=b))
dcode zone
0 480s North
1 480s West
2 499s East
3 499s North
4 650s East
5 650s North
6 702s North
7 702s West
8 736s North
9 736s South
10 736s West
11 737s North
12 737s West
I want my dataframe to look like this:
dcode zone output
0 480s North NW
1 480s West NW
2 499s East
3 499s North NW
4 650s East
5 650s North NW
6 702s North
7 702s West
8 736s North
9 736s South
10 736s West
11 737s North
12 737s West
For this I am using the following logic, but it does not give the desired result.
df_ = df.withColumn("output", F.when((F.col("Zone") == "North") | (F.col("Zone") == "West") & (F.col("dcode") != "702s") | (F.col("dcode") != "736s") | (F.col("dcode") != "737s"), "NW"))
I want NW in the output column only when zone is North or West and dcode is not one of 736s, 737s, 702s.
A uj5u.com user replied:
Consider first converting your pandas df to a Spark one, since you are using PySpark syntax. Then I suggest rewriting your code more concisely and clearly with isin:
from pyspark.sql import functions as F
df = spark.createDataFrame(df)
df_ = df.withColumn("output", F.when(
(F.col("Zone").isin("North","West")) & (~F.col("dcode").isin('736s','737s','702s')
),"NW").otherwise(""))
>>> df_.show(truncate=False)
+-----+-----+------+
|dcode|zone |output|
+-----+-----+------+
|480s |North|NW    |
|480s |West |NW    |
|499s |East |      |
|499s |North|NW    |
|650s |East |      |
|650s |North|NW    |
|702s |North|      |
|702s |West |      |
|736s |North|      |
|736s |South|      |
|736s |West |      |
|737s |North|      |
|737s |West |      |
+-----+-----+------+
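If you would rather stay in pandas and skip the conversion, the same isin idea carries over. This is only a sketch under that assumption (it is not PySpark); the intermediate names zone_ok and dcode_ok are made up for illustration.

```python
import pandas as pd

a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
df = pd.DataFrame(dict(dcode=a, zone=b))

# True where the zone is North or West
zone_ok = df["zone"].isin(["North", "West"])
# True where dcode is NOT one of the excluded codes
dcode_ok = ~df["dcode"].isin(["736s", "737s", "702s"])

# Start with an empty string everywhere, then mark only the matching rows
df["output"] = ""
df.loc[zone_ok & dcode_ok, "output"] = "NW"
print(df)
```

Both masks are built first and combined once with &, so there is no precedence question left to get wrong.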
A uj5u.com user replied:
You can use a SQL-style expression directly (the expr function).
import pyspark.sql.functions as F
......
df = df.withColumn('output', F.expr("case when zone in ('North', 'West') and dcode not in ('736s', '737s', '702s') then 'NW' end"))
......
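If the zone and dcode lists live in Python variables rather than in a hard-coded string, the expression text can be assembled first and then handed to F.expr. A minimal sketch: build_nw_expr is a hypothetical helper, not part of PySpark, and it assumes the values are plain strings without quotes or backslashes.

```python
def build_nw_expr(zones, excluded_dcodes, label="NW"):
    """Assemble the CASE expression text (hypothetical helper, not a PySpark API)."""
    def quote(value):
        # Assumption: values are simple tokens; refuse anything that would need escaping
        text = str(value)
        if "'" in text or "\\" in text:
            raise ValueError(f"unsupported character in value: {text!r}")
        return "'" + text + "'"

    zone_list = ", ".join(quote(z) for z in zones)
    dcode_list = ", ".join(quote(d) for d in excluded_dcodes)
    return (
        f"case when zone in ({zone_list}) "
        f"and dcode not in ({dcode_list}) "
        f"then {quote(label)} end"
    )

expr_text = build_nw_expr(["North", "West"], ["736s", "737s", "702s"])
print(expr_text)
```

The resulting string can then be passed along as df.withColumn('output', F.expr(expr_text)). Because there is no else branch, rows that do not match come back as null rather than as an empty string.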
A uj5u.com user replied:
Just check your parentheses.
By the way, df = pd.DataFrame(dict(dcode=a, zone=b)) is not PySpark.
from pyspark.sql import functions as F
import pandas as pd
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
df = pd.DataFrame(dict(dcode=a, zone=b))
df_ = spark.createDataFrame(df)
df_ = df_.withColumn("output", F.when(
    # Parenthesize the zone test so that | is evaluated before the outer &
    ((F.col("Zone") == "North") | (F.col("Zone") == "West"))
    # Join the dcode tests with &: joined with |, they would be true for every row
    & ((F.col("dcode") != "702s") & (F.col("dcode") != "736s") & (F.col("dcode") != "737s")),
    "NW"))
df_.show()
+-----+-----+------+
|dcode| zone|output|
+-----+-----+------+
| 480s|North|    NW|
| 480s| West|    NW|
| 499s| East|  null|
| 499s|North|    NW|
| 650s| East|  null|
| 650s|North|    NW|
| 702s|North|  null|
| 702s| West|  null|
| 736s|North|  null|
| 736s|South|  null|
| 736s| West|  null|
| 737s|North|  null|
| 737s| West|  null|
+-----+-----+------+
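Two things about the condition in this thread can be checked in plain Python, with no Spark involved. First, a chain of != tests joined with "or" is true for any single value, because a value can equal at most one of the codes. Second, & binds tighter than |, and PySpark columns reuse those Python operators. The function names below are made up for illustration.

```python
def chained_or(dcode):
    # Mirrors (dcode != "702s") | (dcode != "736s") | (dcode != "737s")
    return dcode != "702s" or dcode != "736s" or dcode != "737s"

def chained_and(dcode):
    # Mirrors the same tests joined with & (equivalent to "not in the list")
    return dcode != "702s" and dcode != "736s" and dcode != "737s"

for code in ["480s", "702s", "736s", "737s"]:
    # chained_or is True for every code, including the excluded ones
    print(code, chained_or(code), chained_and(code))

# Without parentheses, & is applied first
print(True | False & False)    # parsed as True | (False & False)
print((True | False) & False)  # the explicit grouping gives a different answer
```

So the "not one of these codes" part has to be written with & (or with ~isin), and the North/West test has to be parenthesized before it is combined with the rest.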
Please credit the source when reposting. Link to this article: https://www.uj5u.com/qukuanlian/345665.html
