I have a Spark DataFrame like this:
data = [
["A", "false", "B", "true", "C", "false", "D", "false"],
["A", "false", "B", "false", "D", "true", "C", "false"],
["A", "false", "B", "false", "C", "false", "D", "false"],
["A", "true", "C", "true", "B", "false", "D", "false"]
]
columns = ["Label_1_name", "Label_1_value", "Label_2_name", "Label_2_value", "Label_3_name", "Label_3_value", "Label_4_name", "Label_4_value"]
df = spark.createDataFrame(data, columns)
df.show()
+------------+-------------+------------+-------------+------------+-------------+------------+-------------+
|Label_1_name|Label_1_value|Label_2_name|Label_2_value|Label_3_name|Label_3_value|Label_4_name|Label_4_value|
+------------+-------------+------------+-------------+------------+-------------+------------+-------------+
|           A|        false|           B|         true|           C|        false|           D|        false|
|           A|        false|           B|        false|           D|         true|           C|        false|
|           A|        false|           B|        false|           C|        false|           D|        false|
|           A|         true|           C|         true|           B|        false|           D|        false|
+------------+-------------+------------+-------------+------------+-------------+------------+-------------+
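My goal is to convert this DataFrame into one with only 4 columns, named "A", "B", "C" and "D", whose values are 0 (for false) or 1 (for true), depending on the value paired with that label in each row.
The problem is that the data is dirty: "Label_1" does not always correspond to column "A", and likewise "Label_4" does not always correspond to column "D".
This is the expected Spark DataFrame output: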
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  0|  1|  0|  0|
|  0|  0|  0|  1|
|  0|  0|  0|  0|
|  1|  0|  1|  0|
+---+---+---+---+
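To pin down the mapping I am after, here is a plain-Python sketch of the per-row logic (no Spark involved). The helper name `row_to_flags` and the `LABELS` constant are made up for illustration, and treating a label that is missing from a row as 0 is my own assumption, since the sample data never omits a label.

```python
# Plain-Python reference for the per-row logic (no Spark required).
LABELS = ["A", "B", "C", "D"]  # target column names from the question


def row_to_flags(row, labels=LABELS):
    """Turn [name1, value1, name2, value2, ...] into {label: 0 or 1}."""
    # Pair every name (even positions) with the value that follows it.
    lookup = dict(zip(row[0::2], row[1::2]))
    # A label is 1 only when its paired value is the string "true";
    # a label absent from the row falls back to 0 (assumption).
    return {label: int(lookup.get(label) == "true") for label in labels}


data = [
    ["A", "false", "B", "true", "C", "false", "D", "false"],
    ["A", "false", "B", "false", "D", "true", "C", "false"],
    ["A", "false", "B", "false", "C", "false", "D", "false"],
    ["A", "true", "C", "true", "B", "false", "D", "false"],
]

for row in data:
    print(row_to_flags(row))
```

This is only meant as a reference to check a Spark solution against; it does not scale the way a DataFrame transformation would.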
Answer:
Hi, I think this should do the trick:
from pyspark.sql import functions as f, DataFrame

# Name/value columns in the order create_map expects: key, value, key, value, ...
label_value_list = ["Label_1_name", "Label_1_value", "Label_2_name", "Label_2_value",
                    "Label_3_name", "Label_3_value", "Label_4_name", "Label_4_value"]

def create_map(input_df: DataFrame, label_column_list):
    # Build one map column per row: {label name -> label value}
    return input_df.withColumn("combMap", f.create_map(label_column_list))

def create_cols(letter_list):
    # Look up each label in the map and convert "true"/"false" to 1/0
    for letter in letter_list:
        yield f.col(f"combMap.{letter}").cast("boolean").cast("int").alias(letter)

df_with_map = create_map(df, label_value_list)
final_cols = list(create_cols(["A", "B", "C", "D"]))
final_df = df_with_map.select(final_cols)
final_df.show(truncate=False)
Output:
+---+---+---+---+
|A  |B  |C  |D  |
+---+---+---+---+
|0  |1  |0  |0  |
|0  |0  |0  |1  |
|0  |0  |0  |0  |
|1  |0  |1  |0  |
+---+---+---+---+
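Keep in mind that this approach does not guarantee row order. If for some reason you need A: 1,0,0,0 rather than A: 0,0,0,1, you will need an explicit ordering column.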
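As a possible alternative to the map column, the same lookup can be written as one CASE expression per target label and passed to `selectExpr`. The sketch below only builds the Spark SQL expression strings, so it runs without a Spark session. The helper name `build_label_exprs` is hypothetical, it assumes the columns follow the `Label_<i>_name` / `Label_<i>_value` pattern from the question, and it assumes the labels are simple trusted identifiers because they are spliced into the SQL text. Unlike the map version, a label that is absent from a row comes out as 0 instead of null, which is my own choice rather than something the question specifies.

```python
def build_label_exprs(labels, n_pairs):
    """Return one Spark SQL expression string per target label."""
    exprs = []
    for label in labels:
        # One WHEN branch per (name, value) column pair.
        branches = " ".join(
            f"WHEN Label_{i}_name = '{label}' THEN Label_{i}_value"
            for i in range(1, n_pairs + 1)
        )
        # "true"/"false" string -> boolean -> 1/0; no matching branch gives NULL,
        # which COALESCE turns into 0.
        exprs.append(
            f"COALESCE(CAST(CAST(CASE {branches} END AS BOOLEAN) AS INT), 0) AS {label}"
        )
    return exprs


exprs = build_label_exprs(["A", "B", "C", "D"], n_pairs=4)
print(exprs[0])

# With the question's DataFrame `df` in scope, the expressions would be applied as:
# final_df = df.selectExpr(*exprs)
```

This avoids building a map at all, which might be useful if the name columns can contain nulls or repeated labels, since map keys tend to be stricter about both.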
