I have a DataFrame like this:
+-------+-----+--------------------------------+
| Name  | Age | Answers                        |
+-------+-----+--------------------------------+
| Maria | 23  | [apple, mango, orange, banana] |
| John  | 55  | [apple, orange, banana]        |
| Brad  | 44  | [banana]                       |
| Alex  | 55  | [apple, mango, orange, banana] |
+-------+-----+--------------------------------+
The "Answers" column contains an array of elements.
My expected output:
+-----+---+------+-----+
| Name|Age|answer|value|
+-----+---+------+-----+
|Maria| 23| apple| True|
|Maria| 23| mango| True|
|Maria| 23|orange| True|
|Maria| 23|banana| True|
| John| 55| apple| True|
| John| 55| mango|False|
| John| 55|orange| True|
| John| 55|banana| True|
| Brad| 44| apple|False|
| Brad| 44| mango|False|
| Brad| 44|orange|False|
| Brad| 44|banana| True|
| Alex| 55| apple| True|
| Alex| 55| mango| True|
| Alex| 55|orange| True|
| Alex| 55|banana| True|
+-----+---+------+-----+
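How can I explode the "Answers" column so that I also get a "value" column that is True or False depending on whether the answer appears in that row's array?
For example,
| John| 55| mango|False|
John's answers do not include "mango", so the value is False. Likewise, Brad would get three False rows.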
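Answer from a uj5u.com user:
Before exploding, you can collect all the distinct values that occur in the "Answers" column. Add them to the DataFrame as a new array column, explode that column so every row is paired with every possible answer, and then select the columns you need.
Input: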
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Maria', 23, ['apple', 'mango', 'orange', 'banana']),
     ('John', 55, ['apple', 'orange', 'banana']),
     ('Brad', 44, ['banana']),
     ('Alex', 55, ['apple', 'mango', 'orange', 'banana'])],
    ['Name', 'Age', 'Answers'])
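Script:
# Collect every distinct answer across all rows into a Python set.
unique_answers = set(df.agg(F.flatten(F.collect_set('Answers'))).head()[0])
# Pair each row with each possible answer (one output row per answer).
df = df.withColumn('answer', F.explode(F.array([F.lit(x) for x in unique_answers])))
df = df.select(
    'Name', 'Age', 'answer',
    # True when the row's Answers array contains this answer (F.exists needs Spark 3.1+).
    F.exists('Answers', lambda x: x == F.col('answer')).alias('value'),
    # Keep any other columns the DataFrame may have.
    *[c for c in df.columns if c not in {'Name', 'Age', 'Answers', 'answer'}]
)
df.show()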
# +-----+---+------+-----+
# | Name|Age|answer|value|
# +-----+---+------+-----+
# |Maria| 23|orange| true|
# |Maria| 23| mango| true|
# |Maria| 23| apple| true|
# |Maria| 23|banana| true|
# | John| 55|orange| true|
# | John| 55| mango|false|
# | John| 55| apple| true|
# | John| 55|banana| true|
# | Brad| 44|orange|false|
# | Brad| 44| mango|false|
# | Brad| 44| apple|false|
# | Brad| 44|banana| true|
# | Alex| 55|orange| true|
# | Alex| 55| mango| true|
# | Alex| 55| apple| true|
# | Alex| 55|banana| true|
# +-----+---+------+-----+
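To trace the logic without a Spark session, the same three steps (collect the distinct answers, pair every row with every answer, test membership) can be sketched in plain Python. This is only an illustration of the idea, not a replacement for the Spark script: the names `rows` and `expand_answers` are hypothetical, and sorting the distinct answers is an added assumption to make the row order deterministic, which the Spark version does not guarantee.

```python
# Plain-Python sketch of the collect / explode / membership-test idea.
# `rows` and `expand_answers` are hypothetical names used for illustration.
rows = [
    ("Maria", 23, ["apple", "mango", "orange", "banana"]),
    ("John", 55, ["apple", "orange", "banana"]),
    ("Brad", 44, ["banana"]),
    ("Alex", 55, ["apple", "mango", "orange", "banana"]),
]


def expand_answers(rows):
    """Return one (name, age, answer, value) tuple per row and distinct answer."""
    # Step 1: collect every distinct answer across all rows.
    # Sorting is an assumption added here for a stable order.
    unique_answers = sorted({a for _, _, answers in rows for a in answers})
    # Steps 2 and 3: pair each row with each distinct answer and
    # flag whether that answer is present in the row's own list.
    return [
        (name, age, answer, answer in answers)
        for name, age, answers in rows
        for answer in unique_answers
    ]


result = expand_answers(rows)
for record in result:
    print(record)
```

Each input row yields one output tuple per distinct answer, so the four rows and four distinct answers above give sixteen tuples, with John's "mango" and three of Brad's entries flagged False.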
