PySpark: creating a DataFrame - boolean fields in a Map type are resolved as null

I am creating a DataFrame from a Python list, as shown below:

_test = [('val1', {'key1': ['A', 'B'], 'key2': ['C'], 'bool_key1': True, 'bool_key2': True}),
           ('val2', {'key1': ['B'], 'key2': ['D'], 'bool_key1': False, 'bool_key2': None})]
df_test = spark.createDataFrame(_test, schema = ["col1","col2"])
df_test.show(truncate=False)

However, every boolean field in the resulting DataFrame is null!

+----+---------------------------------------------------------+
|col1|col2                                                     |
+----+---------------------------------------------------------+
|val1|[key1 -> [A, B], bool_key2 ->, key2 -> [C], bool_key1 ->]|
|val2|[key1 -> [B], bool_key2 ->, key2 -> [D], bool_key1 ->]   |
+----+---------------------------------------------------------+

Schema of the df_test DataFrame:

root
 |-- col1: string (nullable = true)
 |-- col2: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: string (containsNull = true)

Is there a way to create the DataFrame without changing the structure of the Python variable?

A uj5u.com user replied:

Define the schema explicitly. The rows below are written as lists instead of tuples; try the following code:

_test1 = [["val1",{"key1": ["A", "B"], "key2": ["C"], "bool_key1": True, "bool_key2": True}],
         ["val1",{"key1": ["A", "B"], "key2": ["C"], "bool_key1": True, "bool_key2": True}],
         ["val2", {"key1": ["B"], "key2": ["D"], "bool_key1": False, "bool_key2": None}]]

df2 = spark.createDataFrame(_test1, 'col1 string, col2 struct<key1:array<string>,key2:array<string>,bool_key1:boolean,bool_key2:boolean>')
df2.show(truncate=False)

+----+-------------------------+
|col1|col2                     |
+----+-------------------------+
|val1|{[A, B], [C], true, true}|
|val1|{[A, B], [C], true, true}|
|val2|{[B], [D], false, null}  |
+----+-------------------------+

root
 |-- col1: string (nullable = true)
 |-- col2: struct (nullable = true)
 |    |-- key1: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- key2: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- bool_key1: boolean (nullable = true)
 |    |-- bool_key2: boolean (nullable = true)

A uj5u.com user replied:

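In addition to @wwnde's answer, there is another way to define the struct schema (although I personally prefer @wwnde's answer, since it takes fewer lines of code):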

Define the struct schema:

from pyspark.sql.types import (
    ArrayType, BooleanType, MapType, StringType, StructField, StructType,
)

# col2 is a struct, so each key gets its own field and its own type.
schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StructType([
        StructField("key1", ArrayType(StringType())),
        StructField("key2", ArrayType(StringType())),
        StructField("bool_key1", BooleanType()),
        StructField("bool_key2", BooleanType()),
    ])),
])

Create the DataFrame:

_test = [
         ('val1', {'key1': ['A', 'B'], 'key2': ['C'], 'bool_key1': True, 'bool_key2': True}),
         ('val2', {'key1': ['B'], 'key2': ['D'], 'bool_key1': False, 'bool_key2': None})
        ]

df=spark.createDataFrame(data=_test, schema=schema)
df.printSchema()

Output:

root
 |-- col1: string (nullable = true)
 |-- col2: struct (nullable = true)
 |    |-- key1: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- key2: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- bool_key1: boolean (nullable = true)
 |    |-- bool_key2: boolean (nullable = true)
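
A struct schema like the one above fixes the set of field names, so it helps if every dict uses exactly those keys. As far as I can tell, a key that is missing from a dict comes through as null and an unexpected key is silently dropped, so a quick pure-Python check before calling createDataFrame can catch such rows. This is only a sketch: key_mismatches and EXPECTED_KEYS are hypothetical names, and the second sample row is made up for illustration.

```python
EXPECTED_KEYS = {"key1", "key2", "bool_key1", "bool_key2"}


def key_mismatches(rows, expected=EXPECTED_KEYS):
    """Return (row_index, missing_keys, extra_keys) for rows that deviate."""
    problems = []
    for i, (_, mapping) in enumerate(rows):
        missing = sorted(expected - set(mapping))
        extra = sorted(set(mapping) - expected)
        if missing or extra:
            problems.append((i, missing, extra))
    return problems


# Illustrative data: the second row lacks two keys and has one unexpected key.
sample_rows = [
    ('val1', {'key1': ['A', 'B'], 'key2': ['C'], 'bool_key1': True, 'bool_key2': True}),
    ('val3', {'key1': ['A'], 'bool_key1': True, 'flag': False}),
]

print(key_mismatches(sample_rows))
```

An empty result means every row lines up with the struct fields.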

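If you want to keep the MapType key-value pairs intact, try the following approach. Note that the schema declares the map values as strings, so lists and booleans are stored in their string form: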

_test = [
         ('val1', {'key1': ['A', 'B'], 'key2': ['C'], 'bool_key1': True, 'bool_key2': True}),
         ('val2', {'key1': ['B'], 'key2': ['D'], 'bool_key1': False, 'bool_key2': None})
        ]

schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", MapType(StringType(), StringType())),
])

# Pass the explicit schema; with only column names, Spark infers the map value type from the data.
df_test = spark.createDataFrame(data=_test, schema=schema)
df_test.show(truncate=False)

+----+-------------------------------------------------------------------+
|col1|col2                                                               |
+----+-------------------------------------------------------------------+
|val1|{key1 -> [A, B], bool_key2 -> true, key2 -> [C], bool_key1 -> true}|
|val2|{key1 -> [B], bool_key2 -> null, key2 -> [D], bool_key1 -> false}  |
+----+-------------------------------------------------------------------+

df_test.printSchema()

root
 |-- col1: string (nullable = true)
 |-- col2: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
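
With a map<string,string> column, a list ends up as text such as [A, B], which is awkward to parse back into a list. One possible workaround, which is my own assumption and not part of the answer above, is to JSON-encode each value in plain Python before building the DataFrame. The helper name encode_map_values is hypothetical.

```python
import json


def encode_map_values(rows):
    """Return rows whose dict values are JSON strings (None stays None)."""
    encoded = []
    for col1, mapping in rows:
        encoded.append(
            (col1, {k: None if v is None else json.dumps(v) for k, v in mapping.items()})
        )
    return encoded


_test = [
    ('val1', {'key1': ['A', 'B'], 'key2': ['C'], 'bool_key1': True, 'bool_key2': True}),
    ('val2', {'key1': ['B'], 'key2': ['D'], 'bool_key1': False, 'bool_key2': None}),
]

encoded_rows = encode_map_values(_test)
print(encoded_rows[0][1]["key1"])       # ["A", "B"]
print(encoded_rows[1][1]["bool_key1"])  # false
```

The encoded rows should then be usable with the same MapType(StringType(), StringType()) schema, and the values can be decoded later, for example with json.loads in Python or from_json in pyspark.sql.functions.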

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/453287.html

Tags: apache-spark pyspark
