My current PySpark DataFrame looks like this:
Region  Location  Month     Services  Type  values_in_millions  values_in_percent
USA     USA       1/1/2021  ABC       DC    101537.553          34.775
Europe  Italy     2/1/2021  ABC       DC    434404.87           44.653
Europe  Spain     2/1/2021  ABC       DC    895057.332          21.925
Asia    India     3/1/2021  ABC       DC    211963.21           27.014
The DataFrame I want should have this shape:
Region  Location  Month     Services  Type  key_1               values_1    values_2
USA     USA       1/1/2021  ABC       DC    values_in_millions  101537.553
Europe  Italy     2/1/2021  ABC       DC    values_in_millions  434404.87
Europe  Spain     2/1/2021  ABC       DC    values_in_millions  895057.332
Asia    India     3/1/2021  ABC       DC    values_in_millions  211963.21
USA     USA       1/1/2021  ABC       DC    values_in_percent               34.775%
Europe  Italy     2/1/2021  ABC       DC    values_in_percent               44.653%
Europe  Spain     2/1/2021  ABC       DC    values_in_percent               21.925%
Asia    India     3/1/2021  ABC       DC    values_in_percent               27.014%
Any approach would be helpful.
Reply from a uj5u.com user:
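You can create two separate DataFrames: df1 with key_1 = 'value_in_millions' and df2 with key_1 = 'value_in_percent'. In the code below I first select the required columns, hard-code the value of the key_1 column, fill whichever of 'values_1' and 'values_2' is unused with an empty string, rename the value column to the other one, and finally re-select the columns so that both DataFrames list them in the same order. Note that my code uses the column names value_in_millions and value_in_percent, while the question's columns are values_in_millions and values_in_percent, so adjust the names to match your DataFrame.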
from pyspark.sql.functions import lit

id_cols = ["Region", "Location", "Month", "Services", "Type"]
out_cols = id_cols + ["key_1", "values_1", "values_2"]

# df1: the millions value becomes values_1; values_2 is an empty string
df1 = (df.select(*id_cols, "value_in_millions")
         .withColumn("key_1", lit("value_in_millions"))
         .withColumn("values_2", lit(""))
         .withColumnRenamed("value_in_millions", "values_1")
         .select(*out_cols))

# df2: the percent value becomes values_2; values_1 is an empty string
df2 = (df.select(*id_cols, "value_in_percent")
         .withColumn("key_1", lit("value_in_percent"))
         .withColumn("values_1", lit(""))
         .withColumnRenamed("value_in_percent", "values_2")
         .select(*out_cols))

df1.show()
df2.show()
The output is as follows.
+------+--------+----------+--------+----+-----------------+--------+--------+
|Region|Location|     Month|Services|Type|            key_1|values_1|values_2|
+------+--------+----------+--------+----+-----------------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|value_in_millions|  100000|        |
|   IND|     DLH|2001-01-01|     ABC|  DC|value_in_millions|  200000|        |
|   NYC|     NYC|2001-01-01|     ABC|  DC|value_in_millions|  300000|        |
|    UK|   WALES|2001-01-01|     ABC|  DC|value_in_millions|  400000|        |
+------+--------+----------+--------+----+-----------------+--------+--------+
+------+--------+----------+--------+----+----------------+--------+--------+
|Region|Location|     Month|Services|Type|           key_1|values_1|values_2|
+------+--------+----------+--------+----+----------------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|value_in_percent|        |      34|
|   IND|     DLH|2001-01-01|     ABC|  DC|value_in_percent|        |      35|
|   NYC|     NYC|2001-01-01|     ABC|  DC|value_in_percent|        |      36|
|    UK|   WALES|2001-01-01|     ABC|  DC|value_in_percent|        |      37|
+------+--------+----------+--------+----+----------------+--------+--------+
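Once the columns are in the same order, I can union the two DataFrames. The order matters because union matches columns by position rather than by name, which is why each DataFrame ends with the same final select.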
from functools import reduce  # reduce lives in functools on Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # Fold the union over any number of DataFrames, left to right
    return reduce(DataFrame.unionAll, dfs)

df3 = unionAll(df1, df2)
df3.show()
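The output is below (in this run the key column is labeled key1 and its values are shortened to valueinmill and valueinpct).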
+------+--------+----------+--------+----+-----------+--------+--------+
|Region|Location|     Month|Services|Type|       key1|values_1|values_2|
+------+--------+----------+--------+----+-----------+--------+--------+
|   USA|     USA|2001-01-01|     ABC|  DC|valueinmill|  100000|        |
|   IND|     DLH|2001-01-01|     ABC|  DC|valueinmill|  200000|        |
|   NYC|     NYC|2001-01-01|     ABC|  DC|valueinmill|  300000|        |
|    UK|   WALES|2001-01-01|     ABC|  DC|valueinmill|  400000|        |
|   USA|     USA|2001-01-01|     ABC|  DC| valueinpct|        |      34|
|   IND|     DLH|2001-01-01|     ABC|  DC| valueinpct|        |      35|
|   NYC|     NYC|2001-01-01|     ABC|  DC| valueinpct|        |      36|
|    UK|   WALES|2001-01-01|     ABC|  DC| valueinpct|        |      37|
+------+--------+----------+--------+----+-----------+--------+--------+
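As a possible alternative to building two DataFrames and unioning them, the same reshaping can be sketched with Spark SQL's stack generator, which emits several output rows per input row. This is a different technique from the one above, not a restatement of it. The helper names build_unpivot_exprs and unpivot are hypothetical, and the sketch assumes the question's column names (values_in_millions, values_in_percent). Both values are cast to strings because each stacked column needs a single type, and the percent sign is appended as the question's desired output shows.

```python
ID_COLS = ["Region", "Location", "Month", "Services", "Type"]

def build_unpivot_exprs(id_cols, millions_col, percent_col):
    """Return selectExpr arguments that turn one wide row into two long rows.

    Hypothetical helper: the column names passed in are assumptions taken
    from the question, not from the answer's sample data.
    """
    stack_expr = (
        "stack(2, "
        # Row 1: millions value goes in values_1, values_2 is NULL
        f"'{millions_col}', CAST({millions_col} AS STRING), CAST(NULL AS STRING), "
        # Row 2: percent value plus a trailing % goes in values_2, values_1 is NULL
        f"'{percent_col}', CAST(NULL AS STRING), concat(CAST({percent_col} AS STRING), '%')"
        ") AS (key_1, values_1, values_2)"
    )
    return list(id_cols) + [stack_expr]

def unpivot(df):
    """Apply the expressions to a DataFrame shaped like the one in the question."""
    exprs = build_unpivot_exprs(ID_COLS, "values_in_millions", "values_in_percent")
    return df.selectExpr(*exprs)
```

Calling unpivot(df).orderBy("key_1").show() should list the values_in_millions rows before the values_in_percent rows; without an explicit orderBy the row order is not guaranteed. Unlike the empty strings used above, the unused value column here is NULL. On Spark 3.4 or later, DataFrame.unpivot may be a simpler fit if a single value column is acceptable.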
