Adding a column of fake data to a DataFrame in PySpark: Unsupported literal type class

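I am trying to add an extra column of fake data to my dataset. Take this as an example (what the DataFrame contains makes no difference - I need a new extra column with unique fake names; this is just a dummy to play with):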

from faker import Faker

faker = Faker("en_GB")

profiles = [faker.profile() for i in range(0, 100)]
profiles = spark.createDataFrame(profiles)

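I am trying to add a new first-name column, with one name per row. At the moment I am doing this (I know it won't do what I want, but I don't know what else to try):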

profiles = profiles.withColumn('first_name', lit([faker.first_name()] for _ in 'name'))

However, I keep getting this error:

java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[Robin], [Margaret], [Robin], [Victor]]

I want to keep this in one line, because that is what I need for further analysis.

I think I understand why I am getting the error, but I don't know what to do about it... any ideas are appreciated!

Reply from a uj5u.com user:

Try something like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from faker import Faker

faker = Faker("en_GB")

spark = SparkSession.builder.getOrCreate()
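profiles = [faker.profile() for i in range(0, 100)]
profiles = spark.createDataFrame(profiles)
# Generate one fake name per row on the driver, outside the DataFrame.
fake_names = [faker.first_name() for _ in range(profiles.count())]
# Caveat: monotonically_increasing_id() is unique and increasing but not
# guaranteed to be consecutive, so fake_names[x] is only safe while the
# DataFrame sits in a single partition.
profiles = profiles.withColumn(
    "first_name", F.udf(lambda x: fake_names[x])(F.monotonically_increasing_id())
)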

The fake names need to be generated outside the DataFrame. If you use:

profiles.withColumn("first_name", F.lit(faker.first_name()))

you will get the same fake name for every row.
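
A plain-Python sketch of why this happens, with no Spark involved: Python evaluates the argument before `F.lit` ever sees it, so `lit` receives one finished value rather than a recipe for producing names. `fake_first_name` below is a hypothetical stand-in for `faker.first_name()`, used only so the snippet runs without Faker installed. The same reasoning likely accounts for the four-element list in the error message from the question: iterating over the string `'name'` loops over its four characters.

```python
import itertools

# Hypothetical stand-in for faker.first_name(): cycles through a fixed list
# so the snippet runs without Faker and gives repeatable results.
_names = itertools.cycle(["Robin", "Margaret", "Victor", "Alice"])


def fake_first_name():
    return next(_names)


# The argument is evaluated once, before the call it is passed to,
# so a literal column would carry this single value on every row.
single_value = fake_first_name()
rows_with_literal = [single_value for _ in range(5)]

# Iterating over the string "name" loops over its characters n, a, m, e,
# which is why the list in the error message has exactly four entries.
from_question = [[fake_first_name()] for _ in "name"]

print(rows_with_literal)
print(len(from_question))
```

To get a different name per row, the value has to vary per row, which is what the list lookup in the answer above is for.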

Edit:

row_number example:

from pyspark.sql import Window

fake_names = [faker.first_name() for _ in range(profiles.count())]
window = Window.orderBy("name")  # Or any other unique column, but I guess name is unique here
# row_number() starts at 1, hence the x - 1 when indexing the list.
profiles = profiles.withColumn(
    "first_name", F.udf(lambda x: fake_names[x - 1])(F.row_number().over(window))
)
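
The `row_number` variant is probably the safer of the two, because `monotonically_increasing_id()` only promises increasing, unique values, not consecutive ones. According to the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below mimics that layout in plain Python to show why indexing a 100-element list with such IDs can fail once the data spans more than one partition. The helper `simulated_id` is hypothetical and not a Spark function, and the two-partition split is an assumption chosen for illustration.

```python
def simulated_id(partition_index, row_in_partition):
    # Hypothetical helper mimicking the documented bit layout:
    # partition ID in the upper bits, record number in the lower 33 bits.
    return (partition_index << 33) + row_in_partition


fake_names = [f"name_{i}" for i in range(100)]

# Assume 100 rows split evenly across two partitions.
ids = [simulated_id(p, r) for p in range(2) for r in range(50)]

in_range = [i for i in ids if i < len(fake_names)]
out_of_range = [i for i in ids if i >= len(fake_names)]

# row_number() over a window yields 1..N, so x - 1 is always a valid index.
row_numbers = list(range(1, len(ids) + 1))

print(len(in_range), len(out_of_range))  # 50 50
print(out_of_range[0])  # 8589934592
```

Every ID from the second partition is far beyond the end of the list, whereas the row numbers map cleanly onto indices 0 to 99.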

Reply from a uj5u.com user:

Is this what you want?

from faker import Faker

faker = Faker("en_GB")

profiles = [[faker.profile(), faker.first_name()] for i in range(0, 100)]
profiles = spark.createDataFrame(profiles, ["profile", "first_name"])

profiles.show()
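
In the code above the whole profile ends up in a single `profile` column. A possible variation, sketched here in plain Python, is to merge the name into each profile dictionary before building the DataFrame, so the profile fields stay as separate keys. `fake_profile` and `fake_first_name` are hypothetical stand-ins for `faker.profile()` and `faker.first_name()`, and the fields they return are made up for illustration.

```python
import random


def fake_profile(rng):
    # Hypothetical stand-in for faker.profile(); the real one returns more fields.
    return {
        "username": f"user{rng.randrange(10_000)}",
        "job": rng.choice(["Engineer", "Nurse", "Teacher"]),
    }


def fake_first_name(rng):
    # Hypothetical stand-in for faker.first_name().
    return rng.choice(["Robin", "Margaret", "Victor", "Alice"])


rng = random.Random(0)

# Merge one freshly generated first name into each profile dictionary.
records = [
    {**fake_profile(rng), "first_name": fake_first_name(rng)} for _ in range(100)
]

print(len(records))
print(sorted(records[0]))
```

A list like `records` could then be handed to `spark.createDataFrame` in the same way as the list of profiles in the question.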

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/365955.html
