I need to find the exact time at which most check-ins occur in the Yelp dataset, but for some reason I'm getting this error. Here is my code so far:
from pyspark.sql.functions import udf
from pyspark.sql.functions import explode
from pyspark.sql.types import IntegerType
from pyspark.sql.types import ArrayType,StringType
from pyspark.sql import functions as F
square_udf_int = udf(lambda z: square(z), IntegerType())
checkin = spark.read.json('yelp_academic_dataset_checkin.json.gz')
datesplit = udf(lambda x: x.split(','),ArrayType(StringType()))
checkin.select('business_id',datesplit('date').alias('dates')).withColumn('checkin_date',explode('dates'))
datesplit = udf(lambda x: x.split(','),ArrayType(StringType()))
dates = checkin.select('business_id',datesplit('date').alias('dates')).withColumn('checkin_date',explode('dates'))
dates = dates.select("checkin_date")
dates.withColumn("checkin_date", F.date_trunc('checkin_date',
F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))).show(truncate=0)
And the error:
Py4JJavaError: An error occurred while calling o1112.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`timestamp`' given input columns: [checkin_date];;
'Project [date_trunc(checkin_date, to_timestamp('timestamp, Some(yyyy-MM-dd HH:mm:ss 'UTC')), Some(Etc/UTC)) AS checkin_date#190]
- Project [checkin_date#176]
- Project [business_id#6, dates#172, checkin_date#176]
- Generate explode(dates#172), false, [checkin_date#176]
- Project [business_id#6, <lambda>(date#7) AS dates#172]
- Relation[business_id#6,date#7] json
dates is just a Spark DataFrame with a single column named "checkin_date" that contains only datetimes, so I'm not sure why this doesn't work.
Answer:
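The error you are getting simply means that, in the following line of code, you are trying to access a column named timestamp, but no such column exists. The only input column available is checkin_date.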
dates.withColumn("checkin_date", F.date_trunc('checkin_date',
F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'")))
In fact, this is the signature of the to_timestamp function:
pyspark.sql.functions.to_timestamp(col, format=None)
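The first argument is the column and the second is the format. Note also that the first argument of date_trunc is the truncation unit (for example 'year', 'month', or 'hour'), not a column name. I assume you are trying to parse the date and then truncate it. Suppose you want to truncate the date to the month level. The correct way to do that is: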
dates.withColumn("checkin_date", F.date_trunc('month',
F.to_timestamp('checkin_date', "yyyy-MM-dd HH:mm:ss 'UTC'")))
