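I am getting this error:
line 23, in parseRating
IndexError: list index out of range
...on any attempt to call .collect(), .count(), and so on. The last line, df3.collect(), throws the error, but every .show() call works. I don't think it is a problem with the data, but I could be wrong.
I am new to this and really don't know what is going on. Any help is much appreciated.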
import os
from os import remove, removedirs
from os.path import join, isfile, dirname
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

def parseRating(line):
    """
    Parses a rating record in MovieLens format userId::movieId::rating::timestamp .
    """
    fields = line.strip().split("::")
    return int(fields[3]), int(fields[0]), int(fields[1]), float(fields[2])
    #return int(fields[0]), int(fields[1]), float(fields[2])

if __name__ == "__main__":
    # set up environment
    spark = SparkSession.builder \
        .master("local") \
        .appName("Movie Recommendation Engine") \
        .config("spark.driver.memory", "16g") \
        .getOrCreate()
    sc = spark.sparkContext
    # load personal ratings
    #myRatings = loadRatings(os.path.abspath('personalRatings.txt'))
    myRatingsRDD = sc.textFile("personalRatings.txt").map(parseRating)
    ratings = sc.textFile("ratings.dat").map(parseRating)
    df1 = spark.createDataFrame(myRatingsRDD, ["timestamp", "userID", "movieID", "rating"])
    df1.show()
    df2 = spark.createDataFrame(ratings, ["timestamp", "userID", "movieID", "rating"])
    df2.show()
    df3 = df1.union(df2)
    df3.show()
    df3.printSchema()
    df3 = df3.\
        withColumn('userID', col('userID').cast('integer')).\
        withColumn('movieID', col('movieID').cast('integer')).\
        withColumn('rating', col('rating').cast('float')).\
        drop('timestamp')
    df3.show()
    ratings = df3
    df3.collect()
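Reply from a uj5u.com user:
One of the lines in your text file is probably malformed or incomplete, so split("::") may not produce the expected number of fields. You can update your function to check the number of fields before accessing an index. For example: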
def parseRating(line):
    """
    Parses a rating record in MovieLens format userId::movieId::rating::timestamp .
    """
    fields = line.strip().split("::")
    timestamp = int(fields[3]) if len(fields) > 3 else None
    userId = int(fields[0]) if len(fields) > 0 else None
    movieId = int(fields[1]) if len(fields) > 1 else None
    rating = float(fields[2]) if len(fields) > 2 else None
    return timestamp, userId, movieId, rating
If needed, you can add even more exception handling.
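One way to take the exception handling further is sketched below. It rejects the whole record instead of filling individual fields with None, and it also catches non-numeric fields. This matters because a blank line still splits into one empty string, and int("") raises ValueError rather than IndexError. The name parse_rating_safe and the sample lines are made up for illustration.

```python
from typing import Optional, Tuple

# (timestamp, userId, movieId, rating), the same order the question uses
Rating = Tuple[int, int, int, float]

def parse_rating_safe(line: str) -> Optional[Rating]:
    """Return a parsed rating, or None if the line is malformed."""
    fields = line.strip().split("::")
    if len(fields) != 4:
        return None  # blank line, truncated record, or wrong delimiter
    try:
        return int(fields[3]), int(fields[0]), int(fields[1]), float(fields[2])
    except ValueError:
        return None  # a field that is not numeric

# Hypothetical sample: one good record followed by three bad ones.
lines = ["1::1193::5::978300760", "", "1::661::3", "1::abc::3::978302109"]
parsed = [r for r in map(parse_rating_safe, lines) if r is not None]
print(parsed)  # [(978300760, 1, 1193, 5.0)]
```

With an RDD, the same idea would be sc.textFile("ratings.dat").map(parse_rating_safe).filter(lambda r: r is not None), so that the bad rows are dropped before createDataFrame is called.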
Let me know if this works for you.
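Reply from a uj5u.com user:
The error comes from the parseRating function and is a list index out of range. Some lines in the data probably do not have the expected number of fields after being split on the :: delimiter.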
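This is probably also why .show() works while .collect() and .count() fail. Spark evaluates transformations lazily, and show() only needs the first 20 rows, so it typically parses just the beginning of the data. collect() and count() have to parse every line, including the bad one. The plain-Python sketch below imitates that behavior with a lazy map. It is an analogy with made-up sample lines, not Spark code.

```python
from itertools import islice

def parse_rating(line):
    # Same logic as parseRating in the question.
    fields = line.strip().split("::")
    return int(fields[3]), int(fields[0]), int(fields[1]), float(fields[2])

# Hypothetical sample: two good records, then a blank line at the end.
lines = ["1::1193::5::978300760", "1::661::3::978302109", ""]

lazy = map(parse_rating, lines)     # nothing has been parsed yet
first_two = list(islice(lazy, 2))   # like show(): only the first rows are parsed
print(first_two)

error = None
try:
    list(map(parse_rating, lines))  # like collect(): every line is parsed
except IndexError as exc:
    error = exc
print("IndexError:", error)
```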
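How about loading the text file directly into a DataFrame, specifying the field delimiter and whether there is a header (true/false), and then using cast?
Something like this: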
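# Multi-character delimiters such as "::" need Spark 3.0 or later.
# header is "false" on the assumption that the file has no header row
# (parseRating treats every line as data), so the columns are named with toDF().
df1 = spark.read.format("csv") \
    .option("header", "false") \
    .option("delimiter", "::") \
    .load("personalRatings.txt") \
    .toDF("userId", "movieId", "rating", "timestamp")
df1 = df1.select(df1.timestamp.cast("int"), df1.userId.cast("int"), df1.movieId.cast("int"), df1.rating.cast("float"))
df1.show(10)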
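Whichever approach you choose, it may help to locate the offending lines first. The sketch below is plain Python and assumes four fields per record separated by ::. The helper name find_malformed and the sample data are hypothetical.

```python
def find_malformed(lines, delimiter="::", expected_fields=4):
    """Yield (line_number, line) for every line with the wrong field count."""
    for number, line in enumerate(lines, start=1):
        if len(line.strip().split(delimiter)) != expected_fields:
            yield number, line

# Hypothetical sample: a good record, a blank line, and a truncated record.
sample = ["1::1193::5::978300760\n", "\n", "1::661::3\n"]
for number, line in find_malformed(sample):
    print(number, repr(line))
```

To check a real file, pass the open file object instead of the sample list, for example by calling find_malformed(f) inside a with open("ratings.dat") as f: block.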
