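I'm new to PySpark. I want to compare the text in two different DataFrames (both containing news items) in order to make recommendations.
I can do this easily in plain Python: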
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# det and ted are pandas DataFrames with a 'News Data' column (loaded elsewhere)
def get_recommendations(title, cosine_sim, indices):
    idx = indices[title]
    # Get the pairwise similarity scores for the selected item
    sim_scores = list(enumerate(cosine_sim[idx]))
    print(sim_scores)
    # Sort the news items by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep ranks 2 to 11 (the top-scoring entry is skipped)
    sim_scores = sim_scores[1:11]
    talk_indices = [i[0] for i in sim_scores]
    # Return the 10 selected news items
    return ted['News Data'].iloc[talk_indices]

indices = pd.Series(det.index, index=det['Unnamed: 0']).drop_duplicates()
transcripts = det['News Data']
transcripts2 = ted['News Data']
# Fit the vocabulary and IDF weights on det, then reuse them for ted
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(transcripts)
tfidf_matrixx = tfidf.transform(transcripts2)
# TfidfVectorizer L2-normalizes rows, so the linear kernel equals cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrixx)
print(get_recommendations(0, cosine_sim, indices))
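When I switched to PySpark, I got very different results from the TF-IDF computation. I know that I need to compute cosine similarity to make the "news" recommendations.
In PySpark I use the following for the TF-IDF computation: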
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

df1 = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('bbcclear.csv')
df2 = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('yenisafakcategorypredict.csv')

# tokenize
tokenizer = Tokenizer().setInputCol("News Data").setOutputCol("word")
wordsData = tokenizer.transform(df2)
wordsData2 = tokenizer.transform(df1)

# vectorize (the vocabulary is fit on df2 and reused for df1)
vectorizer = CountVectorizer(inputCol='word', outputCol='vectorizer').fit(wordsData)
wordsData = vectorizer.transform(wordsData)
wordsData2 = vectorizer.transform(wordsData2)

# calculate scores
idf = IDF(inputCol="vectorizer", outputCol="tfidf_features")
idf_model = idf.fit(wordsData)
wordsData = idf_model.transform(wordsData)
# note: the IDF model is refit on the second DataFrame here
idf_model = idf.fit(wordsData2)
wordsData2 = idf_model.transform(wordsData2)
How can I use the TF-IDF vectors obtained above to compute cosine similarity for the recommendations?
Answer:
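Below is an example of using TF-IDF in Spark, taken from my PoC work. I strongly recommend using an advanced NLP framework such as BERT instead of TF-IDF if you want meaningful similarities.
Sample dataset: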
df = spark.createDataFrame(
    [
        ["cricket sport team player"],
        ["global politics"],
        ["football sport player team"],
    ],
    ["news"]
)
+--------------------------+
|news                      |
+--------------------------+
|cricket sport team player |
|global politics           |
|football sport player team|
+--------------------------+
TF-IDF vectorization and cosine similarity computation:
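from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF, RegexTokenizer
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

# Tokenize on runs of word characters, count terms, then apply IDF weights
regex_tokenizer = RegexTokenizer(gaps=False, pattern=r"\w+", inputCol="news", outputCol="tokens")
count_vectorizer = CountVectorizer(inputCol="tokens", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="idf")
tf_idf_pipeline = Pipeline(stages=[regex_tokenizer, count_vectorizer, idf])
df = tf_idf_pipeline.fit(df).transform(df).drop("news", "tokens", "tf")

# Pair every row with every row (including itself)
df = df.crossJoin(df.withColumnRenamed("idf", "idf2"))

@F.udf(returnType=FloatType())
def cos_sim(u, v):
    # Cosine similarity of two ML vectors; assumes neither vector is all zeros
    return float(u.dot(v) / (u.norm(2) * v.norm(2)))

df.withColumn("cos_sim", cos_sim(F.col("idf"), F.col("idf2"))).show()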
+--------------------+--------------------+----------+
|                 idf|                idf2|   cos_sim|
+--------------------+--------------------+----------+
|(7,[0,1,2,4],[0.2...|(7,[0,1,2,4],[0.2...|       1.0|
|(7,[0,1,2,4],[0.2...|(7,[5,6],[0.69314...|       0.0|
|(7,[0,1,2,4],[0.2...|(7,[0,1,2,3],[0.2...|0.34070355|
|(7,[5,6],[0.69314...|(7,[0,1,2,4],[0.2...|       0.0|
|(7,[5,6],[0.69314...|(7,[5,6],[0.69314...|       1.0|
|(7,[5,6],[0.69314...|(7,[0,1,2,3],[0.2...|       0.0|
|(7,[0,1,2,3],[0.2...|(7,[0,1,2,4],[0.2...|0.34070355|
|(7,[0,1,2,3],[0.2...|(7,[5,6],[0.69314...|       0.0|
|(7,[0,1,2,3],[0.2...|(7,[0,1,2,3],[0.2...|       1.0|
+--------------------+--------------------+----------+
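As a sanity check, the 0.34070355 in the table can be reproduced by hand. The sketch below is plain Python, not Spark, and assumes Spark ML's IDF uses the smoothed weight log((m + 1) / (df + 1)). It also computes the same pair with scikit-learn's default weighting, which adds 1 to that value. This may be one reason the question saw different numbers after moving to PySpark; stop-word removal, which the question's `Tokenizer` does not do, is likely another. The helper names (`spark_idf`, `sklearn_idf`, `tfidf_vector`, `cosine`) are made up for this illustration.

```python
import math

# The three sample documents from the answer, already tokenized.
docs = [
    ["cricket", "sport", "team", "player"],
    ["global", "politics"],
    ["football", "sport", "player", "team"],
]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)
# Number of documents containing each term.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

def spark_idf(term):
    # Assumed Spark ML weighting: log((m + 1) / (df + 1)).
    return math.log((n_docs + 1) / (doc_freq[term] + 1))

def sklearn_idf(term):
    # scikit-learn default (smooth_idf=True): the same ratio, plus 1.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc, idf):
    # Raw term count times the IDF weight, in vocabulary order.
    return [doc.count(term) * idf(term) for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Compare the cricket document with the football document.
spark_sim = cosine(tfidf_vector(docs[0], spark_idf), tfidf_vector(docs[2], spark_idf))
sklearn_sim = cosine(tfidf_vector(docs[0], sklearn_idf), tfidf_vector(docs[2], sklearn_idf))
print(round(spark_sim, 4))   # 0.3407, matching the cos_sim column above
print(round(sklearn_sim, 4))
```

The two weightings rank the pairs in the same order on this toy data, but the absolute scores differ, so thresholds tuned with scikit-learn should not be carried over unchanged.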
