I have an external source file, news.txt, in the following format:
20030249,the old men
20040229,I like the way school and teachers work
20050249,another title goes here for any reason
20060269,text and strings are similar
20070551,cowbows love to ride horses
and a word list, words.txt:
the
a
school
horses
The code below creates a pair RDD of the following form:
['2003', ['the', 'old', 'men']], ['2004', ['I', 'like', 'the',...
After the code below, I want to add a pair-RDD transformation that removes the words listed in words.txt from the value lists of the RDD "pair":
source = sc.textFile("news.txt")
stopwords = sc.textFile("words.txt")
pair = source.map(lambda s: [s[0:4],s[9::].split(' ')])
I have tried several things without success, but I'm sure I'm close:
pair1 = pair.filter(lambda x: x not in stopwords)
pair1 = pair.map(lambda ws: for w in ws if w not in stopwords)
pair1 = pair.filter(lambda a: a != stopwords)
pair1 = pair.mapValues(lambda x: x not in stopwords)
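Answer from a uj5u.com user:
I assume the stopword list is very small. In that case, the simplest solution is to collect it to the driver and then broadcast it. You can then use the broadcast set to filter the word list in each value of the RDD: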
stopwords_local = set(stopwords.collect())
stopwords_bc = sc.broadcast(stopwords_local)
result = pair.mapValues(lambda ws: [w for w in ws if w not in stopwords_bc.value])
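To see what the function passed to mapValues does to each record, here is a minimal plain-Python sketch that runs without Spark. The sample lines and stopwords are copied from the question; the helper names parse and remove_stopwords are made up for illustration and are not part of the answer above.

```python
# Plain-Python sketch (no Spark) of the per-record logic.
# `parse` and `remove_stopwords` are illustrative names only.
news_lines = [
    "20030249,the old men",
    "20040229,I like the way school and teachers work",
]
stopwords_local = {"the", "a", "school", "horses"}  # contents of words.txt

def parse(s):
    # Same slicing as the question's lambda: 4-character year, then the words after the comma.
    return (s[0:4], s[9:].split(" "))

def remove_stopwords(ws, stopwords):
    # Same list comprehension as the lambda given to mapValues.
    return [w for w in ws if w not in stopwords]

pairs = [parse(s) for s in news_lines]
result_local = [(k, remove_stopwords(ws, stopwords_local)) for k, ws in pairs]
print(result_local)
```

Using a set for the stopwords keeps each membership test cheap, which is presumably why the answer wraps stopwords.collect() in set() before broadcasting.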
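If the stopwords RDD is large enough that you cannot or do not want to broadcast it, you can solve the problem by flattening the first RDD into one row per word, left-joining it with stopwords, keeping only the rows that have no match, and then grouping the data back by key: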
result = pair\
.flatMapValues(lambda x:x)\
.map(lambda x: (x[1], x[0]))\
.leftOuterJoin(stopwords.map(lambda x: (x, 1)))\
.filter(lambda x: x[1][1] is None)\
.map(lambda x: (x[1][0], x[0]))\
.groupByKey()\
.mapValues(list)
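The join-based pipeline can be traced step by step on a few records. The following is a plain-Python emulation under stated assumptions, not Spark code: lists stand in for RDDs, a dict lookup stands in for leftOuterJoin, and the "2099" record is hypothetical, added only to show what happens when every word is a stopword.

```python
from collections import defaultdict

# Lists stand in for RDDs; the "2099" record is hypothetical.
pairs = [
    ("2003", ["the", "old", "men"]),
    ("2007", ["cowbows", "love", "to", "ride", "horses"]),
    ("2099", ["the", "a"]),
]
stopwords = ["the", "a", "school", "horses"]

# flatMapValues + map: one (word, key) row per word.
flat = [(w, k) for k, ws in pairs for w in ws]

# leftOuterJoin emulation: attach 1 if the word is a stopword, else None.
lookup = {w: 1 for w in stopwords}
joined = [(w, (k, lookup.get(w))) for w, k in flat]

# filter: keep only the rows with no match on the stopword side.
kept = [(k, w) for w, (k, match) in joined if match is None]

# groupByKey + mapValues(list): collect the surviving words per key.
grouped = defaultdict(list)
for k, w in kept:
    grouped[k].append(w)
result_local = dict(grouped)
print(result_local)
```

Two things to keep in mind with the real RDD version: a key whose words are all stopwords disappears from the result (as "2099" does here), and after the shuffle in groupByKey the words within a key are not guaranteed to keep their original order, unlike in this local emulation.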
