PySpark: counting specific words in sentences

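I have a PySpark DataFrame with a column that contains text.

I'm trying to count the number of sentences that contain an exclamation mark "!" together with the word "like" or "want".

For example, a column whose rows contain the following sentences: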

I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food. 
you don't want to!
what does he want?

The output I'm hoping for looks like this (counting only sentences that contain "like" or "want" and also "!"):

+----+-----+
|word|count|
+----+-----+
|like|    2|
|want|    2|
+----+-----+

Can someone help me write a UDF that does this? Here is what I have written so far, but I can't seem to get it to work.

from nltk.tokenize import sent_tokenize

def convert_a_sentence(a_string):
    sentence = lower(nltk.sent_tokenize(a_string))
    return sentence

df = df.withColumn('a_sentence', convert_a_sentence(df['text']))

df.select(explode('a_sentence').alias('found')).filter(df['a_sentence'].isin('like', 'want', '!').groupBy('found').count().collect()

Reply from a uj5u.com user:

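If all you need is uni-grams (i.e. single tokens), you can split each sentence on spaces, then explode, group, count, and finally filter for the words you want: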

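from pyspark.sql import functions as F

(df
    # split on single spaces; punctuation stays attached, so "want?" is not counted as "want"
    .withColumn('words', F.split('sentence', ' '))
    # one row per token
    .withColumn('word', F.explode('words'))
    .groupBy('word')
    .agg(
        F.count('*').alias('word_cnt')
    )
    # keep only the words of interest
    .where(F.col('word').isin(['like', 'want']))
    .show()
)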

# Output
# +----+--------+
# |word|word_cnt|
# +----+--------+
# |want|       2|
# |like|       3|
# +----+--------+

Note #1: you can apply a filter before the groupBy, using the contains function.

Note #2: if you want n-grams rather than "hacking" it as above, consider the Spark ML package and its Tokenizer:

from pyspark.ml.feature import Tokenizer

# Tokenizer lowercases the text and splits it on whitespace
tokenizer = Tokenizer(inputCol='sentence', outputCol="words")
tokenized = tokenizer.transform(df)
tokenized.show(truncate=False)

# Output
# +----------------------+----------------------------+
# |sentence              |words                       |
# +----------------------+----------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |
# |I like to go shopping!|[i, like, to, go, shopping!]|
# |I want to go home!    |[i, want, to, go, home!]    |
# |I like fast food.     |[i, like, fast, food.]      |
# |you don't want to!    |[you, don't, want, to!]     |
# |what does he want?    |[what, does, he, want?]     |
# +----------------------+----------------------------+

or NGram:

from pyspark.ml.feature import NGram

# build bigrams (n=2) from the tokenized words
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramed = ngram.transform(tokenized)
ngramed.show(truncate=False)

# Output
# +----------------------+----------------------------+----------------------------------------+
# |sentence              |words                       |ngrams                                  |
# +----------------------+----------------------------+----------------------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |[i don't, don't like, like to, to sing!]|
# |I like to go shopping!|[i, like, to, go, shopping!]|[i like, like to, to go, go shopping!]  |
# |I want to go home!    |[i, want, to, go, home!]    |[i want, want to, to go, go home!]      |
# |I like fast food.     |[i, like, fast, food.]      |[i like, like fast, fast food.]         |
# |you don't want to!    |[you, don't, want, to!]     |[you don't, don't want, want to!]       |
# |what does he want?    |[what, does, he, want?]     |[what does, does he, he want?]          |
# +----------------------+----------------------------+----------------------------------------+

Reply from a uj5u.com user:

I'm not sure about a pandas or PySpark approach, but you can do this easily with plain Python:

from nltk.tokenize import sent_tokenize

t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food. 
you don't want to!
what does he want?
"""
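# sent_tokenize returns a list of strings, so lowercase each sentence individually
# (NLTK's Punkt tokenizer data may need to be downloaded first)
sentences = [s.lower() for s in sent_tokenize(t)]
for sentence in sentences:
  if "!" in sentence and "like" in sentence:
    print(f"found in {sentence}")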

And you should be able to figure out how to count the matches and put them into a table...
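One possible way to finish that off, as a minimal sketch only: it assumes one sentence per line (instead of calling `sent_tokenize`, so it runs with the standard library alone) and uses a simple regular expression to pull out words. The `keywords` tuple and the `counts` variable are names chosen here for illustration.

```python
import re
from collections import Counter

text = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food. 
you don't want to!
what does he want?
"""

keywords = ("like", "want")
counts = Counter()

# Assumption: each line holds exactly one sentence
for sentence in text.splitlines():
    lowered = sentence.lower()
    if "!" not in lowered:
        continue  # only sentences with an exclamation mark are counted
    # whole words only, so punctuation such as "!" does not stick to a word
    words = set(re.findall(r"[a-z']+", lowered))
    for keyword in keywords:
        if keyword in words:
            counts[keyword] += 1

# Print a small two-column table
print("word  count")
for keyword in keywords:
    print(f"{keyword:<5} {counts[keyword]}")
```

Using a set of words per sentence means each sentence contributes at most one to a keyword's count, which matches the "number of sentences" wording in the question.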
