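I have a PySpark DataFrame with a column that contains text.
I am trying to count the number of sentences that contain an exclamation mark "!" together with the word "like" or "want".
For example, a column whose rows contain the following sentences: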
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
The output I am hoping for looks like this (counting only sentences that contain "like" or "want" and also "!"):
+----+-----+
|word|count|
+----+-----+
|like|    2|
|want|    2|
+----+-----+
Can someone help me write a UDF that does this? Here is what I have written so far, but I can't seem to get it to work.
from nltk.tokenize import sent_tokenize
def convert_a_sentence(a_string):
    sentence = lower(nltk.sent_tokenize(a_string))
    return sentence
df = df.withColumn('a_sentence', convert_a_sentence(df['text']))
df.select(explode('a_sentence').alias('found')).filter(df['a_sentence'].isin('like', 'want', '!').groupBy('found').count().collect()
Reply from a uj5u.com user:
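If all you need are unigrams (single tokens), you can split each sentence on spaces, explode the resulting array, group and count, and then filter for the words you want.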
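from pyspark.sql import functions as F

(df
    .withColumn('words', F.split('sentence', ' '))  # split each sentence on spaces
    .withColumn('word', F.explode('words'))         # one row per token
    .groupBy('word')
    .agg(
        F.count('*').alias('word_cnt')
    )
    .where(F.col('word').isin(['like', 'want']))    # keep only the words of interest
    .show()
)
# Output (counts every sentence, not only those containing "!")
# +----+--------+
# |word|word_cnt|
# +----+--------+
# |want|       2|
# |like|       3|
# +----+--------+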
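Note #1: you can apply a filter before the groupBy, using the contains function (for example, to keep only the sentences that contain "!").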
Note #2: if you want n-grams rather than the "hack" above, consider using the Spark ML package and its Tokenizer
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol='sentence', outputCol="words")
tokenized = tokenizer.transform(df)
tokenized.show(truncate=False)
# Output
# +----------------------+----------------------------+
# |sentence              |words                       |
# +----------------------+----------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |
# |I like to go shopping!|[i, like, to, go, shopping!]|
# |I want to go home!    |[i, want, to, go, home!]    |
# |I like fast food.     |[i, like, fast, food.]      |
# |you don't want to!    |[you, don't, want, to!]     |
# |what does he want?    |[what, does, he, want?]     |
# +----------------------+----------------------------+
or NGram
from pyspark.ml.feature import NGram
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramed = ngram.transform(tokenized)
ngramed.show(truncate=False)
# Output
# +----------------------+----------------------------+----------------------------------------+
# |sentence              |words                       |ngrams                                  |
# +----------------------+----------------------------+----------------------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |[i don't, don't like, like to, to sing!]|
# |I like to go shopping!|[i, like, to, go, shopping!]|[i like, like to, to go, go shopping!]  |
# |I want to go home!    |[i, want, to, go, home!]    |[i want, want to, to go, go home!]      |
# |I like fast food.     |[i, like, fast, food.]      |[i like, like fast, fast food.]         |
# |you don't want to!    |[you, don't, want, to!]     |[you don't, don't want, want to!]       |
# |what does he want?    |[what, does, he, want?]     |[what does, does he, he want?]          |
# +----------------------+----------------------------+----------------------------------------+
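To make the two tables above easier to read, here is a small plain-Python sketch of what the tokens and bigrams look like. It only imitates the behavior visible in the output (lowercase, split on whitespace, join adjacent tokens with a space). The helper names `tokenize` and `ngrams` are hypothetical and are not Spark APIs.

```python
def tokenize(sentence):
    # Assumption: imitate the output shown above by lowercasing
    # and splitting on whitespace; punctuation stays attached.
    return sentence.lower().split()


def ngrams(tokens, n=2):
    # Join every run of n consecutive tokens with a single space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(tokenize("I don't like to sing!"))
print(bigrams)
```

Note that punctuation stays glued to the last token ("sing!"), so filtering on an exact word such as "want" will miss "want?" unless the punctuation is stripped first.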
Reply from a uj5u.com user:
I'm not sure about a pandas or PySpark approach, but you can do this easily with a plain function
from nltk.tokenize import sent_tokenize

t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
"""
# sent_tokenize returns a list of sentences, so lowercase each one individually
# (this needs NLTK's Punkt tokenizer data to be installed)
sentences = [s.lower() for s in sent_tokenize(t)]
for sentence in sentences:
    if "!" in sentence and "like" in sentence:
        print(f"found in {sentence}")
And you should be able to work out how to count the matches and put them into a table...
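One possible way to do that counting, sketched with only the standard library: the sentence splitter here is a simple regex stand-in for `sent_tokenize` (an assumption made so the snippet has no external dependencies), and `count_exclaimed` is a hypothetical helper name.

```python
import re
from collections import Counter

text = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
"""

TARGET_WORDS = ("like", "want")


def count_exclaimed(text, targets=TARGET_WORDS):
    counts = Counter()
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if "!" not in sentence:
            continue
        # Whole-word matching, so "liked" or "wanted" would not be counted.
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        for target in targets:
            if target in words:
                counts[target] += 1
    return counts


counts = count_exclaimed(text)
for word, count in counts.items():
    print(f"{word:<5}{count:>5}")
```

On the example text this yields 2 for "like" and 2 for "want", since "I like fast food." and "what does he want?" have no exclamation mark.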
