When using WordCloud for Python, why is the frequency of the letter "S" counted in the cloud?

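I am getting to know the WordCloud package for Python, and I am testing it with the Moby Dick text from NLTK. A snippet of it is shown here:

[Image: a snippet of my sample string]

As you can see from the highlights in the image, all of the possessive apostrophes have been escaped to "/'S", and WordCloud seems to include this in the frequency count as "S":

[Image: word frequency distribution]

This of course causes a problem, because "S" is counted as a high-frequency word and the frequencies of all the other words in the cloud are skewed:

[Image: my skewed cloud example]

In the tutorial I am following, which uses the same Moby Dick string, WordCloud does not seem to count the "S". Am I missing an attribute somewhere, or do I have to remove the "/'s" from the string manually?

Here is a summary of my code: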

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

plt.show()

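Answer:

In applications like this, it is common to filter the word list with stopwords first, because you do not want simple words such as a, an, the, it, ... to dominate your results.

I changed the code a little; I hope it helps. You can inspect the stopword list to see which words get removed.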

import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
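# The original ["".join(word) for word in example_corpus] step is dropped: it does not change the tokens
# Build the stopword set once; calling stopwords.words() for every token is very slow
english_stopwords = set(stopwords.words('english'))
# Compare in lowercase, because the entries in the stopword list are lowercase
word_list = [word for word in example_corpus if word.lower() not in english_stopwords]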
novel_as_string = " ".join(word_list)

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

plt.show()

Output: see the word cloud on Imgur

Answer:

It looks like your input is part of the problem. If you inspect it like this,

import nltk

corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
words = list(corpus)
print(words[215:230])

you get

['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.', 'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']

There are a few things you can try to work around this. You can keep only the strings that are longer than one character,

words = [word for word in corpus if len(word) > 1]
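
To see what the length filter does, here is a minimal sketch that applies it to the slice printed above and counts the tokens before and after. The helper name `drop_short_tokens` and its `min_length` parameter are made up for this illustration.

```python
from collections import Counter

# Tokens copied from the printed slice above.
tokens = ['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.',
          'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']


def drop_short_tokens(words, min_length=2):
    """Keep only the tokens that have at least min_length characters."""
    return [w for w in words if len(w) >= min_length]


filtered = drop_short_tokens(tokens)

before = Counter(tokens)    # counts for the unfiltered tokens
after = Counter(filtered)   # counts once the short tokens are gone

print(before['S'], after['S'])
print(filtered)
```

Note that this filter also drops genuine one-letter words such as "a" and "I", along with the punctuation tokens. That is usually acceptable for a word cloud. Recent versions of wordcloud also appear to accept a `min_word_length` argument on `WordCloud` that serves a similar purpose, so check whether your installed version supports it.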

You could also try a different file provided by nltk, or you could try reading the raw input and decoding it properly.
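
The sketch below illustrates why the raw text may behave better than the pre-split tokens. It uses a simplified word counter, `count_words`, which is a stand-in written for this example and not WordCloud's actual code; the sample sentence is also made up. With NLTK, the raw text should be available through `nltk.corpus.gutenberg.raw("melville-moby_dick.txt")`.

```python
import re
from collections import Counter


def count_words(text):
    """Count words with a simplified tokenizer (an assumption, not WordCloud's code)."""
    # A word starts with a word character, followed by word characters or apostrophes.
    words = re.findall(r"\w[\w']*", text.lower())
    # Strip a trailing possessive "'s" so that "whale's" is counted as "whale".
    words = [w[:-2] if w.endswith("'s") else w for w in words]
    return Counter(words)


# Raw text keeps each possessive attached to its word.
raw_text = "The whale's jaw and the ship's deck"

# Joining pre-split tokens with spaces leaves "s" standing alone.
joined_tokens = " ".join(
    ["The", "whale", "'", "s", "jaw", "and", "the", "ship", "'", "s", "deck"]
)

raw_counts = count_words(raw_text)
joined_counts = count_words(joined_tokens)

print(raw_counts["s"], joined_counts["s"])
```

In the joined version every possessive contributes a separate "s" token, which is the effect described in the question. In the raw version the possessive stays attached to its word and can be stripped in one step.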
