Analyzing Article Word Frequency with Python and Generating a Word Cloud

Fetching the article with the requests module

import jieba
import requests
from bs4 import BeautifulSoup
import re

# Punctuation characters to strip from the article text
r = '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+'
# Article link: a news story from China Daily, chosen as the sample for this analysis
url = 'http://www.chinadaily.com.cn/a/202008/20/WS5f3db65da31083481726171e.html'  # URL of the English news article
headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 QIHU 360SE'
}
# Fetch the page; the timeout keeps the script from hanging on a slow server
response = requests.get(url=url, headers=headers, timeout=10)
print(response)  # <Response [200]>

<Response [200]>

Extracting the article text with BeautifulSoup and counting word frequencies

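Basic approach

Extract the article content with BeautifulSoup.
Strip punctuation characters from the text.
Convert the text to lowercase so that capitalization does not split the counts.
Split the English text into a list of words and record the length of the list.
Iterate over the list, count how often each word occurs, and store the counts in a dictionary.
Compute each word's relative frequency (its count divided by the total number of words) and store it in a frequency dictionary.
Sort the dictionary by value and print the result.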
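
The steps above can be sketched end to end on a short sample string. This is only an illustration, not the article's code: the sample sentence and the names `sample`, `cleaned`, `frequencies` and `ranked` are assumptions made for the example, and `collections.Counter` stands in for a hand-written counting loop.

```python
from collections import Counter

sample = "Life as we know it has changed. Life goes on, as it must."

# Replace every character that is neither alphanumeric nor whitespace
# with a space, then lowercase the result
cleaned = "".join(
    ch if ch.isalnum() or ch.isspace() else " " for ch in sample
).lower()

# Split into a list of words and record how many there are
words = cleaned.split()
total = len(words)

# Count the occurrences of each word
counts = Counter(words)

# Relative frequency: occurrences divided by the total number of words
frequencies = {word: n / total for word, n in counts.items()}

# Sort by count, highest first (ties keep their order of first appearance)
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

print(total)       # 13
print(ranked[:3])  # [('life', 2), ('as', 2), ('it', 2)]
```

Here `frequencies['life']` is 2/13, roughly 0.154. Raw counts are enough for ranking, so the relative frequency is only needed when comparing texts of different lengths.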

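# Parse the response into a BeautifulSoup object
bs = BeautifulSoup(response.text, 'html.parser')
print(type(bs))  # <class 'bs4.BeautifulSoup'>
# Locate the article container
data = bs.find('div', class_='lft_art')
# Article title
title = data.find('h1').text
print(title)
# print(data.text)
# Get the article body
print(len('SONG CHEN/CHINA DAILY \n \n\n'))  # length: 26
# Drop the leading credit line, which is not part of the article: 'SONG CHEN/CHINA DAILY \n \n\n'
# (the offset 26 is specific to this page)
text = data.find('div', id='Content').text.strip()[26:]
# print(text)

# Replace special characters, method 1: str.replace(), one character at a time
for ch in '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+':
    text = text.replace(ch, ' ')
# print(text)

# Replace special characters, method 2: a regular expression with re.sub()
# ('-', '[', '\' and ']' are escaped so that they stay inside the character class)
text = re.sub(r'[’!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~]+', ' ', text)
# print(text)

# Remove the effect of capital letters
text = text.lower()
# print(text)
# Split the article into words
words_list = text.split()
# print(words_list)
# Create the counts dictionary that holds the number of occurrences of each word
counts = {}
# Count the occurrences of each word
for word in words_list:
    counts[word] = counts.get(word, 0) + 1
    # print(counts.get(word, 0) + 1)
# print(counts)
items = list(counts.items())  # convert to a list of (word, count) pairs
# Sort in descending order of count
items.sort(key=lambda x: x[1], reverse=True)
# Show the 15 most frequent words
print(items[0:15])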

<class 'bs4.BeautifulSoup'>
Life as we know it has changed, possibly forever
26
[('the', 83), ('to', 65), ('of', 51), ('and', 39), ('in', 33), ('a', 33), ('said', 23), ('people', 21), ('that', 19), ('be', 18), ('for', 16), ('are', 15), ('from', 15), ('they', 15), ('have', 14)]
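
The two punctuation-removal methods used above can also be written with `str.translate`, which handles every character in a single pass. The following is a sketch of that alternative rather than part of the original script; the names `PUNCTUATION`, `TABLE` and `strip_punctuation` are made up for the example.

```python
import string

# ASCII punctuation plus the curly apostrophe that the article's pattern also strips
PUNCTUATION = string.punctuation + "’"

# Translation table that maps each punctuation character to a space
TABLE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def strip_punctuation(text):
    """Return text with every punctuation character replaced by a space."""
    return text.translate(TABLE)


words = strip_punctuation("Don’t panic: it's fine, really!").lower().split()
print(words)  # ['don', 't', 'panic', 'it', 's', 'fine', 'really']
```

As with the original methods, apostrophes are treated as separators, so contractions such as "it's" are split into two tokens.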

Generating the word cloud

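1. Before generating the word cloud, prepare the following:

(1) The Python wordcloud library, the key library for generating word clouds;

(2) The numpy library, used here to turn the loaded image into an array;

(3) The jieba library, needed only to segment Chinese sentences; it can be skipped for English text;

(4) matplotlib, to display the word cloud directly on screen;

(5) PIL (Pillow), the widely used Python imaging library, to load and process images.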
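
Before running the word cloud code, it can help to check which of these libraries are already installed. The snippet below is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical. It checks import names, which do not always match the pip package name (Pillow is installed as `Pillow` but imported as `PIL`).

```python
from importlib.util import find_spec

# Import names of the libraries listed above
REQUIRED = ["wordcloud", "numpy", "jieba", "matplotlib", "PIL"]


def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


# Prints an empty list when everything is installed
print(missing_modules(REQUIRED))
```

Anything the check reports can then be installed with pip, for example `pip install wordcloud numpy jieba matplotlib Pillow`.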

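# Generate the word cloud
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from PIL import Image
import numpy as np

cut_text = jieba.cut(text)  # jieba segmentation; for English text, text.split() would also do
result = ' '.join(cut_text)
# print(result)
mask = np.array(Image.open('./運動.jpg'))  # load the mask image as an array
wc = WordCloud(
    background_color='white',  # background color
    width=1000,  # width and height are ignored when a mask is given
    height=600,
    max_font_size=50,  # largest font size
    min_font_size=10,  # smallest font size
    mask=mask,  # image whose shape the cloud fills
    max_words=1000)
wc.generate(result)
image = wc.to_image()
image.show()  # show the word cloud in the default image viewer

# Alternatively, display it with matplotlib
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

# Save the image
wc.to_file('jiab_englist11.png')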

<wordcloud.wordcloud.WordCloud at 0x1feb0c0f6d8>
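
The word counts computed earlier can also be passed to the word cloud directly, instead of handing it the raw text again. The sketch below assumes the `wordcloud` package is installed and uses its `generate_from_frequencies` method; the helper names `count_words` and `build_cloud` and the output file name are hypothetical, and no mask image is used so that the example does not depend on a local file.

```python
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    return Counter(text.lower().split())


def build_cloud(frequencies, output_path="wordcloud_from_counts.png"):
    """Render a word cloud from a word -> count mapping and save it as a PNG."""
    # Imported inside the function so that count_words() can be used
    # even where wordcloud is not installed.
    from wordcloud import WordCloud

    wc = WordCloud(background_color="white", width=1000, height=600, max_words=1000)
    wc.generate_from_frequencies(frequencies)
    wc.to_file(output_path)
    return wc
```

Calling `build_cloud(count_words(text))` with the cleaned `text` from the previous section would size each word by its count. Unlike `generate`, `generate_from_frequencies` does not apply the library's default stop-word filtering, so very common words such as "the" and "to" will dominate unless they are removed from the mapping first.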

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/119949.html
