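Preface

A friend of mine has recently been posting book and TV recommendations on an official account (公眾號), and I wanted to help out by making a word cloud from book reviews in the hope that it would draw some attention. So I got straight to it: I gathered material online, adapted it to my own needs, and finally got it working!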

1. Goal

Scrape book reviews or film reviews, extract the words used in the comments, and turn them into a word cloud.
[Image: example word cloud]
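2. What this article covers

This article touches on quite a few topics, and the code is commented throughout. Here is a quick list:
(1) String processing in Python (removing the parts you don't want) with the re library
(2) Reading and writing files in Python
(3) Converting between Python data types such as strings, lists and dicts
(4) Web scraping
(5) Using jieba (結巴, "stutter", ha ha), a handy word-segmentation library
(6) Using wordcloud, a word-cloud generation library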
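
Item (3) shows up in the scripts mainly as ''.join(...) calls. As a small sketch of those conversions (the sample words are made up for illustration and are not taken from any scraped data):

```python
from collections import Counter

# str.join accepts any iterable of strings.
parts = ["小王子", "玫瑰", "狐狸"]      # a list of strings
as_text = "".join(parts)                # list -> one string

scores = {"小王子": 5, "玫瑰": 3}       # iterating over a dict yields its keys
keys_text = "".join(scores)             # dict -> string made of its keys

# list of words -> counts -> list of (word, count) pairs, most common first
tokens = ["玫瑰", "狐狸", "玫瑰", "小王子", "玫瑰", "狐狸"]
counts = Counter(tokens)
top_two = counts.most_common(2)

print(as_text)
print(keys_text)
print(top_two)
```

Counting words this way is optional here, since WordCloud.generate does its own counting on the space-separated text, but it is a handy way to inspect which words will dominate the picture.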

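I. Preparation

1. A Python environment

2. The Python libraries used here, installed with pip install <package name>

pip install jieba
pip install wordcloud
(The other libraries used in this article, such as requests, numpy and Pillow (imported as PIL), also need to be installed if you don't have them.)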
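
To see which of these libraries are still missing before running the scripts, a check along the following lines may help. It is only a sketch: the helper name missing_packages and the REQUIRED table are my own, built from the imports that appear in this article.

```python
import importlib.util

# Import name -> name to pass to pip (the two differ for Pillow).
REQUIRED = {
    "jieba": "jieba",
    "wordcloud": "wordcloud",
    "numpy": "numpy",
    "PIL": "Pillow",
    "requests": "requests",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of the top-level modules that cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Run: pip install " + " ".join(missing))
    else:
        print("All required libraries are installed.")
```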

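II. Importing the libraries

import jieba
from wordcloud import WordCloud

Note: I hit a rather embarrassing error when importing wordcloud. The message was ImportError: cannot import name 'WordCloud', which baffled me at the time: why would my computer behave differently from everyone else's?
It turned out that I had named my own Python file "wordcloud.Py", so the import picked up my script instead of the installed library. The fix was simply to rename the file.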
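
A rough way to catch this kind of name clash is to look for files in the script's folder that share a name with a library you import. The helper below is a hypothetical sketch (shadowing_files is not part of any library), and it ignores the case of the extension so that a name like wordcloud.Py is reported too.

```python
from pathlib import Path

def shadowing_files(script_dir, module_names):
    """Return the .py files in script_dir named after one of module_names."""
    wanted = set(module_names)
    return sorted(
        p.name for p in Path(script_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".py" and p.stem in wanted
    )

if __name__ == "__main__":
    # Check the current folder against the libraries used in this article.
    print(shadowing_files(".", ["wordcloud", "jieba", "requests", "numpy"]))
```

When an import does succeed but behaves strangely, printing the module's __file__ attribute (for example wordcloud.__file__) shows which file was actually loaded.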

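III. Basic functionality

Building a word cloud from a given text

# Build a word cloud from a given text (put the text in wordcloudtext.txt)
import re
import jieba
from wordcloud import WordCloud
import numpy
from PIL import Image

# Create the word cloud
def create_wordcloud(content, savename):
    mask = numpy.array(Image.open("ball.jpg"))  # mask image: the word cloud takes the shape of this picture
    contents = ''.join(content)   # join the input into one string: a list has its items joined, a dict has its keys joined
    content_cut = jieba.cut(contents, cut_all=False)   # jieba.cut segments the text; cut_all selects full mode (True) or accurate mode (False)
    content_space_split = ' '.join(content_cut)   # join the segmented words with spaces
    result = WordCloud(font_path='simhei.ttf',  # a font with Chinese glyphs; the file must be findable at this path
                   mask=mask,  # when a mask is given, its shape is used and width/height are ignored
                   background_color='white',  # background color
                   width=1000,
                   height=600).generate(content_space_split)  # generate the word cloud
    result.to_file('%s.png' % savename)  # save the word cloud as an image

# Remove the non-Chinese parts of the text
def find_chinese(file):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')  # anything outside U+4E00 to U+9FA5
    chinese = re.sub(pattern, '', file)
    #print(chinese)
    return chinese

if __name__ == '__main__':
    # Read the .txt file that holds the text for the word cloud.
    # The raw string keeps the backslashes of the Windows path literal.
    # encoding='utf-8' is an assumption: it requires the file to be saved as UTF-8.
    with open(r'D:\ryc\python_learning\other\3_wordcloud\wordcloudtext.txt', 'r', encoding='utf-8') as f:
        content = f.read()
        content = find_chinese(content)
        # Remove unwanted high-frequency words such as 我, 你, 他.
        # Note: a character class deletes each listed character wherever it occurs, even inside other words.
        content = re.sub('[我你他的了但是就還要不會那在有都才看也又太像可中卻很到對時候能這而當沒]', '', content)
        print(content)
    create_wordcloud(content, '詞云評論')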
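
To make the two cleaning steps concrete, here is a small sketch that runs on sample strings. It shows what find_chinese keeps, and contrasts the character-class removal used above with filtering whole words after segmentation. The function remove_stopwords is my own illustration rather than part of the script, and the token list is segmented by hand; in the script the tokens would come from jieba.cut.

```python
import re

NON_CHINESE = re.compile(r'[^\u4e00-\u9fa5]')

def find_chinese(text):
    """Drop every character outside the range U+4E00 to U+9FA5."""
    return NON_CHINESE.sub('', text)

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

# Letters, digits, spaces and punctuation are removed; Chinese characters stay.
cleaned = find_chinese('小王子 The Little Prince, 1943年！')

# A character class deletes single characters wherever they occur,
# so removing 是 also damages the word 于是.
by_char = re.sub('[但是]', '', '于是他走了')

# Filtering whole tokens leaves the other words intact.
tokens = ['于是', '他', '走', '了', '但是', '玫瑰', '還', '在']
by_word = remove_stopwords(tokens, {'但是', '他', '了', '還', '在'})

print(cleaned)
print(by_char)
print(by_word)
```

Whether word-level filtering is worth the extra step depends on the text; for a quick picture the character class is often good enough.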

IV. Scraping book reviews and building the word cloud

import requests
import re
import jieba
from wordcloud import WordCloud
import numpy
from PIL import Image

# Remove the non-Chinese parts of the text
def find_chinese(file):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    chinese = re.sub(pattern, '', file)
    return chinese

# Scrape the short comments on The Little Prince (小王子)
def spider_xiaowangzi():
    commentres = ''
    # The raw string keeps the backslashes of the Windows path literal.
    with open(r'D:\ryc\python_learning\other\3_wordcloud\spider_wordcloud.txt', 'w', encoding='utf-8') as f:
        url = 'https://book.douban.com/subject/1084336/comments/?percent_type=h&limit=20&status=P&sort=new_score'  # target URL to scrape
        header = {
            'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
            'Cookie' : 'll="118100"; bid=gr9hyjlFAIs; __utmz=30149280.1586961843.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _vwo_uuid_v2=DAACAB21E936827CFA01C7ADE5CAF4293|cc739421584c029e0df955dae135ef07; __gads=ID=2846de2c51666e22:T=1587043991:S=ALNI_MaF-A9RTL8744UwEClUMK5nqOC8nw; _ga=GA1.2.507836951.1586961843; gr_user_id=3aa1d1e6-1a34-4b05-82ef-8066755b0ca9; __utmz=81379588.1587383782.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __yadk_uid=Ot4d328rVsOjtT0KtAdZE0rJH9jhIhlq; viewed="1007305"; ap_v=0,6.0; __utmc=30149280; __utma=30149280.507836951.1586961843.1588474302.1588476961.13; __utmt_douban=1; __utmb=30149280.1.10.1588476961; __utma=81379588.507836951.1586961843.1587445404.1588476961.5; __utmc=81379588; __utmt=1; __utmb=81379588.1.10.1588476961; _pk_id.100001.3ac3=bc45a2b0c4ddf6ec.1587383783.5.1588476961.1587445427.; _pk_ses.100001.3ac3=*'
        }  # send request headers so the request is less likely to be blocked
        try:
            data = requests.get(url, headers=header, timeout=10).text
        except requests.RequestException:
            print('Scraping failed')
            raise SystemExit(1)
        # Parse the comment text out of the fetched data (the result is a list)
        comment = re.findall('<span class="short">(.*?)</span>', data)  # e.g. <span class="short">十幾歲的時候渴慕著小王子，一天之間可以看四十四次日落，是在多久之后才明白，看四十四次日落的小王子，他有多么難過，</span>

        commentres = ''.join(comment)   # turn the list into one complete string
        commentres = find_chinese(commentres)      # remove the non-Chinese parts
        commentres = re.sub('[我你他的了但是就還要不會那在有都才看也又太像可中卻很說到對]', '', commentres)  # remove unwanted high-frequency words such as 我, 你, 他

        f.write("{duanpin}\n".format(duanpin=commentres))  # write the result to the .txt file
        #print (commentres)
        return commentres
        

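def create_wordcloud(content, savename):
    mask = numpy.array(Image.open("ball.jpg"))    # mask image: the word cloud takes the shape of this picture
    contents = ''.join(content)   # join the input into one string: a list has its items joined, a dict has its keys joined
    content_cut = jieba.cut(contents, cut_all=False)   # jieba.cut segments the text; cut_all selects full mode (True) or accurate mode (False)
    content_space_split = ' '.join(content_cut)    # join the segmented words with spaces
    result = WordCloud(font_path='simhei.ttf',  # a font with Chinese glyphs; the file must be findable at this path
                   mask=mask,  # when a mask is given, its shape is used and width/height are ignored
                   background_color='white',  # background color
                   width=1000,
                   height=600).generate(content_space_split)  # generate the word cloud
    result.to_file('%s.png' % savename)  # save the word cloud as an image

if __name__ == "__main__":
    comment = spider_xiaowangzi()
    create_wordcloud(comment, '小王子詞云評論')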
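
The extraction step can be tried offline on a sample string. The HTML below is hypothetical, modelled only on the pattern used in the script, and the real page may look different. The sketch also shows one thing to watch for: without the re.S flag, "." does not match a newline, so a comment that spans several lines is skipped.

```python
import re

# Hypothetical markup for illustration; not copied from the real page.
sample_html = (
    '<span class="short">第一條短評</span>\n'
    '<span class="short">第二條短評\n跨了兩行</span>\n'
    '<span class="other">不是短評</span>'
)

pattern = '<span class="short">(.*?)</span>'

# Default behaviour: "." stops at a newline, so the two-line comment is missed.
single_line_only = re.findall(pattern, sample_html)

# With re.S (DOTALL), "." also matches newlines and both comments are found.
with_dotall = re.findall(pattern, sample_html, flags=re.S)

print(single_line_only)
print(with_dotall)
```

Regular expressions are a quick fit for a page this simple, but they break easily when the markup changes, so checking that the list is not empty before building the word cloud is a sensible precaution.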

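Finally

I'll keep posting fun Python projects like this one. If you're interested, follow me to catch new posts as they come out!
(Since you've read this far, please leave a like before you go. Writing takes effort!)

More Python example projects: https://blog.csdn.net/weixin_45386875/article/details/113766276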

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/259580.html
