





Scraping Douban short comments for Detective Chinatown 3
URL to scrape:
https://movie.douban.com/subject/27619748/comments?start=20&limit=20&status=P&sort=new_score

url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' % (movie_id, (i - 1) * 20)
Here i is the current page number, starting from 1, so page 1 maps to start=0.
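The mapping from page number to the start offset can be sketched as a small helper. This is a minimal sketch assuming 20 comments per page and pages counted from 1; the function name comment_page_url and the page_size parameter are hypothetical and not part of the original program.

```python
def comment_page_url(movie_id, page, page_size=20):
    """Build the short-comment URL for a 1-based page number (hypothetical helper)."""
    start = (page - 1) * page_size  # page 1 -> start=0, page 2 -> start=20
    return (
        'https://movie.douban.com/subject/%s/comments'
        '?start=%s&limit=%s&sort=new_score&status=P'
        % (movie_id, start, page_size)
    )


if __name__ == '__main__':
    # Print the URLs of the first three pages for the movie used in this post.
    for page in range(1, 4):
        print(comment_page_url('27619748', page))
```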
Press F12 in Chrome to open the developer tools, inspect the page source, and locate the short comments, noting which div and which tag they sit under.

Analyzing the page source
The comments sit under div[id='comments'], inside each div[class='comment-item'], in the first span[class='short']. A regular expression extracts the comment text; the code is:
url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' \
      % (movie_id, (i - 1) * 20)
req = requests.get(url, headers=headers)
req.encoding = 'utf-8'
# Non-greedy match stops at the nearest </span>; re.S also captures comments that contain line breaks
comments = re.findall(r'<span class="short">(.*?)</span>', req.text, re.S)
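The extraction step can be checked offline against a small HTML fragment. This is a sketch only: the fragment below is made up to mimic the structure described above and is not real Douban output, and the helper name extract_comments is hypothetical. It assumes a non-greedy pattern with re.S, and it decodes HTML entities such as &amp;amp; with the standard library.

```python
import html
import re

# Made-up fragment shaped like the structure described above; not real Douban output.
SAMPLE_HTML = '''
<div id="comments">
  <div class="comment-item">
    <span class="short">First comment &amp; more</span>
  </div>
  <div class="comment-item">
    <span class="short">Second comment
spanning two lines</span>
  </div>
</div>
'''

# Non-greedy (.*?) stops at the nearest </span>; re.S lets "." match newlines.
SHORT_RE = re.compile(r'<span class="short">(.*?)</span>', re.S)


def extract_comments(page_html):
    """Return the short-comment texts with HTML entities decoded (hypothetical helper)."""
    return [html.unescape(text).strip() for text in SHORT_RE.findall(page_html)]


if __name__ == '__main__':
    print(extract_comments(SAMPLE_HTML))
```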
Segment the text with jieba, which splits Chinese text into words according to Chinese usage:
with open(file_name, 'r', encoding='utf8') as f:
    word_list = jieba.cut(f.read())
result = " ".join(word_list)  # join the tokens with spaces
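Before rendering a cloud, it can help to look at which tokens dominate the space-joined string. The following is a minimal sketch using only the standard library; the function name top_words, its parameters, and the sample string are assumptions for illustration, and the sample merely stands in for real jieba output.

```python
from collections import Counter


def top_words(segmented_text, n=10, min_length=2):
    """Count space-separated tokens, ignoring tokens shorter than min_length."""
    tokens = [t for t in segmented_text.split() if len(t) >= min_length]
    return Counter(tokens).most_common(n)


if __name__ == '__main__':
    # Hand-written stand-in for the space-joined jieba output.
    sample = '电影 很 好看 , 电影 剧情 一般 , 演员 好看'
    print(top_words(sample, n=3))
```

Single-character tokens and punctuation are skipped here by the length threshold, which is a crude filter rather than a proper stop-word list.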
Generate the word cloud with stylecloud:
if icon_name is not None and len(icon_name) > 0:
    gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc', output_name=pic)
else:
    gen_stylecloud(text=result, font_path='simsun.ttc', output_name=pic)
Full code:
# Analyze Douban short comments on Detective Chinatown 3 and generate word clouds
# https://movie.douban.com/subject/27619748/comments?start=20&limit=20&status=P&sort=new_score
# url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' \
#       % (movie_id, (i - 1) * 20)
import re

import jieba
import requests
from stylecloud import gen_stylecloud

# Browser-like User-Agent sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
}
def jieba_cloud(file_name, icon):
    with open(file_name, 'r', encoding='utf8') as f:
        word_list = jieba.cut(f.read())
    result = " ".join(word_list)  # join the tokens with spaces
    # Build the Chinese word cloud; an empty icon_name falls back to stylecloud's default icon
    icon_name = ''
    if icon == "1":
        icon_name = ''
    elif icon == "2":
        icon_name = "fas fa-dragon"
    elif icon == "3":
        icon_name = "fas fa-dog"
    elif icon == "4":
        icon_name = "fas fa-cat"
    elif icon == "5":
        icon_name = "fas fa-dove"
    elif icon == "6":
        icon_name = "fab fa-qq"
    pic = str(icon) + '.png'  # output file name, e.g. 2.png
    if icon_name is not None and len(icon_name) > 0:
        gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc', output_name=pic)
    else:
        gen_stylecloud(text=result, font_path='simsun.ttc', output_name=pic)
    return pic
# Scrape the short comments
def spider_comment(movie_id, page):
    with open("douban.txt", "a+", encoding='utf-8') as f:
        for i in range(1, page + 1):
            url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' \
                  % (movie_id, (i - 1) * 20)
            req = requests.get(url, headers=headers)
            req.encoding = 'utf-8'
            # Non-greedy match stops at the nearest </span>; re.S also captures multi-line comments
            comments = re.findall(r'<span class="short">(.*?)</span>', req.text, re.S)
            # One comment per line; the trailing newline keeps consecutive pages from running together
            f.write('\n'.join(comments) + '\n')
            print(comments)
# Entry point
if __name__ == '__main__':
    movie_id = '27619748'
    page = 10
    spider_comment(movie_id, page)
    jieba_cloud("douban.txt", "1")
    jieba_cloud("douban.txt", "2")
    jieba_cloud("douban.txt", "3")
    jieba_cloud("douban.txt", "4")
    jieba_cloud("douban.txt", "5")
    jieba_cloud("douban.txt", "6")
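The icon selection in jieba_cloud above could also be written as a dictionary lookup. This is a sketch of one possible refactoring, not the original author's code: ICON_NAMES, resolve_icon, and stylecloud_kwargs are hypothetical names, while the keyword arguments (text, icon_name, font_path, output_name) are the ones already passed to gen_stylecloud in the full code.

```python
# Menu choice -> Font Awesome class name, copied from the if/elif chain.
# An empty string means "do not pass icon_name, use stylecloud's default".
ICON_NAMES = {
    '1': '',
    '2': 'fas fa-dragon',
    '3': 'fas fa-dog',
    '4': 'fas fa-cat',
    '5': 'fas fa-dove',
    '6': 'fab fa-qq',
}


def resolve_icon(icon):
    """Return the icon class for a menu choice, or '' for unknown choices."""
    return ICON_NAMES.get(str(icon), '')


def stylecloud_kwargs(text, icon, font_path='simsun.ttc'):
    """Assemble keyword arguments for gen_stylecloud (hypothetical helper)."""
    kwargs = {
        'text': text,
        'font_path': font_path,
        'output_name': str(icon) + '.png',
    }
    icon_name = resolve_icon(icon)
    if icon_name:  # only pass icon_name when one was selected
        kwargs['icon_name'] = icon_name
    return kwargs


if __name__ == '__main__':
    print(stylecloud_kwargs('word1 word2', '2'))
```

With stylecloud installed, the call would then read gen_stylecloud(**stylecloud_kwargs(result, '2')), which replaces both branches of the if/else.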
Generated douban.txt (excerpt):

Generated word clouds:






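The clouds look flashy, but they contain too many meaningless words; the tokens need cleaning before the cloud yields useful information. The image at the top of this post shows the result after cleaning and customization. For the method, see:
Scraping Douban short comments for Hi, Mom with Python and making a cooler word cloud with stylecloud (Python爬取你好李煥英豆瓣短評并利用stylecloud制作更酷炫的詞云圖)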
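As a rough idea of what such cleaning can involve, the sketch below drops stop words and punctuation-only tokens before the text is joined and passed on. It is an assumption for illustration and not necessarily the method of the linked post: the STOPWORDS set is a tiny made-up sample, and clean_tokens is a hypothetical helper that accepts any iterable of tokens, such as the output of jieba.cut.

```python
# Illustrative stop-word set; a real run would load a much fuller list.
STOPWORDS = {'的', '了', '是', '我', '也', '很', '就', '都'}


def clean_tokens(tokens, stopwords=STOPWORDS):
    """Drop empty tokens, stop words, and tokens made only of punctuation."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token or token in stopwords:
            continue
        # Keep a token only if it contains at least one letter, digit, or CJK character.
        if not any(ch.isalnum() for ch in token):
            continue
        cleaned.append(token)
    return cleaned


if __name__ == '__main__':
    tokens = ['电影', '的', ',', ' ', '好看', '了', '!']
    # The cleaned tokens are joined with spaces, as in the original program.
    print(' '.join(clean_tokens(tokens)))
```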

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/263451.html
