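Preparation

Environment

Python 3.8 interpreter
PyCharm editor

Modules to install manually

parsel      HTML parsing module
requests    HTTP request module

Install both with pip from cmd: pip install parsel requests
The visualization section additionally imports pandas, jieba, wordcloud and pyecharts, which are installed the same way.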
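
Before running anything, it can help to confirm that the modules are actually importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for this illustration and is not part of the original code.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    # parsel and requests are the two third-party modules used for scraping
    missing = missing_modules(['parsel', 'requests'])
    if missing:
        print('Install first: pip install ' + ' '.join(missing))
    else:
        print('All required modules are installed')
```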

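Collecting the data

Basic workflow

I. Analyzing the data source

1. Clarify the requirements

Which site are we collecting from?
Which data are we collecting?

2. Inspect the network traffic to find where the data comes from

Use the browser's built-in developer tools to inspect the traffic.

Open the developer tools: press F12, or right-click the page, choose Inspect, then select the Network tab
Refresh the page: so that the page's data is loaded again
Keyword search: search for a keyword <the data you want> to find the request that returns it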

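II. Implementation steps

The four basic steps

Send the request: simulate a browser and send a request to the URL
Get the data: read the response returned by the server
Developer tools --> Response
Parse the data: extract the content we want
The comment-related fields
Save the data: write the content to a CSV file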
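
The four steps above can be sketched as a small skeleton. This is only an illustration of the control flow, not the article's actual code: the function `scrape` and its `fetch`, `parse` and `save` parameters are hypothetical names, and the stand-in lambdas exist only so the sketch runs offline.

```python
def scrape(urls, fetch, parse, save):
    """Run the four steps for every URL and return the number of rows saved."""
    saved = 0
    for url in urls:
        html = fetch(url)          # steps 1 and 2: send the request, get the response text
        for row in parse(html):    # step 3: extract the fields we want
            save(row)              # step 4: store one row
            saved += 1
    return saved


if __name__ == '__main__':
    # Stand-in functions so the sketch runs offline; real code would use
    # requests for fetch, parsel for parse and csv.DictWriter for save.
    rows = []
    count = scrape(
        urls=['page-1', 'page-2'],
        fetch=lambda url: f'<html>{url}</html>',
        parse=lambda html: [{'page': html}],
        save=rows.append,
    )
    print(count, rows)
```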

Hands-on code

Send the request, simulating a browser requesting the URL

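import requests

# Each page holds 20 comments; start = 0, 20, ..., 180 covers the first 10 pages
for page in range(0, 200, 20):
    # Request URL
    url = f'https://movie.douban.com/subject/35267208/comments?start={page}&limit=20&status=P&sort=new_score'
    # Disguise the request as coming from a browser
    headers = {
        # User-Agent identifies the browser to the server
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
    }
    # Send the request
    response = requests.get(url=url, headers=headers)
    print(response)  # <Response [200]> means the request succeeded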
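
The page URLs in the loop above differ only in the `start` offset. As a rough sketch of that pagination, the query string can also be built with the standard library; the names `BASE_URL` and `build_comment_urls` are assumptions made for this example.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/35267208/comments'


def build_comment_urls(pages=10, page_size=20):
    """Build one URL per page; `start` is the offset of the first comment on that page."""
    urls = []
    for start in range(0, pages * page_size, page_size):
        query = urlencode({
            'start': start,
            'limit': page_size,
            'status': 'P',
            'sort': 'new_score',
        })
        urls.append(f'{BASE_URL}?{query}')
    return urls


if __name__ == '__main__':
    for url in build_comment_urls(pages=3):
        print(url)
```

With the defaults this yields the same ten offsets (0 to 180) as `range(0, 200, 20)`.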

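Get the data: read the response text returned by the server

print(response.text)

Parse the data: extract the content we want

Convert the downloaded HTML string <response.text> into a selector object that can be queried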

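import parsel

# This block runs inside the page loop, once per response
selector = parsel.Selector(response.text)
# First pass: select every comment container div
divs = selector.css('div.comment-item')
# Loop over the matches and extract the fields of each comment
for div in divs:
    name = div.css('.comment-info a::text').get()  # nickname
    rating = div.css('.rating::attr(title)').get()  # rating label
    date = div.css('.comment-time::attr(title)').get()  # time
    area = div.css('.comment-location::text').get()  # location
    votes = div.css('.votes::text').get()  # "useful" votes
    # get() returns None when nothing matches, so fall back to '' before replace()
    short = div.css('.short::text').get(default='').replace('\n', '')  # comment text
    # Collect the fields in a dict
    dit = {
        '昵稱': name,
        '推薦': rating,
        '時間': date,
        '地區': area,
        '有用': votes,
        '評論': short,
    }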
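
parsel's `.get()` returns None when a selector matches nothing, which may happen for optional fields such as the rating, so calling a string method directly on the result can raise AttributeError. A minimal sketch of a guard follows; the helper name `clean_text` is hypothetical and not part of the original code.

```python
def clean_text(value, default=''):
    """Return `value` without newlines and surrounding whitespace, or `default` if it is None."""
    if value is None:
        return default
    return value.replace('\n', '').strip()


if __name__ == '__main__':
    print(repr(clean_text('  great film\n')))
    print(repr(clean_text(None, default='no rating')))
```

Each extracted field could be passed through such a helper before it goes into the dict.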

Write the data (still inside the for div in divs loop, one row per comment)

csv_writer.writerow(dit)
print(name, rating, date, area, votes, short)

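Create the file object (run this before the request loop, because the loop writes rows through csv_writer)

import csv

# mode='a' appends, so re-running the script adds another header row; use mode='w' to start fresh
f = open('data10.csv', mode='a', encoding='utf-8-sig', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '昵稱',
    '推薦',
    '時間',
    '地區',
    '有用',
    '評論',
])

Write the header row

csv_writer.writeheader()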
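
To make the order of operations explicit (create the writer, write the header once, then write one row per comment), here is a small self-contained sketch that writes to an in-memory buffer instead of data10.csv. The function name `rows_to_csv` and the sample row are made up for illustration.

```python
import csv
import io

FIELDNAMES = ['昵稱', '推薦', '時間', '地區', '有用', '評論']


def rows_to_csv(rows):
    """Write the header once, then one line per row dict, and return the CSV text."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == '__main__':
    # Made-up sample row, for illustration only
    sample = {'昵稱': 'user1', '推薦': '力薦', '時間': '2022-05-01 12:00:00',
              '地區': '北京', '有用': '3', '評論': 'sample comment'}
    print(rows_to_csv([sample]))
```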

Visualization: word cloud

Code

import pandas as pd
import jieba
import wordcloud

df = pd.read_csv('data10.csv')
df.head()

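# Drop empty comments (read back as NaN) so that ''.join() only receives strings
info_list = df['評論'].dropna().astype(str).to_list()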
string = ' '.join(jieba.lcut(''.join(info_list)))
string

wc = wordcloud.WordCloud(
    width=1000,
    height=700,
    background_color='white',
    font_path='msyh.ttc',
    scale=15,
)
wc.generate(string)
wc.to_file('1.png')

evaluate_num = df['推薦'].value_counts().to_list()
evaluate_type = df['推薦'].value_counts().index.to_list()

import pyecharts.options as opts
from pyecharts.charts import Pie

data_pair = [list(z) for z in zip(evaluate_type, evaluate_num)]
data_pair.sort(key=lambda x: x[1])
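
The value_counts() calls and the zip above produce a list of [label, count] pairs sorted by ascending count. As a rough pandas-free illustration of that shape, here is a sketch using collections.Counter; the function name `rating_pairs` and the sample ratings are made up for this example.

```python
from collections import Counter


def rating_pairs(ratings):
    """Count each rating label and return [label, count] pairs sorted by ascending count."""
    pairs = [[label, count] for label, count in Counter(ratings).items()]
    pairs.sort(key=lambda pair: pair[1])
    return pairs


if __name__ == '__main__':
    # Made-up ratings, for illustration only
    sample = ['力薦', '推薦', '力薦', '還行', '力薦', '推薦']
    print(rating_pairs(sample))
```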

c = (
    Pie(init_opts=opts.InitOpts(bg_color="#2c343c"))
    .add(
        series_name="Douban reviews",
        data_pair=data_pair,
        rosetype="radius",
        radius="55%",
        center=["50%", "50%"],
        label_opts=opts.LabelOpts(is_show=False, position="center"),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Rating distribution",
            pos_left="center",
            pos_top="20",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff"),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
    .set_series_opts(
        tooltip_opts=opts.TooltipOpts(
            trigger="item", formatter="{a} <br/>{b}: {c} ({d}%)"
        ),
        label_opts=opts.LabelOpts(color="rgba(255, 255, 255, 0.3)"),
    )
)
c.render_notebook()

Results

That's all for today's post. Give it a try, and see you next time!

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/543257.html
