Scraping Dangdang Product and Review Data with Python and Building a Word Cloud

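Preface

Hello everyone~ Qianqian here.

The site scraped this time: Dangdang's book channel, which describes itself as the world's largest Chinese-language online bookstore.

It sells books in categories such as fiction and biography, youth literature, self-help, and investing and personal finance,

and publishes bestseller lists with current prices, promotions, and reviews.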

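Environment:

Python 3.8
PyCharm

Modules:

requests >>> pip install requests
parsel >>> pip install parsel
csv (built in, no installation needed)
jieba >>> pip install jieba (used for the word cloud)
wordcloud >>> pip install wordcloud (used for the word cloud)
imageio >>> pip install imageio (used for the word cloud)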

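Basic crawler workflow:

1. Analyze the data source

Decide what data you want to collect.
Capture and inspect the traffic to find where that data comes from, that is, which URL must be requested to get it.

Open the developer tools (F12, or right-click and choose Inspect), select the Network tab, and reload the page.
Search the captured requests for a keyword from the data you want (for example, a book title) to find the request whose response contains it.

Requesting that URL returns the data we want.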
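
The keyword search described above can also be pictured in code. The following is only a minimal sketch: the `captured` dictionary, its URLs, and the helper name `find_matching_urls` are made up for illustration and stand in for the list of requests shown in the Network tab.

```python
def find_matching_urls(captured, keyword):
    """Return the URLs whose response body contains the keyword."""
    return [url for url, body in captured.items() if keyword in body]


# Hypothetical captured traffic: URL -> response body (made-up sample data)
captured = {
    'http://example.com/static/style.css': 'body { margin: 0; }',
    'http://example.com/books/page-1': '<li><a title="Sample Book">Sample Book</a></li>',
    'http://example.com/api/ads': '{"ads": []}',
}

# Search every captured response for a keyword from the data we want
matches = find_matching_urls(captured, 'Sample Book')
print(matches)
```

Whichever URL comes back is the one the crawler should request.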

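2. Implementation steps:

Send the request: imitate a browser and request the URL.
Get the data: read the response the server returns (the Response tab in the developer tools).
Parse the data: extract the content we want, here the basic book information.
Save the data: write it to a spreadsheet (a CSV file).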
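
The four steps can be laid out as a small skeleton before filling in the real details. This is a sketch under assumptions, not the scraper itself: `fetch` returns canned HTML instead of making a request, the HTML and the function names are invented for illustration, and parsing uses a regular expression from the standard library so that the example runs offline.

```python
import csv
import io
import re


def fetch(url):
    """Steps 1-2 stand-in: return the HTML for url (canned here, no real request)."""
    return (
        '<ul class="bang_list_mode">'
        '<li><div class="name"><a title="Book A" href="/a.html">Book A</a></div></li>'
        '<li><div class="name"><a title="Book B" href="/b.html">Book B</a></div></li>'
        '</ul>'
    )


def parse(html):
    """Step 3: pull (title, href) pairs out of the HTML."""
    pattern = r'<a title="(.*?)" href="(.*?)">'
    return [{'title': t, 'href': h} for t, h in re.findall(pattern, html)]


def save(rows, fileobj):
    """Step 4: write the rows as CSV with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=['title', 'href'])
    writer.writeheader()
    writer.writerows(rows)


html = fetch('http://example.com/books')  # hypothetical URL
rows = parse(html)
buffer = io.StringIO()  # in-memory file, so nothing is written to disk
save(rows, buffer)
print(buffer.getvalue())
```

Keeping the steps in separate functions makes it easy to swap the canned `fetch` for a real HTTP request later without touching the parsing or saving code.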

Data collection

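# Import the HTTP request module ---> third-party; install with `pip install requests` in cmd
import requests
# Import the parsing module ---> third-party; install with `pip install parsel` in cmd
import parsel
# Import the csv module ---> built in, no installation needed
import csv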

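# Create the output file (mode='a' appends, so re-running the script adds another header row)
f = open('書籍data25頁.csv', mode='a', encoding='utf-8', newline='')
# f is the file object; fieldnames become the first row of the table (the header)
csv_writer = csv.DictWriter(f, fieldnames=[
    '標題',  # title
    '評論',  # review count
    '推薦',  # recommendation rate
    '作者',  # author
    '日期',  # date
    '出版社',  # publisher
    '售價',  # sale price
    '原價',  # list price
    '折扣',  # discount
    '電子書',  # e-book price
    '詳情頁',  # detail page URL
])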
# Write the header row
csv_writer.writeheader()
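"""
1. Send the request: imitate a browser and request the URL
    - The name to the left of the equals sign is the variable being defined
    - Imitating a browser ---> request headers
        headers ---> copied from the developer tools, stored as a dict
        A simple anti-anti-scraping measure, so the server is less likely to identify the program as a crawler
    - The request method (GET or POST) follows what the developer tools show
"""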
for page in range(1, 26):  # range(1, 26) yields 1 through 25; 26 is excluded
    # Build the request URL
    url = f'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-{page}'
    # Imitate a browser ---> request headers
    headers = {
        # User-Agent identifies the browser making the request
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
    }
    # Send the request; the return value is a response object ---> <Response [200]>, where status code 200 means the request succeeded
    response = requests.get(url=url, headers=headers)
    print(response)
    # 2. Get the data: the server's response body ---> the Response tab in the developer tools; print(response.text) shows it
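    """
    3. Parse the data: extract the content we want, here the basic book information
    Choose the parsing method that best fits the type of data returned and the content wanted:
        - re regular expressions
        - CSS selectors
        - XPath
    XPath ---> extracts data by tag node
    CSS selectors ---> extract data by tag attributes
        If CSS selector syntax is unfamiliar, a selector can be copied from the developer tools and pasted in (Ctrl+C, Ctrl+V)
    """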
    # Convert the text into a Selector object: <Selector xpath=None data='<html xmlns="http://www.w3.org/1999/x...'>
    selector = parsel.Selector(response.text)
    # First pass: select all the li tags --> returns a list whose elements are Selector objects
    lis = selector.css('.bang_list_mode li')
    # Loop over them and run a second extraction for the fields we want
    for li in lis:
        """
        ::attr(name) selects an attribute
        a::attr(title) ---> the title attribute of the a tag
        get() returns the first match
        """
        # get(default='') keeps .replace() from failing when an element is missing
        title = li.css('.name a::attr(title)').get()  # title
        star = li.css('.star a::text').get(default='').replace('條評論', '')  # review count
        recommend = li.css('.tuijian::text').get(default='').replace('推薦', '')  # recommendation rate
        author = li.css('.publisher_info a::attr(title)').get()  # author
        date = li.css('.publisher_info span::text').get()  # date
        press = li.css('div:nth-child(6) a::text').get()  # publisher
        price_n = li.css('.price .price_n::text').get()  # sale price
        price_r = li.css('.price .price_r::text').get()  # list price
        price_s = li.css('.price .price_s::text').get(default='').replace('折', '')  # discount
        price_e = li.css('.price .price_e .price_n::text').get()  # e-book price
        href = li.css('.name a::attr(href)').get()  # detail page URL
        # Save the data: collect the fields into a dict keyed by the CSV header names
        dit = {
            '標題': title,
            '評論': star,
            '推薦': recommend,
            '作者': author,
            '日期': date,
            '出版社': press,
            '售價': price_n,
            '原價': price_r,
            '折扣': price_s,
            '電子書': price_e,
            '詳情頁': href,
        }
        # Write the row to the CSV file
        csv_writer.writerow(dit)
        print(title, star, recommend, author, date, press, price_n, price_r, price_s, price_e, href, sep=' | ')
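
The scraper stores every field as text, so numeric columns usually need converting before any analysis. The sketch below shows one way to do that. The helper names and the sample row are made up, and the exact text formats on the site (a currency sign before prices, a percent sign after the recommendation rate) are assumptions rather than something confirmed here.

```python
import re


def to_int(text):
    """Extract the first integer from text such as '12345'; None if there is none."""
    match = re.search(r'\d+', text or '')
    return int(match.group()) if match else None


def to_float(text):
    """Extract the first decimal number from text such as '¥39.50'; None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', text or '')
    return float(match.group()) if match else None


# Made-up sample row in the shape the scraper writes (formats are assumptions)
row = {'評論': '12345', '推薦': '99.5%', '售價': '¥39.50', '折扣': '5.0', '電子書': None}

cleaned = {
    'reviews': to_int(row['評論']),
    'recommend_pct': to_float(row['推薦']),
    'price': to_float(row['售價']),
    'discount': to_float(row['折扣']),
    'ebook_price': to_float(row['電子書']),  # missing values stay None
}
print(cleaned)
```

Searching for the number instead of stripping known characters means an unexpected prefix or suffix does not break the conversion.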

Reviews

# Import the modules: time for delays, requests for HTTP, re for regular expressions
import time
import requests
import re
for page in range(1, 11):
    time.sleep(1.5)
    # The request URL
    url = 'http://product.dangdang.com/index.php'
    # Query parameters
    data = {
        'r': 'comment/list',
        'productId': '27898031',
        'categoryPath': '01.43.77.07.00.00',
        'mainProductId': '27898031',
        'mediumId': '0',
        'pageIndex': page,
        'sortType': '1',
        'filterType': '1',
        'isSystem': '1',
        'tagId': '0',
        'tagFilterCount': '0',
        'template': 'publish',
        'long_or_short': 'short',
    }
    headers = {
        'Cookie': '__permanent_id=20220526142043051185927786403737954; dest_area=country_id%3D9000%26province_id%3D111%26city_id%20%3D0%26district_id%3D0%26town_id%3D0; ddscreen=2; secret_key=f4022441400c500aa79d59edd8918a6e; __visit_id=20220723213635653213297242210260506; __out_refer=; pos_6_start=1658583812022; pos_6_end=1658583812593; __trace_id=20220723214559176959858324136999851; __rpm=p_27898031.comment_body..1658583937494%7Cp_27898031.comment_body..1658583997600',
        'Host': 'product.dangdang.com',
        'Referer': 'http://product.dangdang.com/27898031.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
    }
    response = requests.get(url=url, params=data, headers=headers)
    # The JSON response embeds the review list as an HTML string
    html_data = response.json()['data']['list']['html']
    # Capture the text of each review link
    content_list = re.findall("<span><a href='.*?' target='_blank'>(.*?)</a></span>", html_data)
    for content in content_list:
        with open('評論.txt', mode='a', encoding='utf-8') as f:
            f.write(content)
            f.write('\n')
        print(content)
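
To see how the review text is pulled out without hitting the site, the extraction can be tried on a small payload. This is an illustrative sketch: the payload is invented and only mirrors the `['data']['list']['html']` nesting the code above indexes into, and the real HTML returned by the site may be laid out differently.

```python
import json
import re

# Made-up payload shaped like the response the review code indexes into
payload = json.dumps({
    'data': {'list': {'html': (
        "<div>"
        "<span><a href='http://example.com/review/1' target='_blank'>Great book</a></span>"
        "</div>"
        "<div>"
        "<span><a href='http://example.com/review/2' target='_blank'>Fast delivery</a></span>"
        "</div>"
    )}}
})

# Same two steps as the scraper: unwrap the JSON, then run the regex on the HTML
html_data = json.loads(payload)['data']['list']['html']
pattern = r"<span><a href='.*?' target='_blank'>(.*?)</a></span>"
content_list = re.findall(pattern, html_data)
print(content_list)
```

The non-greedy `.*?` matters here: it stops each match at the first closing `</a></span>`, so every review is captured separately.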

Word cloud

import jieba
import wordcloud
import imageio
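# Read the image used as the mask
py = imageio.imread('python.png')
# Open the review file and read its contents
with open('評論.txt', encoding='utf-8') as f:
    txt = f.read()
# Segment the text into words with jieba ---> returns a list
txt_list = jieba.lcut(txt)
print(txt_list)
# Join the list into one space-separated string
string = ' '.join(txt_list)
# Configure the word cloud
wc = wordcloud.WordCloud(
    height=300,  # height (ignored when a mask is given)
    width=500,  # width (ignored when a mask is given)
    background_color='white',  # background color
    font_path='msyh.ttc',  # font file; Chinese text needs a font with Chinese glyphs (this one is Microsoft YaHei)
    scale=15,  # scaling factor for the rendered image
    stopwords={'的', '了', '很', '也'},  # stopwords to exclude
    mask=py  # mask image that defines the shape of the cloud
)
wc.generate(string)  # pass in the text to build the cloud from
wc.to_file('1.png')  # write the image to a file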
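
A word cloud draws frequent words larger, so it helps to look at the counts behind the picture. The sketch below only illustrates the idea and is not how the wordcloud library is implemented. The sample text is made up and stands in for the space-joined output of `jieba.lcut`, and `word_frequencies` is a hypothetical helper.

```python
from collections import Counter


def word_frequencies(text, stopwords):
    """Count whitespace-separated tokens, skipping any that are stopwords."""
    tokens = [token for token in text.split() if token not in stopwords]
    return Counter(tokens)


# Made-up, already-segmented text standing in for ' '.join(jieba.lcut(txt))
string = '這 本 書 很 好 書 的 內容 很 好 書'
freq = word_frequencies(string, stopwords={'的', '了', '很', '也'})
print(freq.most_common(2))
```

Checking the top counts first is a quick way to spot filler words that should be added to the stopword set before rendering the image.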

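Closing words

Thanks for reading~ This flight ends here.

I hope this article helped and that you learned something from it~

Even the stars in hiding are working hard to shine, so keep going too (let's keep at it together).

Finally, I'd appreciate the usual three (like, comment, bookmark). They cost nothing, so why not~

If you don't know what to comment, even a quick 6666 is encouragement for me. Thank you!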

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/500523.html
