Python 爬取Steam熱銷榜資訊

最近學習了一下爬蟲，練練手，第一次寫文章，請多多包涵O(∩_∩)O
爬取Steam熱銷榜：游戲排名、游戲名字、價格、好評率、游戲詳情頁面跳轉鏈接，

Steam熱銷榜爬蟲

Python 爬取Steam熱銷榜資訊
一、開始爬蟲前
- 1.引入庫
- 2.讀入資料
二、分析網頁
- 1.首先觀察價格，有兩種形式，一種是原價，一種是打折后的價格
- 2.名字、游戲詳情頁鏈接
- 3.好評率
三、爬取資訊并處理
- 1.游戲詳情頁面鏈接
- 2.好評率
- 3.名字、價格
四、資訊保存
五、主函式
六、全部代碼
七、總結
- 1.心情

一、開始爬蟲前

1.引入庫

熱銷榜為靜態頁面，所需要的庫：requests，pandas，BeautifulSoup

import requests
import pandas as pd
from bs4 import BeautifulSoup

2.讀入資料

headers里需要加入 ‘Accept-Language’: 'zh-CN '，不然回傳來的游戲名字是英文

def get_text(url):
    try:
        headers = {
            "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/85.0.4183.102 Safari/537.36', 'Accept-Language': 'zh-CN '
        }
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "爬取網站失敗！"

二、分析網頁

熱銷榜鏈接：
https://store.steampowered.com/search/?filter=globaltopsellers&page=1&os=win

1.首先觀察價格，有兩種形式，一種是原價，一種是打折后的價格

在這里插入圖片描述
（1）正常價格

（2）打折后價格，對比可發現class = “col search_discount responsive_secondrow” 是用于儲存折扣的，原價為None，打折不為None，以此做為判斷條件提取價格，

2.名字、游戲詳情頁鏈接

由代碼可以知道：
游戲詳情頁鏈接全在 id = “search_resultsRows” 的a.href 標簽里
名字在 class = “title” 里
在這里插入圖片描述

3.好評率

由圖可以發現有些游戲沒有評價
在這里插入圖片描述
查看代碼發現：
好評率在 class = “col search_reviewscore responsive_secondrow” 中，無評價為None

三、爬取資訊并處理

1.游戲詳情頁面鏈接

提取a[‘href’] 保存到 jump_link 串列

link_text = soup.find_all("div", id="search_resultsRows")
    for k in link_text:
        b = k.find_all('a')
    for j in b:
        jump_link.append(j['href'])

2.好評率

提前好評率和評價數到 game_evaluation 串列，

u.span["data-tooltip-html"].split("<br>")[0]   # 評價
例：多半好評

u.span["data-tooltip-html"].split("<br>")[-1])  # 好評率
例：276,144 篇用戶的游戲評測中有 78% 為好評，

w = soup.find_all(class_="col search_reviewscore responsive_secondrow")
    for u in w:
        if u.span is not None:					# 判斷是否評價為None，
            game_evaluation.append(
                u.span["data-tooltip-html"].split("<br>")[0] + "," + u.span["data-tooltip-html"].split("<br>")[-1])
        else:
            game_evaluation.append("暫無評價！")

3.名字、價格

用strip去除多余空格，用split切割，提取其中的價格
global num 是游戲排名，
game_info 是串列，存盤所有的資訊，

price = z.find(class_="col search_price discounted responsive_secondrow").text.strip().split("￥")
print(price)					# 切割后的打折price
print(price[2].strip())			# 需要保存的價格

結果為：
['', ' 116', ' 58']				
58

global num
name_text = soup.find_all('div', class_="responsive_search_name_combined")
for z in name_text:
    # 每個游戲的價格
    name = z.find(class_="title").string.strip()
    # 判斷折扣是否為None，提取價格
    if z.find(class_="col search_discount responsive_secondrow").string is None:
        price = z.find(class_="col search_price discounted responsive_secondrow").text.strip().split("￥")
        game_info.append([num + 1, name, price[2].strip(), game_evaluation[num], jump_link[num]])
    else:
        price = z.find(class_="col search_price responsive_secondrow").string.strip().split("￥")
        game_info.append([num + 1, name, price[1], game_evaluation[num], jump_link[num]])
    num = num + 1

四、資訊保存

def save_data(game_info):
    save_path = "E:/Steam.csv"      # 保存路徑
    df = pd.DataFrame(game_info, columns=['排行榜', '游戲名字', '目前游戲價格￥', '游戲頁面鏈接', '游戲評價'])
    df.to_csv(save_path, index=0)
    print("檔案保存成功！")

五、主函式

一頁有25個游戲，翻頁鏈接直接在原鏈接后面加 &page= 第幾頁
想爬取多少頁游戲，就把 range（1,11）的11改成多少頁+1

if __name__ == "__main__":
    Game_info = []			# 所有游戲資訊
    Turn_link = []			# 翻頁鏈接
    Jump_link = []          # 游戲詳情頁面鏈接
    Game_evaluation = []    # 游戲好評率和評價
    for i in range(1, 11):
        Turn_link.append("https://store.steampowered.com/search/?filter=globaltopsellers&page=1&os=win" + str("&page=" + str(i)))
        run(Game_info, Jump_link, Game_evaluation, get_text(Turn_link[i-1]))
    save_data(Game_info)

六、全部代碼

num 是排名；run 函式里要呼叫 BeautifulSoup 函式決議

import requests
import pandas as pd
from bs4 import BeautifulSoup
num = 0


def get_text(url):
    try:
        headers = {
            "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/85.0.4183.102 Safari/537.36', 'Accept-Language': 'zh-CN '
        }
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "爬取網站失敗！"


def run(game_info, jump_link, game_evaluation, text):
    soup = BeautifulSoup(text, "html.parser")

    # 游戲評價
    w = soup.find_all(class_="col search_reviewscore responsive_secondrow")
    for u in w:
        if u.span is not None:
            game_evaluation.append(
                u.span["data-tooltip-html"].split("<br>")[0] + "," + u.span["data-tooltip-html"].split("<br>")[-1])
        else:
            game_evaluation.append("暫無評價！")

    # 游戲詳情頁面鏈接
    link_text = soup.find_all("div", id="search_resultsRows")
    for k in link_text:
        b = k.find_all('a')
    for j in b:
        jump_link.append(j['href'])

    # 名字和價格
    global num
    name_text = soup.find_all('div', class_="responsive_search_name_combined")
    for z in name_text:
        # 每個游戲的價格
        name = z.find(class_="title").string.strip()
        # 判斷折扣是否為None，提取價格
        if z.find(class_="col search_discount responsive_secondrow").string is None:
            price = z.find(class_="col search_price discounted responsive_secondrow").text.strip().split("￥")
            game_info.append([num + 1, name, price[2].strip(), game_evaluation[num], jump_link[num]])
        else:
            price = z.find(class_="col search_price responsive_secondrow").string.strip().split("￥")
            game_info.append([num + 1, name, price[1], game_evaluation[num], jump_link[num]])
        num = num + 1


def save_data(game_info):
    save_path = "E:/Steam.csv"
    df = pd.DataFrame(game_info, columns=['排行榜', '游戲名字', '目前游戲價格￥', '游戲評價', '游戲頁面鏈接'])
    df.to_csv(save_path, index=0)
    print("檔案保存成功！")


if __name__ == "__main__":
    Game_info = []          # 游戲全部資訊
    Turn_link = []          # 翻頁鏈接
    Jump_link = []          # 游戲詳情頁面鏈接
    Game_evaluation = []    # 游戲好評率和評價
    for i in range(1, 11):
        Turn_link.append("https://store.steampowered.com/search/?filter=globaltopsellers&page=1&os=win" + str("&page=" + str(i)))
        run(Game_info, Jump_link, Game_evaluation, get_text(Turn_link[i-1]))
    save_data(Game_info)

在這里插入圖片描述

七、總結

以上就是今天要講的內容，本文僅僅簡單介紹了request 和 BeautifulSoup 的使用，

突破點在于：有些頁面資訊的位置不同，能否找出判斷條件

1.心情

初學爬蟲，也是第一次寫博客，對于優化和多執行緒目前沒有太多了解，本文的代碼可能有點繁瑣，不過也能做到輕松爬取資訊，O(∩_∩)O

轉載請註明出處，本文鏈接：https://www.uj5u.com/houduan/244289.html

標籤：python

上一篇：python Django之Web框架本質 (2)

下一篇：python畫圖遇到復數值資料時應該用numpy.abs()函式還是numpy.real()函式

Python爬蟲——爬取Steam熱銷榜游戲資訊