Python Coroutines, Lesson 4: Downloading mp3 Files from bensound.com

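This is the fourth post in the series on coroutines. Building on the earlier material, it introduces a new asynchronous HTTP library, aiohttp.

To show how aiohttp works together with asyncio, the post presents the code as side-by-side comparisons.

Asynchronous coroutines mainly improve the efficiency of I/O-bound work, so the target is again a site that serves binary files: mp3 files this time rather than images.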

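Getting to know aiohttp, using NetEase Open Courses as the example

aiohttp is an asynchronous HTTP client/server framework built on the asyncio module. The "120 Crawler Examples" column mainly uses its client side to speed up crawling.

Next, we compare it with the requests module.

Fetching NetEase Open Courses 20 times synchronously with requests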

import requests
import time


def get_html():
    res = requests.get("https://open.163.com/")
    print(len(res.text))


start_time = time.perf_counter()
for i in range(20):
    get_html()

print("requests synchronous fetch took:", time.perf_counter() - start_time)
# requests synchronous fetch took: 4.193098181

Fetching NetEase Open Courses 20 times asynchronously with aiohttp + asyncio

import time

import asyncio
import aiohttp


async def get_html():
    async with aiohttp.request('GET', "https://open.163.com/") as res:
        return await res.text()


async def main():
    tasks = [asyncio.ensure_future(get_html()) for i in range(20)]

    dones, pendings = await asyncio.wait(tasks)
    for task in dones:
        print(len(task.result()))


if __name__ == '__main__':
    start_time = time.perf_counter()
    asyncio.run(main())
    print("aiohttp asynchronous fetch took:", time.perf_counter() - start_time)
    # aiohttp asynchronous fetch took: 0.275251032

The result: 20 fetches take about 4.2 s with requests and about 0.28 s with aiohttp, a difference of roughly 15 times.
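The gap comes from overlapping the waiting time of the requests. The following is a minimal sketch of that effect using only the standard library; `fake_fetch` and `DELAY` are hypothetical stand-ins for a real HTTP request and its latency, so no network access is involved.

```python
import asyncio
import time

DELAY = 0.05  # hypothetical latency of one simulated request, in seconds


async def fake_fetch(i):
    # Stand-in for an HTTP request: the coroutine yields while "waiting"
    await asyncio.sleep(DELAY)
    return i


async def run_sequential(n):
    start = time.perf_counter()
    for i in range(n):
        await fake_fetch(i)  # each wait finishes before the next one starts
    return time.perf_counter() - start


async def run_concurrent(n):
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(i) for i in range(n)))  # waits overlap
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential:", asyncio.run(run_sequential(20)))
    print("concurrent:", asyncio.run(run_concurrent(20)))
```

The sequential version pays the delay once per request, while the concurrent version pays it roughly once in total, which is the same pattern as the requests and aiohttp timings above.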

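For a systematic introduction, the official documentation is very clear: https://docs.aiohttp.org/en/stable/. Note that aiohttp is a third-party package, not part of the standard library, so it has to be installed first (for example with pip install aiohttp).

This column only uses aiohttp as a client, so only that part is covered here.

Requesting a page and returning its data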

import aiohttp
import asyncio

async def main():

    async with aiohttp.ClientSession() as session:
        async with session.get("http://httpbin.org/get") as resp:
            print(resp.status)
            print(await resp.text())

asyncio.run(main())

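main() involves two objects. The first is ClientSession, which issues the requests. The second is not named explicitly in the code: it is a ClientResponse (bound to resp), which represents the response.

aiohttp is easiest to learn by comparison with requests. For example, a ClientSession object offers one method per HTTP verb: get, post, put, delete, head, options, and patch. In practice get and post are used most.

If you do not need to keep session state between requests, you can send a request directly with aiohttp.request, as in the code below.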

import aiohttp
import asyncio

async def main():
    async with aiohttp.request("GET", "http://httpbin.org/get") as resp:
        html = await resp.text(encoding="utf-8")
        print(html)


asyncio.run(main())

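The advantage of ClientSession is that one session object, created once, can serve all requests, so nothing has to be set up again for each request.

The code from the start of this post can therefore be rewritten as follows, although the timing barely changes.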

import time

import asyncio
import aiohttp


async def get_html(client):
    async with client.get("https://open.163.com/") as resp:
        return await resp.text()


async def main():
    async with aiohttp.ClientSession() as client:
        tasks = [asyncio.ensure_future(get_html(client)) for i in range(20)]

        dones, pendings = await asyncio.wait(tasks)
        for task in dones:
            print(len(task.result()))


if __name__ == '__main__':
    start_time = time.perf_counter()
    asyncio.run(main())
    print("aiohttp asynchronous fetch took:", time.perf_counter() - start_time)
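One detail of the examples so far: asyncio.wait returns sets of tasks, so iterating over the finished set yields results in no particular order. If the order of results matters, asyncio.gather is one alternative, and iterating over the original task list is another. A small sketch with simulated coroutines (`fake_fetch` is a hypothetical stand-in for a request):

```python
import asyncio


async def fake_fetch(i):
    # Later-submitted coroutines finish first, to make ordering effects visible
    await asyncio.sleep(0.01 * (5 - i))
    return i


async def with_gather():
    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(fake_fetch(i) for i in range(5)))


async def with_wait():
    tasks = [asyncio.create_task(fake_fetch(i)) for i in range(5)]
    done, pending = await asyncio.wait(tasks)
    # done is a set: walk the original list to recover submission order
    return [task.result() for task in tasks]


if __name__ == "__main__":
    print(asyncio.run(with_gather()))
    print(asyncio.run(with_wait()))
```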

To fetch binary data such as images, replace await resp.text() in the code above with await resp.read().
If the target returns JSON, use await resp.json() instead.

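Request parameters in aiohttp

The parameters are nearly the same across request methods, so the list below uses get requests throughout.

params: builds the URL query string; accepted forms are [("var1", 1), ("var2", 2)], {"var1": 1, "var2": 2}, and "var1=1&var2=2";
headers: request headers;
cookies: cookies sent with the request;
data: the request body, used with POST requests, e.g. {"var1": 1, "var2": 2};
timeout: timeout settings;
proxy: proxy settings.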
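The three forms accepted by params all describe the same query string. The sketch below illustrates that with the standard library's urlencode; it is not how aiohttp builds URLs internally, only a way to see the equivalence without sending a request.

```python
from urllib.parse import urlencode

# The same parameters written as a list of pairs, a dict, and a raw string
as_pairs = urlencode([("var1", 1), ("var2", 2)])
as_dict = urlencode({"var1": 1, "var2": 2})
as_string = "var1=1&var2=2"

url = "http://httpbin.org/get?" + as_dict
print(url)
```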

That concludes the introduction. Now for the actual coding.

Writing the bensound crawler

The target site is https://www.bensound.com/royalty-free-music.
The page lists a large number of mp3 files, which this post collects.
Analysis shows that mp3 download URLs look like this:

https://www.bensound.com/bensound-music/bensound-allthat.mp3

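This URL can be assembled from data on the list page: use the browser developer tools to find the cover image URL of a track, shown below, then derive the mp3 link from it with Python string operations.

https://www.bensound.com/bensound-img/allthat.jpg

The conversion code is: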

img_url = "https://www.bensound.com/bensound-img/allthat.jpg"
name = img_url[img_url.rfind("/") + 1:img_url.rfind(".")]

mp3_url = f"https://www.bensound.com/bensound-music/bensound-{name}.mp3"
print(mp3_url)
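The slicing above works for the URL shown. As a slightly more defensive sketch, the path can be parsed first so that a query string appended to the image URL would not confuse the search for the last dot. `cover_to_mp3` is a hypothetical helper name, and the URL pattern is the one described in this post.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def cover_to_mp3(img_url):
    # Hypothetical helper: take only the path part of the URL, then
    # the file name without its extension, e.g. "allthat"
    stem = PurePosixPath(urlsplit(img_url).path).stem
    return f"https://www.bensound.com/bensound-music/bensound-{stem}.mp3"


print(cover_to_mp3("https://www.bensound.com/bensound-img/allthat.jpg"))
```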

With the conversion code in place, first measure how long it takes to fetch 20 list pages with aiohttp.

import time

import asyncio
import aiohttp

from bs4 import BeautifulSoup
import lxml


async def get_html(client, url):
    print("Fetching", url)
    async with client.get(url) as resp:
        html = await resp.text()
        soup = BeautifulSoup(html, 'lxml')
        # Each track's cover image sits inside an element with class "img_mini"
        divs = soup.find_all(attrs={'class': 'img_mini'})
        mp3_urls = [get_mp3_url("https://www.bensound.com/" + div.a.img["src"]) for div in divs]
        return mp3_urls


def get_mp3_url(img_url):
    # File name without its extension, e.g. "allthat"
    name = img_url[img_url.rfind("/") + 1:img_url.rfind(".")]

    mp3_url = f"https://www.bensound.com/bensound-music/bensound-{name}.mp3"
    return mp3_url


async def main(urls):
    async with aiohttp.ClientSession() as client:
        tasks = [asyncio.ensure_future(get_html(client, url)) for url in urls]

        dones, pendings = await asyncio.wait(tasks)
        print("All pages fetched, results:")
        for task in dones:
            print(task.result())


if __name__ == '__main__':
    url_format = "https://www.bensound.com/royalty-free-music/{}"
    urls = [url_format.format(i) for i in range(1, 21)]
    start_time = time.perf_counter()
    asyncio.run(main(urls))
    print("aiohttp asynchronous fetch took:", time.perf_counter() - start_time)

Running the code above prints the list of mp3 URLs found on each page.
The remaining code is now straightforward and largely the same as in the previous post.

import os
import time

import asyncio
import aiohttp

from bs4 import BeautifulSoup
import lxml


async def get_html(client, url):
    print("Fetching", url)
    # Per-request timeout of 5 seconds, overriding the session-wide setting
    async with client.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
        html = await resp.text()
        soup = BeautifulSoup(html, 'lxml')
        # Each track's cover image sits inside an element with class "img_mini"
        divs = soup.find_all(attrs={'class': 'img_mini'})
        mp3_urls = [get_mp3_url("https://www.bensound.com/" + div.a.img["src"]) for div in divs]
        return mp3_urls


def get_mp3_url(img_url):
    # File name without its extension, e.g. "allthat"
    name = img_url[img_url.rfind("/") + 1:img_url.rfind(".")]

    mp3_url = f"https://www.bensound.com/bensound-music/bensound-{name}.mp3"
    return mp3_url


async def get_mp3_file(client, url):
    print("Downloading mp3 file", url)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36",
        "Referer": "https://www.bensound.com/royalty-free-music"
    }
    # Track name taken from the part of the URL after the last "-"
    mp3_file_name = url[url.rfind('-') + 1:url.rfind('.')]
    print(mp3_file_name)
    async with client.get(url, headers=headers) as resp:
        content = await resp.read()
        os.makedirs('./mp3', exist_ok=True)  # make sure the output directory exists
        with open(f'./mp3/{mp3_file_name}.mp3', "wb") as f:
            f.write(content)
        return (url, "success")


async def main(urls):
    timeout = aiohttp.ClientTimeout(total=600)  # total timeout per request: 600 seconds
    connector = aiohttp.TCPConnector(limit=50)  # at most 50 simultaneous connections
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as client:
        tasks = [asyncio.ensure_future(get_html(client, url)) for url in urls]

        dones, pendings = await asyncio.wait(tasks)
        print("All pages fetched, collecting results:")
        all_mp3 = []
        for task in dones:
            all_mp3.extend(task.result())

        total = len(all_mp3)
        print("Found", total, "mp3 files in total")
        print("_" * 100)
        print("Starting mp3 downloads")

        # Download 10 files per batch
        total_page = total // 10 if total % 10 == 0 else total // 10 + 1

        for page in range(0, total_page):
            print("Downloading batch {} of mp3 files".format(page + 1))
            start_page = page * 10
            end_page = (page + 1) * 10
            print("URLs in this batch:")
            print(all_mp3[start_page:end_page])
            mp3_download_tasks = [asyncio.ensure_future(get_mp3_file(client, url)) for url in
                                  all_mp3[start_page:end_page]]
            mp3_dones, mp3_pendings = await asyncio.wait(mp3_download_tasks)
            for task in mp3_dones:
                print(task.result())


if __name__ == '__main__':
    url_format = "https://www.bensound.com/royalty-free-music/{}"
    urls = [url_format.format(i) for i in range(1, 5)]
    start_time = time.perf_counter()
    asyncio.run(main(urls))
    print("aiohttp asynchronous fetch took:", time.perf_counter() - start_time)
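The batch arithmetic in main() (page count, start and end indexes) can also be expressed as a small generator. This is only a sketch of an alternative; `batches` is a hypothetical helper, and the list below holds placeholder strings rather than real URLs.

```python
def batches(items, size):
    # Hypothetical helper: yield consecutive slices of at most `size` items
    for start in range(0, len(items), size):
        yield items[start:start + size]


urls = [f"track-{i}" for i in range(25)]  # placeholder values, not real URLs
sizes = [len(batch) for batch in batches(urls, 10)]
print(sizes)
```

Stepping range() by the batch size handles the final, shorter batch without a separate remainder check.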

Because mp3 files are fairly large, the run above only collects 4 list pages (range(1, 5)).

The code also configures the ClientSession globally:

timeout = aiohttp.ClientTimeout(total=600)  # total timeout per request: 600 seconds
connector = aiohttp.TCPConnector(limit=50)  # at most 50 simultaneous connections

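The connection limit matters because some servers restrict how many parallel TCP connections a single IP may open. aiohttp allows 100 simultaneous connections by default, and the limit can be adjusted by hand.
The timeout is set because aiohttp's default total timeout is 300 s (5 minutes). A request that takes longer than that, such as a large download, is aborted on the client side with a timeout error.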

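Afterword

For the complete code, see the pinned comment in the comment section.

Today is day 244 of 365 of continuous writing.
Follows, likes, comments, and bookmarks are welcome.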
