Python Web Scraping Tutorial for Beginners 09: Scraping Meme Images with Multithreading

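Preface

The text and images in this article come from the internet and are for learning and exchange only, with no commercial use. If there is any problem, please contact us promptly so we can handle it.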

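Previous articles

Python Web Scraping Tutorial for Beginners 01: Scraping Douban Top Movies

Python Web Scraping Tutorial for Beginners 02: Scraping Novels

Python Web Scraping Tutorial for Beginners 03: Scraping Second-Hand Housing Data

Python Web Scraping Tutorial for Beginners 04: Scraping Job Postings

Python Web Scraping Tutorial for Beginners 05: Scraping Bilibili Video Danmaku

Python Web Scraping Tutorial for Beginners 06: Making a Word Cloud from Scraped Data

Python Web Scraping Tutorial for Beginners 07: Scraping Tencent Video Danmaku

Python Web Scraping Tutorial for Beginners 08: Scraping CSDN Articles and Saving Them as PDF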


Development environment

Python 3.6
Pycharm
wkhtmltopdf

Modules used

re
requests
concurrent.futures

Install Python and add it to the PATH environment variable, then install the required modules with pip.

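1. Clarifying the requirements

Who chats these days without sending a few memes? Memes are an important tool in conversation and a great way to bring friends closer. When a chat turns awkward, a well-placed meme makes the awkwardness disappear.

This article uses Python to scrape meme images in bulk and keep them for later use.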

2. Analyzing the page data

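On doutula.com, the image data is all contained in <a> tags. We can try requesting the page directly and check whether the returned response also contains the image URLs.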

import requests


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def main(html_url):
    response = get_response(html_url)
    print(response.text)


if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)

Press Ctrl + F in the printed output to search it.
One thing to note: in the result returned when the page is requested with Python, the image URLs appear in these attributes:
data-original="image URL"
data-backup="image URL"

To extract the URLs, you can use the parsel parsing library or regular expressions with re. Earlier articles all used parsel, so this one uses regular expressions.

urls = re.findall('data-original="(.*?)"', response.text)
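
The pattern above relies on a non-greedy group. A minimal sketch of what that means, run against a hypothetical fragment (the `sample_html` string and its example.com URLs are assumptions for illustration; the real markup may differ):

```python
import re

# Hypothetical sample markup; real pages may differ.
sample_html = (
    '<img src="loading.gif" data-original="https://example.com/img/a.jpg" '
    'data-backup="https://example.com/img/a_backup.jpg">'
    '<img src="loading.gif" data-original="https://example.com/img/b.gif" '
    'data-backup="https://example.com/img/b_backup.gif">'
)

# (.*?) is non-greedy: it stops at the first closing quote, so each match
# is a single attribute value.
urls = re.findall(r'data-original="(.*?)"', sample_html)
print(urls)

# A greedy (.*) on the same input runs on to the last quote in the text,
# swallowing both tags into one match.
greedy = re.findall(r'data-original="(.*)"', sample_html)
print(len(urls), len(greedy))
```

With the non-greedy form, each image tag yields its own URL, which is what the loop that saves images expects.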

Complete code for single-page scraping

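import os
import re

import requests


def get_response(html_url):
    """Request a URL with a browser-like User-Agent and return the response."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(image_url, image_name):
    """Download one image and write it into the images directory."""
    image_content = get_response(image_url).content
    # Create the output directory if it does not exist yet
    os.makedirs('images', exist_ok=True)
    filename = os.path.join('images', image_name)
    with open(filename, mode='wb') as f:
        f.write(image_content)
    print(image_name)


def main(html_url):
    """Extract every image URL from one list page and save each image."""
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        # Use the last path segment of the URL as the file name
        image_name = link.split('/')[-1]
        save(link, image_name)


if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)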
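
The code above takes the last path segment of each URL as the file name with `link.split('/')[-1]`. If a URL ever carried a query string, that suffix would end up in the file name. A small sketch using the standard library's `urllib.parse` avoids this; the helper name `filename_from_url` and the example URLs are assumptions for illustration, not part of the original script:

```python
import os
from urllib.parse import urlsplit


def filename_from_url(image_url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(image_url).path
    return os.path.basename(path)


print(filename_from_url('https://example.com/img/2021/a.jpg'))
print(filename_from_url('https://example.com/img/b.gif?size=large'))
```

For URLs without a query string, this gives the same result as splitting on '/'.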

Scraping the whole site's images with multithreading (if you have enough memory)

There are 3,631 pages of data, with every kind of meme you could want.

import concurrent.futures
import os
import re

import requests


def get_response(html_url):
    """Request a URL with a browser-like User-Agent and return the response."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(image_url, image_name):
    """Download one image and write it into the images directory."""
    image_content = get_response(image_url).content
    # Create the output directory if it does not exist yet
    os.makedirs('images', exist_ok=True)
    filename = os.path.join('images', image_name)
    with open(filename, mode='wb') as f:
        f.write(image_content)
    print(image_name)


def main(html_url):
    """Extract every image URL from one list page and save each image."""
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)


if __name__ == '__main__':
    # ThreadPoolExecutor is the thread pool object
    # max_workers is the maximum number of worker threads
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    for page in range(1, 3632):
        url = f'https://www.doutula.com/photo/list/?page={page}'
        # submit adds a task to the thread pool
        executor.submit(main, url)
    # Wait for all submitted tasks to finish
    executor.shutdown()
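
One thing worth knowing about `executor.submit`: it returns a `Future`, and an exception raised inside a worker stays hidden unless the future's result is requested. Below is a minimal sketch of collecting results and errors per page. `fake_page_task` is a hypothetical stand-in for `main`, so the example runs without network access, and `run_pages` is likewise an illustrative name:

```python
import concurrent.futures


def fake_page_task(page):
    """Hypothetical stand-in for main(url): fails on page 3 to show error handling."""
    if page == 3:
        raise ValueError(f'page {page} failed')
    return page * 10


def run_pages(pages, max_workers=3):
    """Run fake_page_task for each page and return (results, errors) dicts."""
    results, errors = {}, {}
    # The with-block calls shutdown(wait=True) automatically on exit.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_page = {executor.submit(fake_page_task, p): p for p in pages}
        for future in concurrent.futures.as_completed(future_to_page):
            page = future_to_page[future]
            try:
                # result() re-raises any exception from the worker thread
                results[page] = future.result()
            except Exception as exc:
                errors[page] = str(exc)
    return results, errors


results, errors = run_pages(range(1, 6))
print(sorted(results.items()))
print(errors)
```

Recording failed pages this way makes it possible to retry only those pages later instead of rerunning the whole crawl.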

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/253394.html
