Python Web Scraping in Practice: Scraping Comic Data from 漫畫之家 with requests and tqdm (Source Code Included) - 有解無憂

Introduction

Today I will show how to scrape comic data with Python, share the code with anyone who needs it, and pass on a few small tips.

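Before scraping, the crawler should look as much like a browser as possible so that it is not identified as a bot. The basic step is to add request headers, but plain-text data like this attracts many crawlers, so we should also consider rotating proxy IPs and randomly changing the request headers when scraping the comic data.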
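As a minimal sketch of the rotation idea described above, the helper below picks a random User-Agent and a random proxy for each request. The User-Agent pool, the proxy addresses, and the name random_request_kwargs are illustrative assumptions and are not part of the original script; replace the placeholders with proxies you actually control.

```python
import random

# Hypothetical pool of User-Agent strings; extend it with other browser strings as needed.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
]

# Placeholder proxy addresses (assumptions); replace with working proxies.
PROXIES = [
    'http://127.0.0.1:8080',
    'http://127.0.0.1:8081',
]


def random_request_kwargs(referer=None):
    """Build keyword arguments for requests.get with a random UA and proxy."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    if referer is not None:
        headers['Referer'] = referer
    proxy = random.choice(PROXIES)
    return {
        'headers': headers,
        'proxies': {'http': proxy, 'https': proxy},
        'timeout': 10,
    }
```

The result can be unpacked straight into a request, for example requests.get(url, **random_request_kwargs(referer=url)), so every call may go out with a different header and proxy combination.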

Before writing any crawler code, the first and most important step is to analyze the target web page.

Crawling also turned out to be fairly slow, so we can additionally speed it up by disabling images, JavaScript, and similar content in Google Chrome.
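Disabling images and JavaScript only applies when pages are loaded through a real browser, for example with Selenium, rather than with requests as in the download code in this article. The following is a minimal sketch under that assumption: the use of Selenium and the helper name make_fast_chrome are not from the original, and the preference keys are the commonly used Chrome content-setting entries.

```python
# Chrome content-setting preferences: the value 2 means "block".
FAST_PREFS = {
    'profile.managed_default_content_settings.images': 2,
    'profile.managed_default_content_settings.javascript': 2,
}


def make_fast_chrome():
    """Start Chrome via Selenium with images and JavaScript blocked.

    Assumes the selenium package and a matching Chrome driver are installed.
    """
    # Imported lazily so FAST_PREFS can be inspected without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option('prefs', FAST_PREFS)
    return webdriver.Chrome(options=options)
```

Blocking JavaScript is only safe when the data you need is already present in the initial HTML; pages that render their content with scripts will come back empty.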

Development tools

Python version: 3.6

Related modules:

requests module

re module

time module

bs4 module

tqdm module

contextlib module

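Environment setup

Install Python and add it to your PATH, then install the required third-party modules with pip, for example: pip install requests beautifulsoup4 lxml tqdm. The re, time, and contextlib modules are part of the standard library and need no installation. The bs4 module comes from the beautifulsoup4 package, and lxml is needed because the code uses it as the BeautifulSoup parser.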

To get the complete code and files for this article, leave a comment.

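Approach

Open the page we want to scrape in a browser.
Press F12 to open the developer tools and find where the comic data we want is located.
Here the page data is all we need.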

Source code structure

Adding a proxy


Comic download code

import os
import re
import time

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# chapter_urls, chapter_names and save_dir are built earlier in the full script (not shown here).

# Download the comic chapter by chapter
for i, url in enumerate(tqdm(chapter_urls)):
    print(i, url)
    download_header = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    # Remove dots from the chapter name before using it as a directory name
    name = chapter_names[i].replace('.', '')
    chapter_save_dir = os.path.join(save_dir, name)
    # Create the chapter directory if it does not exist yet
    os.makedirs(chapter_save_dir, exist_ok=True)
    r = requests.get(url=url, headers=download_header)
    html = BeautifulSoup(r.text, 'lxml')
    # The image names are embedded in the first <script> tag of the page
    script_info = str(html.script)
    pics = re.findall(r'\d{13,14}', script_info)
    # Sort numerically, treating 13-digit names as if padded with a trailing zero.
    # The names themselves stay unchanged, so a 14-digit name ending in 0 is not truncated.
    pics = sorted(pics, key=lambda x: int(x.ljust(14, '0')))
    chapterpic_hou = re.findall(r'\|(\d{5})\|', script_info)[0]
    chapterpic_qian = re.findall(r'\|(\d{4})\|', script_info)[0]
    for idx, pic in enumerate(pics):
        # Use a separate name so the chapter URL in `url` is not overwritten
        pic_url = ('https://images.dmzj.com/img/chapterpic/'
                   + chapterpic_qian + '/' + chapterpic_hou + '/' + pic + '.jpg')
        pic_name = '%03d.jpg' % (idx + 1)
        pic_save_path = os.path.join(chapter_save_dir, pic_name)
        print(pic_url)
        response = requests.get(pic_url, headers=download_header)
        print(response)
        if response.status_code == 200:
            with open(pic_save_path, 'wb') as file:
                file.write(response.content)
        else:
            print('Download failed:', pic_url)
    # Pause between chapters to avoid hammering the server
    time.sleep(2)
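
To make the parsing step in the loop easier to follow, here is a small offline sketch of the same idea: pull the 13- or 14-digit image names out of the script text, sort them as if the 13-digit names were padded with a trailing zero, and build the image URLs. The sample text is invented for illustration and only mimics the pipe-separated shape that the regular expressions expect; build_image_urls is a hypothetical helper name, not part of the original script.

```python
import re

IMAGE_BASE = 'https://images.dmzj.com/img/chapterpic/'


def build_image_urls(script_text):
    """Return image URLs in page order from pipe-separated script text."""
    # Image file names are 13 or 14 digits long.
    pics = re.findall(r'\d{13,14}', script_text)
    # Sort numerically after right-padding 13-digit names to 14 digits.
    pics.sort(key=lambda name: int(name.ljust(14, '0')))
    # The two path segments appear as a 4-digit and a 5-digit token between pipes.
    front = re.findall(r'\|(\d{4})\|', script_text)[0]
    back = re.findall(r'\|(\d{5})\|', script_text)[0]
    return ['%s%s/%s/%s.jpg' % (IMAGE_BASE, front, back, name) for name in pics]


# Invented sample text (an assumption, not real site data).
sample = 'img|3059|14237|15973412345678|1597341234560|jpg'
for url in build_image_urls(sample):
    print(url)
```

Padding only inside the sort key leaves the original names untouched, so nothing has to be stripped again when the URL is assembled.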

Saving the data

Results


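Conclusion

That is all for today's post. If you are interested, give it a try.

If you have questions about this article, or any other questions about Python, leave a comment or send me a private message.

If you liked this post, feel free to follow me or give it a like (/≧▽≦)/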

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/540775.html

Tags: Python
