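Preface
Today I'll walk you through scraping comic images with Python, step by step, and show how to deal with the problems that come up along the way, such as anti-scraping measures and dynamically loaded content.
Key topics:
- tqdm
- requests
- BeautifulSoup
- Multithreading
- JavaScript dynamic loading
Development environment:
- Python 3.6
- PyCharm
Target URL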
https://www.dmzj.com/info/yaoshenji.html
Code
Imports
import requests
import os
import re
from bs4 import BeautifulSoup
from contextlib import closing
from tqdm import tqdm
import time
Get the chapter links and chapter names
r = requests.get(url=target_url)
bs = BeautifulSoup(r.text, 'lxml')
# All chapter links sit inside <ul class="list_con_li">
list_con_li = bs.find('ul', class_="list_con_li")
cartoon_list = list_con_li.find_all('a')
chapter_names = []
chapter_urls = []
for cartoon in cartoon_list:
    href = cartoon.get('href')
    name = cartoon.text
    # insert(0, ...) reverses the order in which the links appear on the page
    chapter_names.insert(0, name)
    chapter_urls.insert(0, href)
print(chapter_urls)
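To try the chapter-list logic without requesting the site, here is a minimal sketch that runs the same idea on a small hand-written HTML snippet. The snippet, its example.com links, and the names ChapterListParser and parse_chapters are made up for illustration, and the real page's markup may differ. The standard library's html.parser stands in for BeautifulSoup here only so that the example has no third-party dependencies.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating the chapter list; the real page may differ.
SAMPLE_HTML = """
<ul class="list_con_li">
  <li><a href="https://example.com/chapter-3.html">Chapter 3</a></li>
  <li><a href="https://example.com/chapter-2.html">Chapter 2</a></li>
  <li><a href="https://example.com/chapter-1.html">Chapter 1</a></li>
</ul>
"""


class ChapterListParser(HTMLParser):
    """Collect (name, href) pairs for links inside <ul class="list_con_li">."""

    def __init__(self):
        super().__init__()
        self.in_list = False       # True while inside the target <ul>
        self.current_href = None   # href of the <a> currently being read
        self.current_text = []     # text pieces of that <a>
        self.chapters = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'ul' and 'list_con_li' in (attrs.get('class') or '').split():
            self.in_list = True
        elif tag == 'a' and self.in_list:
            self.current_href = attrs.get('href')
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_href is not None:
            name = ''.join(self.current_text).strip()
            self.chapters.append((name, self.current_href))
            self.current_href = None
        elif tag == 'ul':
            self.in_list = False


def parse_chapters(html):
    parser = ChapterListParser()
    parser.feed(html)
    # Reverse the page order, which is what insert(0, ...) does in the code above
    return list(reversed(parser.chapters))


chapters = parse_chapters(SAMPLE_HTML)
chapter_names = [name for name, _ in chapters]
chapter_urls = [href for _, href in chapters]
print(chapter_names)
```

With the sample above, the chapters come out in the order Chapter 1, Chapter 2, Chapter 3, the reverse of how they appear in the markup.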
Download the comic
for i, url in enumerate(tqdm(chapter_urls)):
    print(i, url)
    # Send the chapter page as the Referer, plus a browser User-Agent, with every image request
    download_header = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    # Strip '.' from the chapter name before using it as a directory name
    name = chapter_names[i].replace('.', '')
    chapter_save_dir = os.path.join(save_dir, name)
    os.makedirs(chapter_save_dir, exist_ok=True)
    r = requests.get(url=url)
    html = BeautifulSoup(r.text, 'lxml')
    # The page builds its image list in JavaScript, so read the IDs from the first <script> tag
    script_info = str(html.script)
    pics = re.findall(r'\d{13,14}', script_info)
    # IDs have 13 or 14 digits: sort by the value padded to 14 digits but keep the original
    # strings, so a genuine 14-digit ID ending in '0' is not truncated by mistake
    pics = sorted(pics, key=lambda x: int(x.ljust(14, '0')))
    # The two directory segments of the image URL
    chapterpic_hou = re.findall(r'\|(\d{5})\|', script_info)[0]
    chapterpic_qian = re.findall(r'\|(\d{4})\|', script_info)[0]
    for idx, pic in enumerate(pics):
        pic_url = ('https://images.dmzj.com/img/chapterpic/' + chapterpic_qian + '/'
                   + chapterpic_hou + '/' + pic + '.jpg')
        pic_name = '%03d.jpg' % (idx + 1)
        pic_save_path = os.path.join(chapter_save_dir, pic_name)
        print(pic_url)
        response = requests.get(pic_url, headers=download_header)
        # with closing(requests.get(pic_url, headers=download_header, stream=True)) as response:
        # chunk_size = 1024
        # content_size = int(response.headers['content-length'])
        print(response)
        if response.status_code == 200:
            with open(pic_save_path, "wb") as file:
                # for data in response.iter_content(chunk_size=chunk_size):
                file.write(response.content)
        else:
            print('Bad link')
    # Pause before requesting the next chapter
    time.sleep(2)
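The regular expressions in the download step are the least obvious part, so here is a small offline sketch of just that logic. SAMPLE_SCRIPT is a made-up string in the shape those patterns expect (pipe-separated tokens inside a script), not text copied from the site, and build_image_urls is a hypothetical helper name. The sketch sorts by a zero-padded key rather than padding the IDs and stripping the zero again later.

```python
import re

BASE = 'https://images.dmzj.com/img/chapterpic/'

# Made-up script text for illustration; the real page's script may look different.
SAMPLE_SCRIPT = (
    "eval(function(p,a,c,k,e,d){...}('...',"
    "'|jpg|3059|14237|15017191020057|15017191017739|1501719101923|chapterpic|img|'"
    ".split('|')))"
)


def build_image_urls(script_text):
    """Rebuild the image URLs from the tokens embedded in the script text."""
    # Image IDs are runs of 13 or 14 digits
    ids = re.findall(r'\d{13,14}', script_text)
    # Compare IDs as if all had 14 digits, without altering the IDs themselves
    ids.sort(key=lambda s: int(s.ljust(14, '0')))
    # First path segment: a 4-digit token between pipes
    qian = re.findall(r'\|(\d{4})\|', script_text)[0]
    # Second path segment: a 5-digit token between pipes
    hou = re.findall(r'\|(\d{5})\|', script_text)[0]
    return [f'{BASE}{qian}/{hou}/{pic}.jpg' for pic in ids]


urls = build_image_urls(SAMPLE_SCRIPT)
for u in urls:
    print(u)
```

For the sample string this yields three URLs under 3059/14237/, ordered 15017191017739, 1501719101923, 15017191020057. If the site changes how it packs the script, the patterns would need to be adjusted.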
Create the save directory
# The steps above use save_dir and target_url, so run these lines before them
save_dir = '妖神記'
if save_dir not in os.listdir('./'):
    os.mkdir(save_dir)
target_url = "https://www.dmzj.com/info/yaoshenji.html"
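The topic list mentions multithreading, while the download loop fetches images one at a time. As a rough sketch of how the per-image work could be spread over a few threads, the example below uses concurrent.futures.ThreadPoolExecutor from the standard library. download_all and fake_fetch are hypothetical names, and fake_fetch replaces the network call so the example runs offline. In real use, keep max_workers small so the site is not flooded with requests.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def download_all(jobs, fetch, max_workers=4):
    """Run every (url, path) job on a thread pool; return the paths that were saved."""

    def download_one(job):
        url, path = job
        data = fetch(url)  # expected to return bytes, or None on failure
        if data is None:
            return None
        with open(path, 'wb') as f:
            f.write(data)
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the jobs
        results = list(pool.map(download_one, jobs))
    return [p for p in results if p is not None]


def fake_fetch(url):
    # Stand-in for the network call. With requests it could look like:
    #     r = requests.get(url, headers=download_header, timeout=10)
    #     return r.content if r.status_code == 200 else None
    return None if url.endswith('missing.jpg') else url.encode('utf-8')


with tempfile.TemporaryDirectory() as tmp:
    names = ('001.jpg', '002.jpg', 'missing.jpg')
    jobs = [(f'https://example.com/{n}', os.path.join(tmp, n)) for n in names]
    saved = download_all(jobs, fake_fetch)
    saved_names = [os.path.basename(p) for p in saved]
    print(saved_names)
```

Passing the fetch function in as an argument keeps the threading logic separate from the HTTP details, which also makes it easy to test without a network connection.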
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/17474.html
