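Preface
The text and images in this article come from the internet and are intended for learning and discussion only, not for any commercial use. If there is a problem, please contact us promptly so we can deal with it.
Free online video tutorials on Python web scraping, data analysis, website development, and other case studies: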
https://space.bilibili.com/523606542
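Basic development environment
- Python 3.6
- PyCharm
Target page analysis
The target site for this article is fabiaoqing (www.fabiaoqing.com), a meme image site.
It is a static site: all of the data sits in div tags, so it is not hard to scrape.
All we need to do is pull each image's URL and title out of those tags.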
Plain scraper implementation
import os
import re

import parsel
import requests


def change_title(title):
    """Replace characters that are not allowed in file names with underscores."""
    pattern = re.compile(r'[/\\:*?"<>|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, '_', title)
    return new_title


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
os.makedirs('img', exist_ok=True)  # make sure the output folder exists

for page in range(0, 201):
    url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    response = requests.get(url=url, headers=headers)
    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        title = div.css('a img::attr(title)').get()
        if not img_url or not title:
            continue  # skip blocks that do not hold an image
        suffix = '.' + img_url.split('.')[-1]  # keep the dot so the file gets a real extension
        new_title = change_title(title) + suffix
        img_content = requests.get(url=img_url, headers=headers).content
        path = os.path.join('img', new_title)
        with open(path, mode='wb') as f:
            f.write(img_content)
        print(title)
A brief walkthrough of the code:
1. Replacing characters in the title. Some image titles contain special characters that cannot be used in a file name, so a regular expression replaces every character that could cause trouble with an underscore.
def change_title(title):
    pattern = re.compile(r'[/\\:*?"<>|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, '_', title)
    return new_title
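To make the effect of the substitution concrete, here is a minimal sketch that builds a file name from a title and an image URL. The helper `build_filename`, the sample title, and the example.com URL are all made up for illustration; only `change_title` comes from the article.

```python
import os
import re
from urllib.parse import urlparse

INVALID_CHARS = re.compile(r'[/\\:*?"<>|]')  # characters Windows rejects in file names


def change_title(title):
    """Replace every invalid character with an underscore."""
    return INVALID_CHARS.sub('_', title)


def build_filename(title, img_url):
    """Hypothetical helper: cleaned title plus the extension taken from the URL."""
    extension = os.path.splitext(urlparse(img_url).path)[1]  # includes the dot, e.g. '.jpg'
    return change_title(title) + extension


# Sample title and URL are invented for this example.
filename = build_filename('what?! <really>', 'https://example.com/pics/abc123.jpg')
print(filename)  # what_! _really_.jpg
```

Taking the extension with `os.path.splitext` keeps the leading dot, so the saved file ends in `.jpg` rather than `jpg`.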
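2. Pagination and browser-like requests
for page in range(0, 201):
    url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
For pagination, click "next page" a few times and watch how the URL changes; the pattern is easy to spot. The site is served over GET, so requests.get is all that is needed. Passing a headers dict with a User-Agent makes the request look like it comes from a browser; without it the site can tell that a Python script is making the request. For this particular site, though, it makes little difference either way.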
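The pagination rule can be sketched on its own, without touching the network. The helper name `page_urls` is an assumption introduced here; the URL template and the User-Agent string are the ones used in the article.

```python
BASE_URL = 'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'

# Browser-like User-Agent; the exact string is only an example.
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}


def page_urls(first, last):
    """Hypothetical helper: list-page URLs for pages first..last, inclusive."""
    return [BASE_URL.format(page=page) for page in range(first, last + 1)]


urls = page_urls(1, 3)
for url in urls:
    print(url)
# Each URL would then be fetched with requests.get(url, headers=HEADERS).
```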
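3. Parsing the page and extracting the data
selector = parsel.Selector(response.text)
divs = selector.css('.tagbqppdiv')
for div in divs:
    img_url = div.css('a img::attr(data-original)').get()
    title = div.css('a img::attr(title)').get()
    suffix = '.' + img_url.split('.')[-1]
    new_title = change_title(title) + suffix
Here we use the parsel parsing library and query the page with CSS selectors.
In other words, the data is extracted according to the tags and their attributes.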
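To show what "extracting by tag attribute" means without needing the site or any third-party package, the sketch below uses Python's standard-library html.parser as a stand-in for parsel. It is not the article's method, and the HTML fragment is a hypothetical approximation of one list page; the real markup may differ. With parsel, the same result comes from the two css() calls shown above.

```python
from html.parser import HTMLParser

# Hypothetical fragment shaped like one list page; the real markup may differ.
SAMPLE_HTML = """
<div class="tagbqppdiv">
  <a href="/biaoqing/detail/1.html">
    <img class="lazy" data-original="https://example.com/a.jpg" title="first meme">
  </a>
</div>
<div class="tagbqppdiv">
  <a href="/biaoqing/detail/2.html">
    <img class="lazy" data-original="https://example.com/b.gif" title="second meme">
  </a>
</div>
"""


class ImageCollector(HTMLParser):
    """Collect (data-original, title) from every <img> inside div.tagbqppdiv."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth of <div>s, counted from a target div
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            classes = (attrs.get('class') or '').split()
            if self.depth or 'tagbqppdiv' in classes:
                self.depth += 1
        elif tag == 'img' and self.depth:
            self.images.append((attrs.get('data-original'), attrs.get('title')))

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


collector = ImageCollector()
collector.feed(SAMPLE_HTML)
for img_url, title in collector.images:
    print(title, img_url)
```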
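4. Saving the data
img_content = requests.get(url=img_url, headers=headers).content
path = 'img\\' + new_title
with open(path, mode='wb') as f:
    f.write(img_content)
print(title)
Request the image URL and take the binary data from the response's content attribute. Images, videos, and other files are all saved as binary data; for plain text you would read the text attribute instead.
path is where the file is saved. Because the data is binary, the file is opened in wb mode.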
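The save step can be tried in isolation. This is a minimal sketch under stated assumptions: `save_bytes` is a hypothetical helper, the bytes stand in for `response.content`, and a temporary directory is used so nothing is left on disk.

```python
import tempfile
from pathlib import Path


def save_bytes(folder, filename, data):
    """Hypothetical helper: write raw bytes to folder/filename and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    path = folder / filename
    with open(path, mode='wb') as f:  # 'wb' writes bytes as-is, with no text encoding
        f.write(data)
    return path


# Stand-in for response.content: a few bytes that start like a GIF header.
fake_image = b'GIF89a\x01\x00\x01\x00'

with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(Path(tmp) / 'img', 'demo.gif', fake_image)
    round_trip = saved.read_bytes()
    print(saved.name, len(round_trip), 'bytes')
```

Building the path with pathlib instead of a hard-coded 'img\\' prefix also lets the same code run on Linux and macOS.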
Multithreaded scraper implementation
import concurrent.futures
import os
import re

import parsel
import requests


def get_response(html_url):
    """Request a URL with browser-like headers and return the response."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def change_title(title):
    """Replace characters that are invalid in file names with underscores."""
    pattern = re.compile(r'[/\\:*?"<>|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, '_', title)
    return new_title


def save(img_url, title):
    """Download one image and save it in the local img folder."""
    img_content = get_response(img_url).content
    path = os.path.join('img', title)
    with open(path, mode='wb') as f:
        f.write(img_content)
    print(title)


def main(html_url):
    """Parse one list page and save every image on it."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        title = div.css('a img::attr(title)').get()
        suffix = '.' + img_url.split('.')[-1]  # keep the dot before the extension
        new_title = change_title(title) + suffix
        save(img_url, new_title)


if __name__ == '__main__':
    os.makedirs('img', exist_ok=True)  # make sure the output folder exists
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
    for page in range(0, 201):
        url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
        executor.submit(main, url)
    executor.shutdown()
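A brief explanation of the code:
The groundwork was laid in the previous section. The multithreaded scraper wraps each step in its own function, so that every piece of code has one job, and then launches the work through the thread pool.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
max_workers sets the maximum number of threads in the pool.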
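How the pool hands out work can be seen with a stand-in task. In this sketch `fake_crawl` is hypothetical and sleeps instead of downloading a page, so it runs offline; the pool setup matches the article's `max_workers=5`.

```python
import concurrent.futures
import threading
import time


def fake_crawl(page):
    """Stand-in for main(url): pretend to download one page."""
    time.sleep(0.05)  # simulated network wait
    return page, threading.current_thread().name


# The with-block calls shutdown() automatically and waits for all tasks.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fake_crawl, page) for page in range(1, 11)]
    # result() returns the value, or re-raises any exception from the worker.
    results = [future.result() for future in futures]

pages = [page for page, _ in results]
workers = {name for _, name in results}
print(pages)
print(len(workers), 'worker threads used')
```

Calling `result()` on each future matters in practice: an exception raised inside a submitted task stays hidden unless the future's result is requested.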
Scrapy framework implementation
I won't go into creating a Scrapy project here; an earlier article covers it in detail:
"A simple use of the Scrapy framework to batch-collect website data"
The project also contains items.py, middlewares.py and pipelines.py: the spider imports BiaoqingbaoItem from items.py, and settings.py registers BiaoqingbaoDownloaderMiddleware from middlewares.py and DownloadPicturePipeline from pipelines.py.
settings.py
BOT_NAME = 'biaoqingbao'

SPIDER_MODULES = ['biaoqingbao.spiders']
NEWSPIDER_MODULE = 'biaoqingbao.spiders'

DOWNLOADER_MIDDLEWARES = {
    'biaoqingbao.middlewares.BiaoqingbaoDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'biaoqingbao.pipelines.DownloadPicturePipeline': 300,
}

IMAGES_STORE = './images'
biaoqing (the spider)
import scrapy

from ..items import BiaoqingbaoItem


class BiaoqingSpider(scrapy.Spider):
    name = 'biaoqing'
    allowed_domains = ['fabiaoqing.com']
    start_urls = [f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html' for page in range(1, 201)]

    def parse(self, response):
        # Each matching div holds one image; yield its URL and title as an item.
        divs = response.css('#bqb div.ui.segment.imghover div')
        for div in divs:
            img_url = div.css('a img::attr(data-original)').get()
            title = div.css('a img::attr(title)').get()
            yield BiaoqingbaoItem(img_url=img_url, title=title)
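Brief summary:
The biggest difference among the three programs is crawl speed. Measured by the time it takes to write the code, however, the plain version is the simplest: a static site like this needs essentially no debugging, so it can be written straight through from top to bottom, and with blank lines included it comes to only about 30 lines of code.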
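The speed difference between the plain loop and the thread pool can be illustrated with simulated requests. This is only a sketch: `fake_download` sleeps for a fixed 0.05 s instead of contacting the site, so the timings show the shape of the difference, not real crawl speeds.

```python
import concurrent.futures
import time


def fake_download(page):
    """Simulated request: sleep instead of touching the network."""
    time.sleep(0.05)
    return page


def run_sequential(pages):
    """Time a plain for-loop over all pages."""
    start = time.perf_counter()
    for page in pages:
        fake_download(page)
    return time.perf_counter() - start


def run_threaded(pages, max_workers=5):
    """Time the same work spread over a thread pool."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(fake_download, pages))
    return time.perf_counter() - start


pages = range(1, 11)
sequential = run_sequential(pages)
threaded = run_threaded(pages)
print(f'sequential: {sequential:.2f}s, threaded: {threaded:.2f}s')
```

Threads help here because the work is dominated by waiting, which is also the case for real network requests.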
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/248868.html

