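Table of Contents

Python web scraping: Scrapy (revisited)
- Creating a Scrapy project
- Site-wide data crawling with Scrapy
- The five core components
- Passing data between requests
- Image crawling with Scrapy
  - Directory layout
  - Result
- Using middleware
  - Downloader middleware
  - Middleware case study: NetEase News
- Site-wide crawling with CrawlSpider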

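Python web scraping: Scrapy (revisited)

Creating a Scrapy project

See the linked post.

Site-wide data crawling with Scrapy

Requirement: scrape the names of all the images on 校花網 (URL below).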

http://www.521609.com/meinvxiaohua/

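Approaches:

- Add the URL of every page to the start_urls list (not recommended).
- Send the requests manually (recommended).

Sending a request manually: yield scrapy.Request(url, callback), where callback is the method used to parse the response.

For creating a Scrapy project and pipeline-based persistent storage, see the linked post.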

import scrapy
from meinvNetwork.items import MeinvnetworkItem

class MnspiderSpider(scrapy.Spider):
    name = 'mnSpider'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/meinvxiaohua/']
    # URL template for the follow-up pages; %d is replaced by the page number
    url = 'http://www.521609.com/meinvxiaohua/list12%d.html'
    page_num = 2

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            # The name sits either inside <b> or directly inside the second <a>
            name = li.xpath('./a[2]/b/text() | ./a[2]/text()').extract_first()
            item = MeinvnetworkItem(name=name)
            yield item
        # Manually request the next page until page 11 has been scheduled
        if self.page_num <= 11:
            new_url = self.url % self.page_num
            self.page_num += 1
            yield scrapy.Request(url=new_url, callback=self.parse)

Run the project from the terminal: scrapy crawl mnSpider
Result
[Figure: result screenshot]
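
As a rough illustration of which URLs the spider above ends up requesting by hand, the sketch below expands the same URL template for pages 2 through 11. It only formats strings and touches no network; the helper name build_page_urls is made up for this example.

```python
# Sketch: expand the spider's URL template into the pages it requests manually.
URL_TEMPLATE = 'http://www.521609.com/meinvxiaohua/list12%d.html'


def build_page_urls(first_page=2, last_page=11):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [URL_TEMPLATE % page for page in range(first_page, last_page + 1)]


if __name__ == '__main__':
    for url in build_page_urls():
        print(url)
```

In the spider itself these URLs are never built up front: each call to parse schedules only the next page, so the crawl stops naturally once page_num passes 11.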

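The five core components

[Figure: the five core components]

Engine (Scrapy Engine)

Controls the data flow across the whole system and triggers events; it is the core of the framework.

Scheduler

Receives requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and removes duplicate URLs.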
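
The "priority queue plus duplicate removal" picture can be sketched in a few lines. This is a toy model for intuition only, not Scrapy's actual scheduler; the class name ToyScheduler and the example URLs are invented here. As in Scrapy, a higher priority value is served first.

```python
import heapq
import itertools


class ToyScheduler:
    """Toy model of a scheduler: a priority queue of URLs with duplicate filtering."""

    def __init__(self):
        self._heap = []                  # entries are (-priority, order, url)
        self._seen = set()               # URLs that were already enqueued
        self._order = itertools.count()  # tie-breaker that keeps insertion order

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it is a duplicate and was dropped."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq pops the smallest entry, so negate to serve high priority first
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == '__main__':
    scheduler = ToyScheduler()
    scheduler.enqueue('http://example.com/a')
    scheduler.enqueue('http://example.com/b', priority=5)
    scheduler.enqueue('http://example.com/a')  # duplicate, silently dropped
    print(scheduler.next_url())                # the priority-5 URL comes out first
```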

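Downloader

Downloads page content and returns it to the spiders (via the engine). The Scrapy downloader is built on Twisted, an efficient asynchronous model.

Spiders

Spiders do the main work: they extract the information you need, the items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.

Item Pipeline

Processes the items that the spiders extract from pages. Its main jobs are persisting items, validating them, and dropping unneeded information. After a spider has parsed a page, the items are sent to the item pipeline and processed by several components in a fixed order.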

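Passing data between requests

Use case: the data to be parsed is spread across more than one page (deep crawling), so a partly filled item has to travel with the follow-up request.

For a full example, see the NetEase News case study.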
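
To make the mechanism concrete without needing Scrapy or a network, here is a toy model of it. The Request and Response classes below are simplified stand-ins written for this sketch (they are not scrapy.Request or a Scrapy response), and PAGES is an invented in-memory site. The point is only to show how an item placed in meta on one request is picked up again in the next callback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical in-memory "site": URL -> page data, standing in for HTTP responses.
PAGES = {
    'list': {'title': 'Headline', 'detail_url': 'detail'},
    'detail': {'content': 'Full article text'},
}


@dataclass
class Request:
    """Stand-in for a request: a URL, a callback, and a meta dict."""
    url: str
    callback: Callable
    meta: Dict = field(default_factory=dict)


@dataclass
class Response:
    """Stand-in for a response; meta is carried over from the request."""
    url: str
    data: Dict
    meta: Dict


def parse_list(response):
    # The title lives on the list page, the content on the detail page.
    item = {'title': response.data['title']}
    # Attach the half-filled item to the follow-up request.
    yield Request(response.data['detail_url'], callback=parse_detail, meta={'item': item})


def parse_detail(response):
    item = response.meta['item']  # pick the item back up
    item['content'] = response.data['content']
    yield item


def crawl(start_url):
    """Tiny engine loop: run callbacks, follow Requests, collect finished items."""
    items, queue = [], [Request(start_url, callback=parse_list)]
    while queue:
        request = queue.pop(0)
        response = Response(request.url, PAGES[request.url], request.meta)
        for result in request.callback(response):
            (queue if isinstance(result, Request) else items).append(result)
    return items


if __name__ == '__main__':
    print(crawl('list'))
```

In real Scrapy code the same pattern is written as scrapy.Request(url, callback=..., meta={'item': item}) in one callback and response.meta['item'] in the next.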

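Image crawling with Scrapy

Crawling image data (ImagesPipeline)

How scraping string data differs from scraping image data in Scrapy:

- Strings: parse with XPath and submit the result to a pipeline for persistent storage.

- Images: XPath only yields the value of the image's src attribute; a separate request must then be sent to that address to fetch the image's binary data.

Workflow:

- Parse the data (the image address).
- Submit the item that holds the image address to the designated pipeline class.
- In the pipelines file, define a custom pipeline class based on ImagesPipeline that overrides:
  - def get_media_requests(self, item, info):  # request the image data from the image address
  - def file_path(self, request, response=None, info=None):  # choose the file name the image is stored under
  - def item_completed(self, results, item, info):  # return the item to the next pipeline class to run
- In the settings file:
  - Set the image storage directory: IMAGES_STORE = './img_temp'
  - Enable the pipeline: the custom pipeline class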

Directory layout

[Figure: project directory layout]

img.py

import scrapy
from imgsPro.items import ImgsproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
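        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # Note the pseudo-attribute: the image URL is read from src2, not src
            src = div.xpath('./div/a/img/@src2').extract_first()
            if src is None:
                continue  # skip entries that carry no image URL
            item = ImgsproItem(img_url='https:' + src)
            yield item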
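
The "pseudo-attribute" remark can be tried out offline. The sketch below uses only the standard library rather than Scrapy selectors: it prefers src2 and falls back to src, then resolves the result against the page URL instead of prepending 'https:' by hand. The names ImgUrlParser and extract_img_urls and the sample HTML are assumptions made for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgUrlParser(HTMLParser):
    """Collect image URLs, preferring the lazy-load attribute src2 over src."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attr_map = dict(attrs)
        src = attr_map.get('src2') or attr_map.get('src')
        if src:
            # urljoin turns '//host/a.jpg' or '/a.jpg' into an absolute URL
            self.urls.append(urljoin(self.base_url, src))


def extract_img_urls(html, base_url):
    """Return the absolute image URLs found in html."""
    parser = ImgUrlParser(base_url)
    parser.feed(html)
    return parser.urls


if __name__ == '__main__':
    sample = '<div><img src2="//img.example.com/a.jpg" src="placeholder.gif"></div>'
    print(extract_img_urls(sample, 'https://sc.chinaz.com/tupian/'))
```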

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()
    #pass

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


# class ImgsproPipeline:
#     def process_item(self, item, spider):
#         return item
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class imgsPipeLine(ImagesPipeline):
    # Issue a request for the image address stored in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['img_url'])

    # Return the file name (relative to IMAGES_STORE) the image is saved under
    def file_path(self, request, response=None, info=None):
        imgName = request.url.split('/')[-1]
        return imgName

    # def item_completed(self, results, item, info):
    #     return item  # hand the item to the next pipeline class to run
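
One thing to watch in file_path: splitting the URL on '/' keeps any query string in the file name. A slightly more defensive variant, sketched here with the standard library only, takes the last segment of the URL path instead. The function name image_name_from_url, the fallback name, and the example URLs are assumptions for this sketch.

```python
import posixpath
from urllib.parse import urlsplit


def image_name_from_url(url, default='unnamed.jpg'):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(path)
    # A URL that ends in '/' has no file name, so fall back to the default
    return name or default


if __name__ == '__main__':
    print(image_name_from_url('https://img.example.com/pics/cat.jpg?size=large'))
```

Inside the pipeline this could be used as return image_name_from_url(request.url).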

settings.py


BOT_NAME = 'imgsPro'

SPIDER_MODULES = ['imgsPro.spiders']
NEWSPIDER_MODULE = 'imgsPro.spiders'

LOG_LEVEL = 'ERROR'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'imgsPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'imgsPro.pipelines.imgsPipeLine': 300,
}

# Directory where the downloaded images are stored
IMAGES_STORE = './img_temp'

Result

[Figure: result screenshot]

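Using middleware

Downloader middleware

Position: between the engine and the downloader
Purpose: intercept every request and response in the whole project
Intercepting requests:
- User-Agent spoofing
- Proxy IPs
Intercepting responses: tamper with the response data or replace the response object.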
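
A minimal sketch of the request-interception side might look like this. A downloader middleware is a plain class, so the example runs without Scrapy installed; the class name RandomUaProxyMiddleware and the proxy addresses are placeholders invented here, and the demo at the bottom uses a stand-in request object rather than a real scrapy.Request.

```python
import random
from types import SimpleNamespace

# Hypothetical pools; replace them with real values before use.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
]
PROXIES = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']


class RandomUaProxyMiddleware:
    """Downloader middleware sketch: set a User-Agent and a proxy on each request."""

    def process_request(self, request, spider):
        # UA spoofing: overwrite the User-Agent header
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Proxy IP: Scrapy's built-in HttpProxyMiddleware reads request.meta['proxy']
        request.meta['proxy'] = random.choice(PROXIES)
        return None  # None lets the request continue through the middleware chain


if __name__ == '__main__':
    # Stand-in for a request: just the two attributes the middleware touches
    fake_request = SimpleNamespace(headers={}, meta={})
    RandomUaProxyMiddleware().process_request(fake_request, spider=None)
    print(fake_request.headers['User-Agent'], fake_request.meta['proxy'])
```

Like any downloader middleware, the class only takes effect once it is listed in DOWNLOADER_MIDDLEWARES in the settings file.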

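Middleware case study: NetEase News

https://news.163.com/

Requirement: scrape the news data (title and content) from NetEase News.

1. Parse the NetEase News home page to get the URLs of the five section pages (not dynamically loaded).
2. The news titles in each section are loaded dynamically.
3. Parse the URL of each news item's detail page, fetch the detail page's source, and parse out the news content.

Directory layout
[Figure: project directory layout]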

wangyi.py

import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
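    model_urls = []  # URLs of the five section pages, filled in by parse()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One shared browser instance, used by the downloader middleware.
        # executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object.
        self.bro = webdriver.Chrome(executable_path=r"E:\google\Chrome\Application\chromedriver.exe")

    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3, 4, 6, 7, 8]  # indexes of the five sections to crawl
        for i in alist:
            model_url = li_list[i].xpath('./a/@href').extract_first()
            self.model_urls.append(model_url)
        # Request each section page; the middleware swaps in the rendered response
        for url in self.model_urls:
            yield scrapy.Request(url, callback=self.model_parse)

    def model_parse(self, response):
        # Section pages are loaded dynamically, so this response comes from Selenium
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div[1]/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            if new_detail_url is None:
                continue
            item = WangyiproItem()
            item['title'] = title
            # Request passing: hand the partly filled item to the next callback via meta
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        content = response.xpath('//*[@id="content"]/div[2]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        yield item

    def closed(self, reason):
        # Called by Scrapy when the spider closes: shut the browser down
        self.bro.quit()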

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse
from time import sleep


class WangyiproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.



    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

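    def process_response(self, request, response, spider):
        bro = spider.bro  # the Selenium browser created by the spider
        if request.url in spider.model_urls:
            # Section pages load their news lists dynamically: fetch the page again
            # in the browser and build a new response from the rendered source
            bro.get(request.url)
            sleep(2)  # give the page time to finish loading
            page_text = bro.page_source
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            # Every other response passes through unchanged
            return response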

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class WangyiproPipeline:
    fp = None

    # Called exactly once, when the spider starts
    def open_spider(self, spider):
        print('Spider started...')
        self.fp = open('./wangyi.txt', 'w', encoding='utf-8')

    # Called exactly once, when the spider finishes
    def close_spider(self, spider):
        print('Spider finished!')
        self.fp.close()

    def process_item(self, item, spider):
        # extract_first() may have returned None for the title
        title = item['title'] or ''
        content = item['content']
        self.fp.write(title + content + '\n')
        return item

settings.py

BOT_NAME = 'wangyiPro'

SPIDER_MODULES = ['wangyiPro.spiders']
NEWSPIDER_MODULE = 'wangyiPro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
LOG_LEVEL = 'ERROR'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False


# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'wangyiPro.pipelines.WangyiproPipeline': 300,
}

Result
[Figure: result screenshot]

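Site-wide crawling with CrawlSpider

CrawlSpider is a subclass of Spider.
Ways to crawl a whole site:

- With Spider: send the requests manually
- With CrawlSpider: extract links and parse them through rules

Using CrawlSpider:

Create a project (scrapy startproject XXX) and change into its directory:

cd XXX

Create the spider file (CrawlSpider):

scrapy genspider -t crawl xxx www.xxx.com

Link extractor:
- Purpose: extracts the links that match the specified rule (allow)
Rule:
- Purpose: parses the pages behind the extracted links with the specified callback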
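
To get a feel for what an allow pattern selects, the regular expression can be tried on its own. The sketch below applies the pattern used by this section's example spider to a few made-up hrefs with re.search, which is how allow patterns are matched against URLs. The helper name matching_links and the sample links are assumptions.

```python
import re

# Same pattern as the allow argument of the example spider in this section
ALLOW = re.compile(r'id=1&page=\d+')


def matching_links(hrefs):
    """Keep only the links whose URL matches the allow pattern anywhere in the string."""
    return [href for href in hrefs if ALLOW.search(href)]


if __name__ == '__main__':
    sample = [
        '/political/index/politicsNewest?id=1&page=2',  # pagination link, kept
        '/political/politics/index?id=449762',          # detail link, dropped
    ]
    print(matching_links(sample))
```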

Example:

http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1

sun.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SunSpider(CrawlSpider):
    name = 'sun'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']
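    # Link extractor: pick up every link whose URL matches the allow regex
    link = LinkExtractor(allow=r'id=1&page=\d+')
    rules = (
        # Rule: parse each extracted page with parse_item; follow=True applies
        # the extractor again to the pages reached this way
        Rule(link, callback='parse_item', follow=True),
    )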

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        #return item
        print(response)
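
Why follow=True matters can be shown with a toy crawl over an in-memory site. This is a simplified model written for this post, not CrawlSpider's implementation: SITE, its URLs, and the crawl function are invented, and duplicate filtering is reduced to a set of seen URLs.

```python
import re
from collections import deque

# Hypothetical site: each page URL maps to the links found on that page.
SITE = {
    'list?id=1&page=1': ['list?id=1&page=2', 'detail?id=100'],
    'list?id=1&page=2': ['list?id=1&page=1', 'list?id=1&page=3'],
    'list?id=1&page=3': ['list?id=1&page=2'],
}
ALLOW = re.compile(r'id=1&page=\d+')


def crawl(start_url, follow):
    """Return the pages handed to the rule's callback, in visiting order."""
    seen = {start_url}
    queue = deque([(start_url, True)])  # (url, whether to extract links from it)
    parsed = []
    while queue:
        url, extract = queue.popleft()
        if url != start_url:
            parsed.append(url)          # the rule's callback would run here
        if not extract:
            continue
        for link in SITE.get(url, []):
            if ALLOW.search(link) and link not in seen:
                seen.add(link)
                # With follow=True the new page is itself searched for links
                queue.append((link, follow))
    return parsed


if __name__ == '__main__':
    print(crawl('list?id=1&page=1', follow=False))
    print(crawl('list?id=1&page=1', follow=True))
```

With follow=False only the pagination links visible on the start page are parsed; with follow=True the extractor keeps running on every page it reaches, so page 3 is found through page 2.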