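Table of contents
- Python crawlers: Scrapy (revisited)
- Creating a Scrapy project
- Site-wide crawling with Scrapy
- The five core components
- Passing data between requests
- Downloading images with Scrapy
- Directory layout
- Result screenshot
- Using middleware
- Downloader middleware
- Middleware case study: NetEase News
- Site-wide crawling with CrawlSpider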
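Python crawlers: Scrapy (revisited)
Creating a Scrapy project
See the earlier post for this step.
Site-wide crawling with Scrapy
Requirement: scrape the names of all the images on the Xiaohua site (校花網)
http://www.521609.com/meinvxiaohua/
There are two ways to do it:
- Add the URL of every page to the start_urls list (not recommended)
- Send the follow-up requests manually (recommended)
Sending a request manually:
yield scrapy.Request(url, callback): the callback is the method that parses the response data
For creating a Scrapy project and pipeline-based persistent storage, see the earlier post.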
import scrapy
from meinvNetwork.items import MeinvnetworkItem


class MnspiderSpider(scrapy.Spider):
    name = 'mnSpider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/meinvxiaohua/']
    # URL template for the paginated listing; %d is replaced by the page number
    url = 'http://www.521609.com/meinvxiaohua/list12%d.html'
    page_num = 2

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            # The name is either inside <b> in the second <a> or directly in that <a>
            name = li.xpath('./a[2]/b/text() | ./a[2]/text()').extract_first()
            item = MeinvnetworkItem(name=name)
            yield item
        # Request pages 2 through 11 manually, reusing parse as the callback
        if self.page_num <= 11:
            new_url = self.url % self.page_num
            self.page_num += 1
            yield scrapy.Request(url=new_url, callback=self.parse)
Run the project from the terminal: scrapy crawl mnSpider
Result screenshot
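The pagination step in the spider can be checked on its own, without running a crawl. The following is a minimal sketch that only reproduces the URL arithmetic: the helper name `listing_urls` is hypothetical, and the page range 2 to 11 is taken from the `page_num` checks in the spider.

```python
# Same template string as the spider's `url` attribute.
URL_TEMPLATE = "http://www.521609.com/meinvxiaohua/list12%d.html"


def listing_urls(first_page=2, last_page=11):
    """Yield the listing URLs for first_page..last_page, inclusive.

    Hypothetical helper: it mirrors `self.url % self.page_num` in the spider.
    """
    for page_num in range(first_page, last_page + 1):
        yield URL_TEMPLATE % page_num


urls = list(listing_urls())
print(len(urls))
print(urls[0])
print(urls[-1])
```

In the spider the same URLs are produced one at a time, because each call to `parse` schedules only the next page.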


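The five core components

Engine (Scrapy)
- Controls the data flow through the whole system and triggers events (the core of the framework)
Scheduler
- Receives the requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is fetched next and drops duplicate URLs.
Downloader
- Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model)
Spiders
- Spiders do the main work: they extract the information you need from specific pages, the so-called items. They can also extract links so that Scrapy goes on to crawl the next page.
Item Pipeline
- Processes the items that the spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a page has been parsed by a spider, its items are sent to the item pipeline and pass through several processing steps in a fixed order.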
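The scheduler description above (a priority queue of URLs that drops duplicates) can be pictured with a toy model. This is only an illustration of the idea, not how Scrapy's real scheduler is implemented: the class name `TinyScheduler` is hypothetical, and the convention that a smaller number is fetched first is an assumption of this sketch.

```python
import heapq
import itertools


class TinyScheduler:
    """Toy model of a scheduler: a priority queue of URLs with de-duplication."""

    def __init__(self):
        self._heap = []                  # entries are (priority, order, url)
        self._seen = set()               # URLs that were already enqueued
        self._order = itertools.count()  # tie-breaker: FIFO within one priority

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it was seen before and is dropped."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = TinyScheduler()
scheduler.enqueue("http://example.com/page/2", priority=1)
scheduler.enqueue("http://example.com/", priority=0)
duplicate_accepted = scheduler.enqueue("http://example.com/")  # already seen
crawl_order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
print(duplicate_accepted, crawl_order)
```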
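Passing data between requests
When to use it: the data you want to parse is not all on one page (deep crawling), so a partly filled item has to travel with the request for the next page.
See the case study: scraping NetEase News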
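Downloading images with Scrapy
Image scraping (ImagesPipeline)
How scraping string data differs from scraping image data in Scrapy:
- Strings: parse them with xpath and submit them to a pipeline for persistent storage
- Images: xpath only yields the value of the image's src attribute; a separate request to that address is needed to fetch the image's binary data
Workflow:
- Parse the data (the image address)
- Submit the item that holds the image address to the designated pipeline class
- In the pipelines file, write a custom pipeline class based on ImagesPipeline:
  - def get_media_requests(self, item, info):  # request the image data from the image address
  - def file_path(self, request, response=None, info=None):  # choose the path (file name) the image is stored under
  - def item_completed(self, results, item, info):  # return the item to the next pipeline class to run
- In the settings file:
  - Set the image storage directory: IMAGES_STORE = './img_temp'
  - Enable the pipeline: the custom pipeline class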
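Two small string-handling steps in this workflow are easy to get wrong: turning a protocol-relative src value into a full URL, and deriving a file name from that URL. The sketch below shows one possible way to do both with the standard library. The helper names and the img.example.com address are hypothetical, and this is an alternative to the string concatenation and `split('/')` used in this article's code, not a description of it.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlsplit


def absolute_image_url(page_url, src):
    """Resolve src (possibly protocol-relative, e.g. //host/a.jpg) against the page URL."""
    return urljoin(page_url, src)


def image_name_from_url(url):
    """Return the last path segment of url, ignoring any query string."""
    return PurePosixPath(urlsplit(url).path).name


page = "https://sc.chinaz.com/tupian/"
img_url = absolute_image_url(page, "//img.example.com/photos/cat_01.jpg?v=2")
img_name = image_name_from_url(img_url)
print(img_url)
print(img_name)
```

Splitting on the URL path rather than on the whole URL keeps a query string such as `?v=2` out of the stored file name.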
Directory layout

img.py
import scrapy
from imgsPro.items import ImgsproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # Note the pseudo-attribute: the image address is read from src2, not src
            img_url = 'https:' + div.xpath('./div/a/img/@src2').extract()[0]
            item = ImgsproItem(img_url=img_url)
            yield item
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline

# class ImgsproPipeline:
#     def process_item(self, item, spider):
#         return item


class imgsPipeLine(ImagesPipeline):
    # Request the image data from the image address
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['img_url'])

    # Choose the file name the image is stored under (relative to IMAGES_STORE)
    def file_path(self, request, response=None, info=None):
        imgName = request.url.split('/')[-1]
        return imgName

    # def item_completed(self, results, item, info):
    #     return item  # return the item to the next pipeline class to run
settings.py
BOT_NAME = 'imgsPro'
SPIDER_MODULES = ['imgsPro.spiders']
NEWSPIDER_MODULE = 'imgsPro.spiders'
LOG_LEVEL = 'ERROR'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'imgsPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgsPro.pipelines.imgsPipeLine': 300,
}
# Directory the images are stored in
IMAGES_STORE = './img_temp'
Result screenshot

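Using middleware
Downloader middleware
- Position: between the engine and the downloader
- Purpose: intercept every request and response in the whole project in one place
- Intercepting requests:
  - User-Agent spoofing
  - Proxy IPs
- Intercepting responses: tamper with the response data or replace the response object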
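As a rough illustration of request interception, the sketch below sets a User-Agent header and a proxy in `process_request`. The class name and the proxy address are hypothetical placeholders, and the two User-Agent strings are simply the ones that appear in this article's settings files. Scrapy's built-in proxy handling reads `request.meta['proxy']`. A stand-in request object is used here so that the sketch runs without starting a crawl.

```python
import random
from types import SimpleNamespace

# Hypothetical pools; real values would come from your own configuration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 "
    "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
]
PROXIES = ["http://127.0.0.1:8888"]  # placeholder address, not a real proxy


class UaProxyDownloaderMiddleware:
    """Hypothetical downloader middleware: UA spoofing plus a proxy per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # None lets the request continue to the downloader


# Stand-in for a Scrapy request: only the attributes the middleware touches.
fake_request = SimpleNamespace(url="https://example.com/", headers={}, meta={})
outcome = UaProxyDownloaderMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
print(fake_request.meta["proxy"])
```

To take effect in a project, such a class would still have to be registered in DOWNLOADER_MIDDLEWARES in the settings file.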
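Middleware case study: NetEase News
https://news.163.com/
Requirement: scrape the news data (title and content) from NetEase News
- 1. Parse the URLs of the five section pages from the NetEase News home page (this part is not dynamically loaded)
- 2. The news titles in each section are loaded dynamically
- 3. Parse the URL of each news detail page, fetch that page's source, and parse out the news content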
Directory layout

wangyi.py
import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
    model_urls = []  # URLs of the five section pages

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Selenium 3 style call; Selenium 4 takes the driver path through a Service object instead
        self.bro = webdriver.Chrome(executable_path=r"E:\google\Chrome\Application\chromedriver.exe")

    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3, 4, 6, 7, 8]  # positions of the five sections in the navigation list
        for i in alist:
            model_url = li_list[i].xpath('./a/@href').extract_first()
            self.model_urls.append(model_url)
        for url in self.model_urls:
            yield scrapy.Request(url, callback=self.model_parse)

    def model_parse(self, response):
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div[1]/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            if new_detail_url is None:
                continue
            item = WangyiproItem()
            item['title'] = title
            # Pass the half-filled item on to the detail-page callback through meta
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        content = response.xpath('//*[@id="content"]/div[2]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        yield item

    def closed(self, reason):
        # Called once when the spider closes: shut the browser down
        self.bro.quit()
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()
middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from time import sleep

from scrapy.http import HtmlResponse


class WangyiproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        bro = spider.bro  # the browser object created by the spider
        # Only the five section pages are loaded dynamically and need a rendered replacement
        if request.url in spider.model_urls:
            bro.get(request.url)
            sleep(2)  # give the page time to load its dynamic content
            page_text = bro.page_source
            # Build a new response from the rendered page source to replace the original one
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class WangyiproPipeline:
    fp = None

    # Called only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started...')
        self.fp = open('./wangyi.txt', 'w', encoding='utf-8')

    # Called only once, when the spider finishes
    def close_spider(self, spider):
        print('Spider finished!')
        self.fp.close()

    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        self.fp.write(title + content + '\n')
        return item
settings.py
BOT_NAME = 'wangyiPro'
SPIDER_MODULES = ['wangyiPro.spiders']
NEWSPIDER_MODULE = 'wangyiPro.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
LOG_LEVEL = 'ERROR'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wangyiPro.pipelines.WangyiproPipeline': 300,
}
Result screenshot

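Site-wide crawling with CrawlSpider
CrawlSpider is a subclass of Spider
Ways to crawl a whole site:
- With Spider: send the follow-up requests manually
- With CrawlSpider: let a link extractor and rules find and follow the links
Using CrawlSpider:
- Create a project, then change into its directory:
cd XXX
- Create the spider file (CrawlSpider):
scrapy genspider -t crawl xxx www.xxx.com
- Link extractor:
  - Purpose: extracts the links that match the given rule (allow)
- Rule:
  - Purpose: parses the pages behind the links that the link extractor found, using the given callback
Example:
http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1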
sun.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']
    # Link extractor: allow is a regular expression that a link's URL must match
    link = LinkExtractor(allow=r'id=1&page=\d+')
    rules = (
        # Rule: parse every extracted page with parse_item;
        # follow=True keeps extracting links from those pages as well
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        #return item
        print(response)
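The allow pattern is an ordinary regular expression that is searched for in each link's URL, so it can be tried out with the `re` module before running the spider. The sketch below does only that; the candidate URLs other than the pagination links are hypothetical, and it does not reproduce the other filtering a link extractor performs.

```python
import re

# Same pattern as the LinkExtractor's allow argument in the spider.
allow = re.compile(r"id=1&page=\d+")

candidates = [
    "http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2",
    "http://wz.sun0769.com/political/index/politicsNewest?id=1&page=10",
    "http://wz.sun0769.com/political/index/politicsNewest?id=2&page=3",  # hypothetical
    "http://wz.sun0769.com/political/politics/index?id=123",             # hypothetical
]

# A link is kept when the pattern is found anywhere in its URL.
followed = [url for url in candidates if allow.search(url)]
print(followed)
```

Because the pattern is searched rather than anchored, anything in the URL before `id=1` is ignored; a stricter pattern would be needed if other parameters ending in `id=1` could appear.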
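Because the site has changed its technology, only 10 pages of data could be displayed (and my IP ended up banned).
I am still learning and cannot solve this for now.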
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/272891.html