Scrapy作為爬蟲的進階內容,可以實作多執行緒爬取目標內容,簡化代碼邏輯,提高開發效率,深受爬蟲開發者的喜愛,本文主要以爬取某股票網站為例,簡述如何通過Scrapy實作爬蟲,僅供學習分享使用,如有不足之處,還請指正,
什么是Scrapy?
Scrapy是用python實作的一個為了爬取網站資料,提取結構性資料而撰寫的應用框架,使用Twisted高效異步網路框架來處理網路通信,Scrapy架構:

關于Scrapy架構各項說明,如下所示:
- ScrapyEngine:引擎,負責控制資料流在系統中所有組件中流動,并在相應動作發生時觸發事件, 此組件相當于爬蟲的“大腦”,是 整個爬蟲的調度中心,
- Schedule:調度器,接收從引擎發過來的requests,并將他們入隊,初始爬取url和后續在頁面里爬到的待爬取url放入調度器中,等待被爬取,調度器會自動去掉重復的url,
- Downloader:下載器,負責獲取頁面資料,并提供給引擎,而后提供給spider,
- Spider:爬蟲,用戶編些用于分析response并提取item和額外跟進的url,將額外跟進的url提交給ScrapyEngine,加入到Schedule中,將每個spider負責處理一個特定(或 一些)網站,
- ItemPipeline:負責處理被spider提取出來的item,當頁面被爬蟲決議所需的資料存入Item后,將被發送到Pipeline,并經過設定好次序
- DownloaderMiddlewares:下載中間件,是在引擎和下載器之間的特定鉤子(specific hook),處理它們之間的請求(request)和回應(response),提供了一個簡單的機制,通過插入自定義代碼來擴展Scrapy功能,通過設定DownloaderMiddlewares來實作爬蟲自動更換user-agent,IP等,
- SpiderMiddlewares:Spider中間件,是在引擎和Spider之間的特定鉤子(specific hook),處理spider的輸入(response)和輸出(items或requests),提供了同樣簡單機制,通過插入自定義代碼來擴展Scrapy功能,
Scrapy資料流:
- ScrapyEngine打開一個網站,找到處理該網站的Spider,并向該Spider請求第一個(批)要爬取的url(s);
- ScrapyEngine向調度器請求第一個要爬取的url,并加入到Schedule作為請求以備調度;
- ScrapyEngine向調度器請求下一個要爬取的url;
- Schedule回傳下一個要爬取的url給ScrapyEngine,ScrapyEngine通過DownloaderMiddlewares將url轉發給Downloader;
- 頁面下載完畢,Downloader生成一個頁面的Response,通過DownloaderMiddlewares發送給ScrapyEngine;
- ScrapyEngine從Downloader中接收到Response,通過SpiderMiddlewares發送給Spider處理;
- Spider處理Response并回傳提取到的Item以及新的Request給ScrapyEngine;
- ScrapyEngine將Spider回傳的Item交給ItemPipeline,將Spider回傳的Request交給Schedule進行從第二步開始的重復操作,直到調度器中沒有待處理的Request,ScrapyEngine關閉,
Scrapy安裝
在命令列模式下,通過pip install scrapy命令進行安裝Scrapy,如下所示:

當出現以下提示資訊時,表示安裝成功

Scrapy創建專案
在命令列模式下,切換到專案存放目錄,通過scrapy startproject stockstar 創建爬蟲專案,如下所示:

根據提示,通過提供的模板,創建爬蟲【命令格式:scrapy genspider 爬蟲名稱 域名】,如下所示:

注意:爬蟲名稱,不能跟專案名稱一致,否則會報錯,如下所示:

通過Pycharm打開新創建的scrapy專案,如下所示:

爬取目標
本例主要爬取某證券網站行情中心股票ID與名稱資訊,如下所示:

Scrapy爬蟲開發
通過命令列創建專案后,基本Scrapy爬蟲框架已經形成,剩下的就是業務代碼填充,
item項定義
定義需要爬取的欄位資訊,如下所示:
1 class StockstarItem(scrapy.Item): 2 """ 3 定義需要爬取的欄位名稱 4 """ 5 # define the fields for your item here like: 6 # name = scrapy.Field() 7 stock_type = scrapy.Field() # 股票型別 8 stock_id = scrapy.Field() # 股票ID 9 stock_name = scrapy.Field() # 股票名稱
定制爬蟲邏輯
Scrapy的爬蟲結構是固定的,定義一個類,繼承自scrapy.Spider,類中定義屬性【爬蟲名稱,域名,起始url】,重寫父類方法【parse】,根據需要爬取的頁面邏輯不同,在parse中定制不同的爬蟲代碼,如下所示:
1 class StockSpider(scrapy.Spider): 2 name = 'stock' 3 allowed_domains = ['quote.stockstar.com'] # 域名 4 start_urls = ['http://quote.stockstar.com/stock/stock_index.htm'] # 啟動的url 5 6 def parse(self, response): 7 """ 8 決議函式 9 :param response: 10 :return: 11 """ 12 item = StockstarItem() 13 styles = ['滬A', '滬B', '深A', '深B'] 14 index = 0 15 for style in styles: 16 print('********************本次抓取' + style[index] + '股票********************') 17 ids = response.xpath( 18 '//div[@]/div[@]/div[@]/div[' 19 '@]/ul[@id="index_data_' + str(index) + '"]/li/span/a/text()').getall() 20 names = response.xpath( 21 '//div[@]/div[@]/div[@]/div[' 22 '@]/ul[@id="index_data_' + str(index) + '"]/li/a/text()').getall() 23 # print('ids = '+str(ids)) 24 # print('names = ' + str(names)) 25 for i in range(len(ids)): 26 item['stock_type'] = style 27 item['stock_id'] = str(ids[i]) 28 item['stock_name'] = str(names[i]) 29 yield item
資料處理
在Pipeline中,對抓取的資料進行處理,本例為簡便,在控制進行輸出,如下所示:
1 class StockstarPipeline: 2 def process_item(self, item, spider): 3 print('股票型別>>>>'+item['stock_type']+'股票代碼>>>>'+item['stock_id']+'股票名稱>>>>'+item['stock_name']) 4 return item
注意:在對item進行賦值時,只能通過item['key']=value的方式進行賦值,不可以通過item.key=value的方式賦值,
Scrapy配置
通過settings.py檔案進行配置,包括請求頭,管道,robots協議等內容,如下所示:
1 # Scrapy settings for stockstar project 2 # 3 # For simplicity, this file contains only settings considered important or 4 # commonly used. You can find more settings consulting the documentation: 5 # 6 # https://docs.scrapy.org/en/latest/topics/settings.html 7 # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 8 # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 9 10 BOT_NAME = 'stockstar' 11 12 SPIDER_MODULES = ['stockstar.spiders'] 13 NEWSPIDER_MODULE = 'stockstar.spiders' 14 15 16 # Crawl responsibly by identifying yourself (and your website) on the user-agent 17 #USER_AGENT = 'stockstar (+http://www.yourdomain.com)' 18 19 # Obey robots.txt rules 是否遵守robots協議 20 ROBOTSTXT_OBEY = False 21 22 # Configure maximum concurrent requests performed by Scrapy (default: 16) 23 #CONCURRENT_REQUESTS = 32 24 25 # Configure a delay for requests for the same website (default: 0) 26 # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay 27 # See also autothrottle settings and docs 28 #DOWNLOAD_DELAY = 3 29 # The download delay setting will honor only one of: 30 #CONCURRENT_REQUESTS_PER_DOMAIN = 16 31 #CONCURRENT_REQUESTS_PER_IP = 16 32 33 # Disable cookies (enabled by default) 34 #COOKIES_ENABLED = False 35 36 # Disable Telnet Console (enabled by default) 37 #TELNETCONSOLE_ENABLED = False 38 39 # Override the default request headers: 40 DEFAULT_REQUEST_HEADERS = { 41 # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 42 'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36' #, 43 # 'Accept-Language': 'en,zh-CN,zh;q=0.9' 44 } 45 46 # Enable or disable spider middlewares 47 # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html 48 #SPIDER_MIDDLEWARES = { 49 # 'stockstar.middlewares.StockstarSpiderMiddleware': 543, 50 #} 51 52 # Enable or disable downloader middlewares 53 # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 54 #DOWNLOADER_MIDDLEWARES = { 55 # 'stockstar.middlewares.StockstarDownloaderMiddleware': 543, 56 #} 57 58 # Enable or disable extensions 59 # See https://docs.scrapy.org/en/latest/topics/extensions.html 60 #EXTENSIONS = { 61 # 'scrapy.extensions.telnet.TelnetConsole': None, 62 #} 63 64 # Configure item pipelines 65 # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html 66 ITEM_PIPELINES = { 67 'stockstar.pipelines.StockstarPipeline': 300, 68 } 69 70 # Enable and configure the AutoThrottle extension (disabled by default) 71 # See https://docs.scrapy.org/en/latest/topics/autothrottle.html 72 #AUTOTHROTTLE_ENABLED = True 73 # The initial download delay 74 #AUTOTHROTTLE_START_DELAY = 5 75 # The maximum download delay to be set in case of high latencies 76 #AUTOTHROTTLE_MAX_DELAY = 60 77 # The average number of requests Scrapy should be sending in parallel to 78 # each remote server 79 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 80 # Enable showing throttling stats for every response received: 81 #AUTOTHROTTLE_DEBUG = False 82 83 # Enable and configure HTTP caching (disabled by default) 84 # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 85 #HTTPCACHE_ENABLED = True 86 #HTTPCACHE_EXPIRATION_SECS = 0 87 #HTTPCACHE_DIR = 'httpcache' 88 #HTTPCACHE_IGNORE_HTTP_CODES = [] 89 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'View Code
Scrapy運行
因scrapy是各個獨立的頁面,只能通過終端命令列的方式運行,格式為:scrapy crawl 爬蟲名稱,如下所示:
1 scrapy crawl stock
如下圖所示:

備注
本例內容相對簡單,僅為說明Scrapy的常見用法,爬取的內容都是第一次請求能夠獲取到原始碼的內容,即所見即所得,
遺留兩個小問題:
- 對于爬取的內容需要翻頁才能完成,即多次請求,如何處理?
- 對于爬取的內容是異步傳輸,頁面請求只是獲取一個框架,內容是異步填充,即常見的ajax方式,如何處理?
以上兩個問題,待后續遇到時,再進一步分析,一首陶淵明的歸田園居,與君共享,
歸園田居(其一)
【作者】陶淵明 【朝代】魏晉少無適俗韻,性本愛丘山,誤落塵網中,一去三十年,
羈鳥戀舊林,池魚思故淵,開荒南野際,守拙歸園田,
方宅十余畝,草屋八九間,榆柳蔭后檐,桃李羅堂前,
曖曖遠人村,依依墟里煙,狗吠深巷中,雞鳴桑樹顛,
戶庭無塵雜,虛室有余閑,久在樊籠里,復得返自然,
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/293692.html
標籤:其他
