[IoT Identification] cve_details vulnerability crawler && device fingerprint database crawler

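Design requirements
Design and build a vulnerability scanning platform that meets the following requirements:

(1) Learn web crawling; crawl vulnerability and device details to build a vulnerability database and a device fingerprint database.

(2) Scan the network with nmap, using zgrab2 as an auxiliary scanner.

(3) Implement the device fingerprinting and vulnerability matching programs.

(4) Verify vulnerabilities with nessus.

(5) Build a visualization platform.

(6) Complete the design report.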

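Crawler:

In this project, Python is used to crawl no fewer than 100,000 vulnerability records from vulnerability sites such as CVE_Details, CNVD, CNNVD, securityfocus, and ics_cnvd. The fields collected include "CVE ID", "severity", "vulnerability type", "vendor", "model", "device type", and "firmware version", and they are used to build a CVE vulnerability-to-device mapping database.

Python is also used to crawl no fewer than 100,000 device records from e-commerce sites such as JD.com, Amazon, and Taobao. The fields collected include "device type", "device brand", and "device model", and they are used to build a device fingerprint database.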

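Contents of this section (these are the author's notes on pitfalls; read through before trying it yourself, the code is at the end)

The scrapy framework

CVE_Details crawler code

CVE vulnerability database

settings

Device crawler (Suning)

Device crawler (JD.com)

I just want the code! (Sure thing, boss)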

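The scrapy framework

Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.

Scrapy Engine: handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.
Scheduler: accepts the Requests sent by the engine, orders and enqueues them, and hands them back when the engine asks for them.
Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider.
Spider: processes the Responses, extracts the data needed for the Item fields, and submits any URLs that should be followed to the engine, which sends them back to the Scheduler.
Item Pipeline: post-processes the Items produced by the Spider (analysis, filtering, storage, and so on).
Downloader Middlewares: components that let you customize and extend the download process.
Spider Middlewares: components that let you customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).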

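Workflow:

1. The spider hands its initial URLs to the engine and asks the engine to pass them to the scheduler.

2. The engine passes the initial URLs to the scheduler, which enqueues them.

3. The scheduler tells the engine the requests are queued and hands a URL back to the engine to be sent to the downloader.

4. The engine passes the URL to the downloader, which downloads the page source.

5. The downloader tells the engine the download is done and hands over the page source as a response.

6. The engine passes the response to the spider, which parses it and extracts data.

7. The spider hands the extracted data to the engine, asking it to enqueue any new URLs in the scheduler and to send the items to the Item Pipelines for storage.

8. The Item Pipelines store the extracted data and then tell the engine that the next URL can be processed.

9. Steps 3-8 repeat until the scheduler has no URLs left, at which point the spider closes (URLs whose download failed are sent back to be retried).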
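To make the loop in the steps above concrete, here is a toy model of it in plain Python. It is only a sketch: the in-memory FAKE_SITE, the download() stand-in, and the run() loop are all made up for illustration, and Scrapy's real engine is asynchronous and far more involved.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (data on the page, links found on the page).
FAKE_SITE = {
    "list": ("list page", ["detail-1", "detail-2"]),
    "detail-1": ("item 1", []),
    "detail-2": ("item 2", []),
}


def download(url):
    """Downloader stand-in: look the page up instead of doing network I/O."""
    return FAKE_SITE[url]


def spider_parse(response):
    """Spider stand-in: yield follow-up URLs as str and scraped items as dict."""
    data, links = response
    for link in links:
        yield link
    if not links:
        yield {"data": data}


def run(start_urls):
    scheduler = deque(start_urls)       # steps 1-2: initial URLs are enqueued
    stored = []                         # plays the role of the Item Pipeline
    while scheduler:                    # step 9: loop until the queue is empty
        url = scheduler.popleft()       # step 3: the scheduler hands over a URL
        response = download(url)        # steps 4-5: download and get a response
        for result in spider_parse(response):   # step 6: the spider parses it
            if isinstance(result, str):
                scheduler.append(result)        # step 7: new URL is enqueued
            else:
                stored.append(result)           # steps 7-8: item is stored
    return stored


print(run(["list"]))
```

The point of the sketch is only the division of labour: the spider never downloads anything itself, it just yields URLs and items and lets the loop decide what happens to them.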

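Installing scrapy

Install it directly with anaconda or pip; installing into a venv virtual environment is recommended.

conda install scrapy or pip install scrapy

You can type scrapy on the command line to check that the installation succeeded.

Creating a project

scrapy startproject project_name

This creates a project folder with the following directory structure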

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

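These files are:

scrapy.cfg: the project's configuration file; it can be ignored for now.

tutorial/: the project's Python module.

tutorial/items.py: the project's item definitions.

An Item is a container for scraped data. It is used much like a Python dict, with an extra safeguard: assigning to a field that was not declared raises an error, which catches typos in field names.

tutorial/pipelines.py: the project's pipelines file.

Scrapy uses item pipelines to store data. A new Scrapy project comes with a pipelines.py file containing a default Pipeline class; for example, we can write a method there that saves the extracted item data to a MySQL database.

tutorial/settings.py: the project's settings file.

settings.py is one of the more important configuration files in Scrapy, and a great many things can be configured in it.

tutorial/spiders/: the directory for spider code; spider files go here.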
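The "safeguard against misspelled fields" that an Item provides can be imitated in a few lines of plain Python. This is a toy stand-in, not how Scrapy implements Item; the class names StrictItem and CveRecord are invented for the example.

```python
class StrictItem(dict):
    """Dict-like container that only accepts fields declared up front."""

    fields = ()  # subclasses list their allowed field names here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class CveRecord(StrictItem):
    # Hypothetical subset of the fields used later in this article.
    fields = ("cveid", "score")


record = CveRecord()
record["cveid"] = "CVE-1999-1567"       # declared field: accepted
try:
    record["scroe"] = "7.5"             # misspelled field name: rejected
except KeyError as exc:
    error = str(exc)
    print(error)
```

With a plain dict the misspelled key would be stored silently and only show up much later as a missing column.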

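Creating a spider

scrapy genspider spider_name domain_to_crawl # the spider name must not clash with the project name; the domain is the bare site domain, e.g. baidu.com

The generated spider file is placed in the project's spiders/ directory as spider_name.py.

Running the spider

scrapy crawl spider_name    # note: this is the spider's name, not the spider's .py file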

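CVE_Details crawler

Should we start with the settings or with the web pages?

No objections, so web pages it is!

Visiting https://www.cvedetails.com takes you to the CVE_Details home page, which is our main source of vulnerability data. Under Browse --> Vulnerabilities By Date on the left you will find the vulnerability counts for every year on record. Crawling by year -> month -> page is too much trouble, so we skip the month and crawl year by year.

Click into any year and the URL is https://www.cvedetails.com/vulnerability-list/year-1999/vulnerabilities.html. It looks as if the year can simply be swapped out, and trying it confirms that it can. The number of vulnerabilities and pages differs from year to year, and the maximum page number is awkward to get at, so we take the total vulnerability count shown on the page instead and divide it by 50, rounding up, to get the number of pages.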
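As a quick check of that arithmetic, here is the rounding-up division on its own. The helper name page_count is made up; the page size of 50 comes from the list pages described above.

```python
from math import ceil


def page_count(total_items, per_page=50):
    """Number of list pages needed to show total_items entries."""
    return ceil(total_items / per_page)


print(page_count(894))
```

For the 894 entries listed for 1999 this gives 18 pages: 17 full pages plus one partial page.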

Looking at the URL of each page, you will find that from page 2 onward it becomes more complicated:

https://www.cvedetails.com/vulnerability-list.php?vendor_id=0&product_id=0&version_id=0&page=3&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=1999&month=0&cweid=0&order=1&trc=894&sha=8fdcb89732c98600636042e1eff8c1b2ff5cb25d

After some experimenting, a few parameters turn out to matter: page (the page number), year, and trc (the total count; leaving it out does no harm).
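One way to see which parameters a URL like this carries is to split its query string with the standard library. The URL below is an abbreviated copy of the one above (the op* filter flags are left out), and whether the site accepts a query rebuilt from only these three parameters is not something this sketch verifies.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Abbreviated copy of the page-3 URL; the op* filter flags are omitted here.
url = ("https://www.cvedetails.com/vulnerability-list.php"
       "?vendor_id=0&product_id=0&version_id=0&page=3&year=1999&month=0"
       "&cweid=0&order=1&trc=894&sha=8fdcb89732c98600636042e1eff8c1b2ff5cb25d")

params = parse_qs(urlsplit(url).query)
important = {name: params[name][0] for name in ("page", "year", "trc")}

# Rebuild a query string for the following page from the same parameters.
next_page = dict(important, page="4")
next_query = urlencode(next_page)

print(important)
print(next_query)
```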

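Problem solved: build a URL from the year --> get the page count and build the URL of each page --> collect the link to every CVE --> visit each CVE page, scrape the fields, and send them to the pipeline.

The most important tool is the XPath selector; the introduction in the official documentation is recommended reading. There is also a very convenient way to experiment. For the first step, for example, we visit the URL built from the year and use a selector to get the count, which can be tried by typing the following in cmd: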

scrapy shell https://www.cvedetails.com/vulnerability-list/year-1999/vulnerabilities.html

This makes debugging selectors very convenient; I usually test my selectors here.

Back on the web page, all we need is the b tag inside the paging div.

Trying it in cmd:

>>> response.selector.xpath('//div[@class="paging"]')                                                                
[<Selector xpath='//div[@class="paging"]' data='<div class="paging" id="pagingt" styl...'>, <Selector xpath='//div[@class="paging"]' data='<div class="paging" id="pagingb">\n\tTo...'>]

>>> response.selector.xpath('//div[@class="paging"]').get()       
'<div class="paging" id="pagingt" style="display:none; clear:both;"></div>'

>>> response.selector.xpath('//div[@class="paging"]/b').get()                                                          
'<b>894</b>'

>>> response.selector.xpath('//div[@class="paging"]/b/text()').get() 
'894'

Easy! Now we can start writing. The generated .py file should contain some default code; with a few changes it is ready to use.

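class CveDetailSpider(scrapy.Spider):
    name = 'cve_detail'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['cvedetails.com']
    start_urls = [
        "https://www.cvedetails.com/vulnerability-list/year-" + str(i) + "/vulnerabilities.html" for i in range(1999, 2021)
    ]		# consider range(2020, 1998, -1) here to crawl newest-first
    
    def parse(self, response):
        # Get the page count and generate the page URLs
        # Number of CVEs listed for this year
        nums = response.selector.xpath('//div[@id="pagingb"]/b/text()').get()
        # Round up to get the number of pages
        pages = ceil(int(nums)/50)
        # Loop over the years 1999-2020
        # NOTE: parse() already runs once per start URL, so this extra loop requests
        # every year again each time; it is one cause of the duplicate crawling
        # discussed later in the article
        for year in range(1999, 2021):
            # Loop over the pages
            for page in range(1, pages+1):
                # Build the page URL from page, year and nums
                newurl = self.get_url(str(page), str(year), str(nums))
                yield scrapy.Request(url=newurl, callback=self.parse1, dont_filter=True)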

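By default Scrapy takes its starting URLs from start_urls[] and calls parse() on each response. Note that ceil() has to be imported from the math library: from math import ceil

yield in a nutshell: it hands a value back to the caller much like return does, and it also turns the function into a generator, so execution resumes right after the yield the next time a value is requested.

get_url() is a function I wrote to generate the URL; the code is as follows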

def get_url(self, page, year, trc):
        return "https://www.cvedetails.com/vulnerability-list.php?vendor_id=0&product_id=0&version_id=0&page={}&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year={}&month=0&cweid=0&order=1&trc={}&sha=ef7bb39664f094781e7b403da0e482830f5837d6".format(page, year, trc)

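yield scrapy.Request() is given a url and a callback here (plus dont_filter=True, which switches off Scrapy's duplicate-request filter for that request). I wrote another callback, parse1(), to handle parsing of the next stage of pages.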
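If yield is unfamiliar, its behaviour is easy to observe outside Scrapy. The function name numbered_pages is invented for this stand-alone illustration.

```python
def numbered_pages(last_page):
    """Yield page numbers one at a time instead of returning a full list."""
    for page in range(1, last_page + 1):
        yield page          # pause here and hand one value to the caller


pages = numbered_pages(3)
first = next(pages)          # runs the body up to the first yield
rest = list(pages)           # resumes after the yield and drains the rest

print(first, rest)
```

A Scrapy callback works the same way: each yielded Request or item is handed to the framework, and the callback continues from where it stopped.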

Continuing the analysis in cmd (leave the previous shell session with exit()). The URL contains & characters, so it has to be wrapped in double quotes on the command line:

scrapy shell "https://www.cvedetails.com/vulnerability-list.php?vendor_id=0&product_id=0&version_id=0&page=3&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=1999&month=0&cweid=0&order=1&trc=894&sha=8fdcb89732c98600636042e1eff8c1b2ff5cb25d"

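This page is simple: we only need to collect the links one by one, and they sit in the a tags that wrap the CVE IDs.

A quick try shows that response.selector.xpath('//div[@id="searchresults"]/table/tr[@class="srrowns"]/td[@nowrap]/a/@href').get() locates them, but get() returns only the first match; getall() returns every match on the page.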

>>> response.selector.xpath('//div[@id="searchresults"]/table/tr[@class="srrowns"]/td[@nowrap]/a/@href').getall() 
['/cve/CVE-2019-1020019/', '/cve/CVE-2019-1020018/', '/cve/CVE-2019-1020017/', '/cve/CVE-2019-1020016/', '/cve/CVE-2019-1020015/', '/cve/CVE-2019-1020014/', '/cve/CVE-2019-1020013/', '/cve/CVE-2019-1020012/', '/cve/CVE-2019-1020011/', '/cve/CVE-2019-1020010/', '/cve/CVE-2019-1020009/', '/cve/CVE-2019-1020008/', '/cve/CVE-2019-1020007/', '/cve/CVE-2019-1020006/', '/cve/CVE-2019-1020005/', '/cve/CVE-2019-1020004/', '/cve/CVE-2019-1020003/', '/cve/CVE-2019-1020002/', '/cve/CVE-2019-1020001/', '/cve/CVE-2019-1010319/', '/cve/CVE-2019-1010318/', '/cve/CVE-2019-1010317/', '/cve/CVE-2019-1010316/', '/cve/CVE-2019-1010315/', '/cve/CVE-2019-1010314/', '/cve/CVE-2019-1010312/', '/cve/CVE-2019-1010311/', '/cve/CVE-2019-1010310/', '/cve/CVE-2019-1010309/', '/cve/CVE-2019-1010308/', '/cve/CVE-2019-1010307/', '/cve/CVE-2019-1010306/', '/cve/CVE-2019-1010305/', '/cve/CVE-2019-1010304/', '/cve/CVE-2019-1010302/', '/cve/CVE-2019-1010301/', '/cve/CVE-2019-1010300/', '/cve/CVE-2019-1010299/', '/cve/CVE-2019-1010298/', '/cve/CVE-2019-1010297/', '/cve/CVE-2019-1010296/', '/cve/CVE-2019-1010295/', '/cve/CVE-2019-1010294/', '/cve/CVE-2019-1010293/', '/cve/CVE-2019-1010292/', '/cve/CVE-2019-1010290/', '/cve/CVE-2019-1010287/', '/cve/CVE-2019-1010283/', '/cve/CVE-2019-1010279/', '/cve/CVE-2019-1010275/']

Clicking into any of them shows that the next hop is "https://www.cvedetails.com" + the scraped href, so parse1() is easy to write as well.
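Instead of concatenating strings, the absolute URL can also be derived from the page it was found on. A minimal sketch with the standard library follows; the list-page URL is abbreviated. Inside a spider, response.urljoin(href) should give the same result.

```python
from urllib.parse import urljoin

list_page = "https://www.cvedetails.com/vulnerability-list.php?page=3&year=1999"  # abbreviated
href = "/cve/CVE-2019-1020019/"

detail_url = urljoin(list_page, href)
print(detail_url)
```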

def parse1(self, response):
    # Extract the list of detail-page hrefs with XPath
    detailurls = response.selector.xpath('//div[@id="searchresults"]/table/tr[@class="srrowns"]/td[@nowrap]/a/@href').getall()
    for detailurl in detailurls:
        # Build the absolute URL of each detail page
        durl = "https://www.cvedetails.com" + detailurl
        yield scrapy.Request(url=durl, callback=self.parse2, dont_filter=True)

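Here yield is used again, this time with parse2() as the callback. Yes, I am a naming genius.

Same routine as before, cmd:

scrapy shell https://www.cvedetails.com/cve/CVE-1999-1567/

Locate the target fields to scrape: CVE ID, severity, vulnerability type, vendor, model, device type, firmware version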

# CVE ID
cveid = response.selector.xpath('//h1/a/text()').get()
# Severity (CVSS score)
score = response.selector.xpath('//div[@class="cvssbox"]/text()').get()

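Note that on some pages the score is 0.0, which may mean the information is missing or withheld. Either way the page shows nothing, so add a check and skip those pages.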

if score == '0.0':
    return None

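Next is the vulnerability type, which is a little awkward: when it is empty it is hard to locate, while when it has text it can be targeted directly. So I take the second-to-last row of the whole table and use a re regular expression to match the text in its tag. If findall() finds nothing it returns an empty list, which I like: two lines of code and done.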

vulntype =  re.findall(r'">(.*?)</span>', response.selector.xpath('//table[@id="cvssscorestable"]/tr').getall()[-2])
vulntype = '' if vulntype == [] else vulntype[0]
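The empty-list fallback can be tried on its own. The two HTML rows below are hand-made examples of "a row with a type" and "a row without one"; they are assumptions for illustration, not copied from the site.

```python
import re


def first_span_text(row_html):
    """Return the text of the first <span> in the row, or '' if there is none."""
    found = re.findall(r'">(.*?)</span>', row_html)
    return found[0] if found else ''


# Hypothetical rows in the spirit of the score table described above.
row_with_type = '<tr><th>Vulnerability Type(s)</th><td><span class="vt_dos">Denial Of Service</span></td></tr>'
row_without_type = '<tr><th>Vulnerability Type(s)</th><td></td></tr>'

print(repr(first_span_text(row_with_type)))
print(repr(first_span_text(row_without_type)))
```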

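The devices are more troublesome, because one vulnerability may well affect many devices and many versions of each. In short, just grab every row of the device table.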

>>> response.selector.xpath('//table[@id="vulnprodstable"]/tr').getall()[1:]           
['<tr>\n\t\t\t\t\t\t\t<td class="num">\n\t\t\t\t\t\t\t\t1\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t\t<td>\n\t\t\t\t\t\t\t\tApplication\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t\t<td>\n\t\t\t\t\t\t\t\t<a href="//www.cvedetails.com/vendor/216/Seapine-Software.html" title="Details for Seapine Software">Seapine Software</a>\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t\t<td>\n\t\t\t\t\t\t\t\t<a href="//www.cvedetails.com/product/380/Seapine-Software-Testtrack.html?vendor_id=216" title="Product Details Seapine Software Testtrack">Testtrack</a>\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t\t<td>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t\t<td>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t\t<td>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t\t<td>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<td>\n\t\t\t\t\t\t\t\t <a href="/version/1106/Seapine-Software-Testtrack-.html" title="Seapine Software Testtrack ">Version Details</a>&amp;nbsp<a href="/vulnerability-list/vendor_id-216/product_id-380/version_id-1106/Seapine-Software-Testtrack-.html" title="Vulnerabilities of Seapine Software Testtrack ">Vulnerabilities</a>\t\t\t\t\t\t\t</td>\n\t\t\t\t\t\t</tr>']

Ugly, but no matter: two rules match it directly

rule1 = re.compile(r'<a .*>(.*)</a>')
rule2 = re.compile(r'<td>\s+(.*?)\s+</td>')
vendor,product,_ = rule1.findall(make)
producttype,_,_,version,_,_,_,_ = rule2.findall(make)

Hmm, if regular expressions give you trouble, look them up in a regex reference; I look them up every time I use them anyway.
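To see what the two rules actually return, they can be run against a small row. The row below is a shortened, hand-made imitation of the output shown earlier (one numbering cell, then type, vendor, product, four version-related cells, and a links cell); the hrefs and the version value 1.0 are invented.

```python
import re

rule1 = re.compile(r'<a .*>(.*)</a>')          # text of the last link on each line
rule2 = re.compile(r'<td>\s+(.*?)\s+</td>')    # content of each plain <td> cell

# Hand-made row in the same layout as the real table rows.
row = (
    '<tr>\n<td class="num">\n 1 </td>\n'
    '<td>\n Application </td>\n'
    '<td>\n <a href="/vendor/216/" title="v">Seapine Software</a> </td>\n'
    '<td>\n <a href="/product/380/" title="p">Testtrack</a> </td>\n'
    '<td>\n 1.0 </td>\n<td>\n </td>\n<td>\n </td>\n<td>\n </td>\n'
    '<td>\n <a href="/version/1106/">Version Details</a>&nbsp;<a href="/list/">Vulnerabilities</a> </td>\n'
    '</tr>'
)

# rule1 matches once per line that contains links: vendor, product, link cell.
vendor, product, _ = rule1.findall(row)
# rule2 skips the <td class="num"> cell and returns the remaining eight cells.
producttype, _, _, version, _, _, _, _ = rule2.findall(row)

print(vendor, product, producttype, version)
```

Both unpackings depend on the exact number of cells and links in a row, so a row with a different layout raises ValueError rather than silently producing wrong fields.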

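Finally, instantiate the item, fill in each device row together with cveid and the other fields, and yield it to the pipeline, which stores it in the database. Below is the code for parse2()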

def parse2(self, response):
    # CVE ID, severity, vulnerability type, vendor, model, device type, firmware version
    cveid = response.selector.xpath('//h1/a/text()').get()
    score = response.selector.xpath('//div[@class="cvssbox"]/text()').get()
    if score == '0.0':
        return None
    vulntype =  re.findall(r'">(.*?)</span>', response.selector.xpath('//table[@id="cvssscorestable"]/tr').getall()[-2])
    vulntype = '' if vulntype == [] else vulntype[0]
    makes = response.selector.xpath('//table[@id="vulnprodstable"]/tr').getall()[1:]   
    rule1 = re.compile(r'<a .*>(.*)</a>')
    rule2 = re.compile(r'<td>\s+(.*?)\s+</td>')
    for make in makes:
        vendor,product,_ = rule1.findall(make)
        producttype,_,_,version,_,_,_,_ = rule2.findall(make)
        item = CveDetailsItem()
        item['cveid'],item['score'],item['vulntype'],item['vendor'],item['product'],item['producttype'],item['version'] = cveid,score,vulntype,vendor,product,producttype,version
        yield item

Here is the complete spider code as well

# -*- coding: utf-8 -*-
import scrapy
from math import ceil
import re
from cve_details.items import CveDetailsItem

class CveDetailSpider(scrapy.Spider):
    name = 'cve_detail'
    allowed_domains = ['cvedetails.com']   # bare domain name, not a URL
    start_urls = [
        "https://www.cvedetails.com/vulnerability-list/year-" + str(i) + "/vulnerabilities.html" for i in range(1999, 2021)
    ]

    def get_url(self, page, year, trc):
        return "https://www.cvedetails.com/vulnerability-list.php?vendor_id=0&product_id=0&version_id=0&page={}&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year={}&month=0&cweid=0&order=1&trc={}&sha=ef7bb39664f094781e7b403da0e482830f5837d6".format(page, year, trc)

    def parse(self, response):
        # Get the page count and generate the page URLs
        nums = response.selector.xpath('//div[@id="pagingb"]/b/text()').get()   # number of CVEs
        pages = ceil(int(nums)/50)                                              # number of pages
        for year in range(1999, 2021):
            for page in range(1, pages+1):
                newurl = self.get_url(str(page), str(year), str(nums))
                yield scrapy.Request(url=newurl, callback=self.parse1, dont_filter=True)
    
    def parse1(self, response):
        detailurls = response.selector.xpath('//div[@id="searchresults"]/table/tr[@class="srrowns"]/td[@nowrap]/a/@href').getall()
        for detailurl in detailurls:
            durl = "https://www.cvedetails.com" + detailurl
            yield scrapy.Request(url=durl, callback=self.parse2, dont_filter=True)

    def parse2(self, response):
        # CVE ID, severity, vulnerability type, vendor, model, device type, firmware version
        cveid = response.selector.xpath('//h1/a/text()').get()
        score = response.selector.xpath('//div[@class="cvssbox"]/text()').get()
        if score == '0.0':
            return None
        vulntype =  re.findall(r'">(.*?)</span>', response.selector.xpath('//table[@id="cvssscorestable"]/tr').getall()[-2])
        vulntype = '' if vulntype == [] else vulntype[0]
        makes = response.selector.xpath('//table[@id="vulnprodstable"]/tr').getall()[1:]   
        rule1 = re.compile(r'<a .*>(.*)</a>')
        rule2 = re.compile(r'<td>\s+(.*?)\s+</td>')
        for make in makes:
            vendor,product,_ = rule1.findall(make)
            producttype,_,_,version,_,_,_,_ = rule2.findall(make)
            item = CveDetailsItem()
            item['cveid'],item['score'],item['vulntype'],item['vendor'],item['product'],item['producttype'],item['version'] = cveid,score,vulntype,vendor,product,producttype,version
            yield item
            # print(cveid,score,vulntype,vendor,product,producttype,version)

CVE vulnerability database

This part lives in pipelines.py, but since it uses the item, let's declare that in items.py first

import scrapy

class CveDetailsItem(scrapy.Item):
    # define the fields for your item here like:
    cveid = scrapy.Field()
    score = scrapy.Field()
    vulntype = scrapy.Field()
    vendor = scrapy.Field()
    product = scrapy.Field()
    producttype = scrapy.Field()
    version = scrapy.Field()

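open_spider(): called automatically when the spider is opened

close_spider(): called automatically when the spider is closed

process_item(): called for every item by each item pipeline component

Put simply: in open_spider I check the database every time the crawler starts; if the table exists it is dropped and then recreated. process_item receives the items sent by the spider and stores them in the database. close_spider is responsible for closing the door when the crawl ends.

There is not much to say here. This part is mostly about installing MySQL and connecting Navicat to the database, which is best looked up separately, so straight to the code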

import pymysql

class CveDetailsPipeline:
    tb = 'cve_details'
    number = 0

    def open_spider(self, spider):
        print("Crawl started!")
        db = spider.settings.get('MYSQL_DB_NAME','cve_db')
        host = spider.settings.get('MYSQL_HOST','127.0.0.1')
        port = spider.settings.get('MYSQL_PORT', 3306)
        user = spider.settings.get('MYSQL_USER','root')
        passwd = spider.settings.get('MYSQL_PASSWORD','root')

        self.db_conn =pymysql.connect(host=host, port=port, db=db, user=user, passwd=passwd, charset='utf8')
        self.db_cur = self.db_conn.cursor()

        self.db_cur.execute("DROP TABLE IF EXISTS %s"%self.tb)
        sql = """CREATE TABLE IF NOT EXISTS %s (
            id int PRIMARY KEY AUTO_INCREMENT, 
            cveid varchar(32) NOT NULL,
            score varchar(16),
            vulntype varchar(100),
            vendor varchar(56),
            product varchar(56),
            producttype varchar(32),
            version varchar(32)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
        """
        self.db_cur.execute(sql%self.tb)
        print('Table created!')
        

    def process_item(self, item, spider):
        if item is not None:
            # Pass the raw values as query parameters; the driver escapes them,
            # so calling escape_string() first would escape them twice
            values = (
                item['cveid'],           # CVE ID
                item['score'],           # severity (CVSS score)
                item['vulntype'],        # vulnerability type
                item['vendor'],          # vendor
                item['product'],         # model
                item['producttype'],     # device type
                item['version']          # firmware version
            )

            sql = '''INSERT INTO cve_details(cveid,score,vulntype,vendor,product,producttype,version) VALUES(%s,%s,%s,%s,%s,%s,%s)'''
            self.db_cur.execute(sql, values)
            self.number += 1
            # Commit in batches of 200 rows
            if self.number >= 200:
                self.db_conn.commit()
                self.number = 0
        return item

    def close_spider(self, spider):
        print("Crawl finished!")
        self.db_conn.commit()
        self.db_conn.close()
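The same open / process / close lifecycle can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 with an in-memory database, so the SQL dialect and the ? placeholder style differ from the pymysql code above; the class name, the reduced two-column table, the batch size of 2, and the sample values are all made up.

```python
import sqlite3


class SqlitePipelineSketch:
    """Same three hooks as the pipeline above, backed by an in-memory SQLite DB."""

    batch_size = 2          # hypothetical small batch so the demo commits early

    def open_spider(self, spider=None):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("DROP TABLE IF EXISTS cve_details")
        self.conn.execute(
            "CREATE TABLE cve_details ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, cveid TEXT NOT NULL, score TEXT)"
        )
        self.pending = 0

    def process_item(self, item, spider=None):
        # Placeholders let the driver quote the values, including quotes and None.
        self.conn.execute(
            "INSERT INTO cve_details(cveid, score) VALUES (?, ?)",
            (item["cveid"], item["score"]),
        )
        self.pending += 1
        if self.pending >= self.batch_size:
            self.conn.commit()
            self.pending = 0
        return item

    def close_spider(self, spider=None):
        self.conn.commit()
        self.rows = self.conn.execute(
            "SELECT cveid, score FROM cve_details ORDER BY id"
        ).fetchall()
        self.conn.close()


pipeline = SqlitePipelineSketch()
pipeline.open_spider()
for sample in ({"cveid": "CVE-0000-0001", "score": "5.0"},
               {"cveid": "it's; -- odd", "score": None}):
    pipeline.process_item(sample)
pipeline.close_spider()
print(pipeline.rows)
```

The second sample item shows why parameters are preferable to string formatting: a value containing a quote, or a missing score, is stored as-is instead of breaking the statement.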


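settings

Many things can be customized here; most of them are commented out by default

BOT_NAME = 'demo1'  # the name of the Scrapy project

SPIDER_MODULES = ['demo1.spiders']  # the list of modules where Scrapy looks for spiders

NEWSPIDER_MODULE = 'demo1.spiders'  # the module in which the genspider command creates new spiders

ROBOTSTXT_OBEY = False  # whether to obey robots.txt; when obeyed, the paths the site disallows are not crawled

CONCURRENT_REQUESTS = 32  # maximum number of concurrent requests, default 16
I tried 1000 and saw no real difference. These are asynchronous requests rather than threads, and DOWNLOAD_DELAY still spaces out the requests sent to the same site, so raising this number alone does not speed things up.

DOWNLOAD_DELAY = 2  # delay between requests in seconds; 1 is usually enough
The commented-out value in the generated file is 3 (Scrapy's built-in default is 0). The delay keeps the crawler from loading the server too heavily and getting blocked (IP banned).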

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
  'Referer':'https://www.cvedetails.com/vulnerability-list/',

}

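The crawler fakes its request headers so that it is not identified as a crawler; think of it as a mask. Copy the User-Agent from the browser's developer tools (F12). The Referer is the page the request came from: for example, any page opened from a Baidu search result carries a baidu.com Referer.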

ITEM_PIPELINES = {
   'cve_details.pipelines.CveDetailsPipeline': 300,
}

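This declares the pipeline. 300 is its priority; the lower the number, the higher the priority.

AUTOTHROTTLE_DEBUG = False  # when enabled, prints a lot of throttling statistics

Everything above is already present in the generated file (mostly commented out); the next two have to be written in by hand. Otherwise you may end up like me: halfway through the run a network glitch made dozens of consecutive requests fail, and they were never put back into the queue.

RETRY_ENABLED = True  # whether to enable retries

RETRY_TIMES = 5  # number of retries (Scrapy's default is 2)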

There is an even simpler approach. Remember ROBOTSTXT_OBEY in the settings? Visit the site's robots.txt directly: the URLs that may be crawled are listed there one by one.

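Think that's the end? If so, you really are finished. Take a look at the database...

Back in the cve_details database: I ran this crawler for an afternoon plus a whole night, and it had stored more than 750,000 rows without stopping. Filtering in the database showed a lot of duplicated data. Damn, the crawler had been crawling in a loop. So I stopped it and started deduplicating in the database. The statement below finds the ids of duplicates by cveid, vulntype, vendor, product, and version, and deletes those ids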

mysql> delete from cve_details where id in (select id from (select max(id) as ID,cveid,vulntype,vendor,product,version,count(*) as c from cve_details group by cveid,vulntype,vendor,product,version having c>1) as t);
Query OK, 3736 rows affected (3.81 sec)

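Each run deletes about 3,500 rows, and I had more than 750,000 rows, of which I guessed 550,000+ were duplicates. Running it by hand over and over was out of the question, so I quickly wrote a stored procedure and ran it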

CREATE DEFINER=`root`@`localhost` PROCEDURE `mydel`()
BEGIN
	while exists (select max(id) as ID,count(*) as c from cve_details group by cveid,vulntype,vendor,product,version having c>1)  do
	delete from cve_details where id in (select id from (select max(id) as ID,count(*) as c from cve_details group by cveid,vulntype,vendor,product,version having c>1) as t);
	end while;
END
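The loop is needed because each pass removes only one row per duplicate group. An alternative is to keep the lowest id of every group and delete all the others in one statement. The sketch below demonstrates the idea with Python's built-in sqlite3 on made-up rows and a reduced set of columns; it has not been run against MySQL, where the subquery on the same table would probably need to be wrapped in a derived table, as the author's statement does with "as t".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cve_details ("
    "id INTEGER PRIMARY KEY, cveid TEXT, vendor TEXT, product TEXT, version TEXT)"
)
rows = [
    ("CVE-0000-0001", "acme", "router", "1.0"),
    ("CVE-0000-0001", "acme", "router", "1.0"),   # duplicate
    ("CVE-0000-0001", "acme", "router", "1.0"),   # duplicate
    ("CVE-0000-0002", "acme", "camera", "2.0"),
]
conn.executemany(
    "INSERT INTO cve_details(cveid, vendor, product, version) VALUES (?, ?, ?, ?)",
    rows,
)

# Keep the lowest id of every group and delete all the others in one statement.
conn.execute("""
    DELETE FROM cve_details
    WHERE id NOT IN (
        SELECT MIN(id) FROM cve_details
        GROUP BY cveid, vendor, product, version
    )
""")
remaining = conn.execute("SELECT id, cveid FROM cve_details ORDER BY id").fetchall()
print(remaining)
```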

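On the i7-7700, SQL immediately took an extra 15% of CPU, with disk reads at about 1 MB/s. It still felt slow, so I went to take a shower.

Back from the shower, I found that only about 3,000 rows were left after the deletion... speechless.

To stop the loss at the source, I reworked the crawler, which solved the problem completely. Only the spider file cve_detail.py was changed, as follows:

Added goturls = set() to record the URLs already visited
Replaced the start_urls list with a start_requests() function; compared with the list, this allows a meta argument to be attached, which carries the year along and makes it easy to build the new URLs later
Added a sha variable in parse() to hold the token taken from the page. Each year has a different sha; if it is not updated, paging stops working and the first page is crawled forever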
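The spider below takes the sha with split('=')[-1], which relies on sha being the last parameter of the link. A variant that does not depend on parameter order can be sketched with the standard library; the helper name sha_from_href is invented and the href is an abbreviated example, since real paging links carry more parameters.

```python
from urllib.parse import parse_qs, urlsplit


def sha_from_href(href):
    """Return the sha query parameter of a paging link, or None if absent."""
    if href is None:
        return None
    values = parse_qs(urlsplit(href).query).get("sha")
    return values[0] if values else None


# Abbreviated example of a "Go to page 1" href.
href = "/vulnerability-list.php?page=1&year=1999&trc=894&sha=8fdcb89732c98600636042e1eff8c1b2ff5cb25d"
print(sha_from_href(href))
```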

The complete code follows:

# -*- coding: utf-8 -*-
import scrapy
from math import ceil
import re
from cve_details.items import CveDetailsItem

class CveDetailSpider(scrapy.Spider):
    name = 'cve_detail'
    allowed_domains = ['cvedetails.com']   # bare domain name, not a URL
    goturls = set()

    def start_requests(self):
        for i in range(2020, 1998, -1):
            url = "https://www.cvedetails.com/vulnerability-list/year-" + str(i) + "/vulnerabilities.html"
            yield scrapy.Request(url=url, meta={'year' : i})

    def get_url(self, page, year, trc, sha):
        return "https://www.cvedetails.com/vulnerability-list.php?vendor_id=0&product_id=0&version_id=0&page={}&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year={}&month=0&cweid=0&order=1&trc={}&sha={}".format(page, year, trc, sha)

    def parse(self, response):
        # Get the page count and generate the page URLs
        nums = response.selector.xpath('//div[@id="pagingb"]/b/text()').get()                   # number of CVEs
        pages = ceil(int(nums)/50)                                                              # number of pages
        sha = response.selector.xpath('//a[@title="Go to page 1"]/@href').get()  
        if sha is not None:
            sha = sha.split('=')[-1]                                                            # extract the sha token
        else:
            return None
        for page in range(1, pages+1):
            newurl = self.get_url(str(page), str(response.meta['year']), str(nums), sha)
            if newurl not in self.goturls:
                self.goturls.add(newurl)
                yield scrapy.Request(url=newurl, callback=self.parse1, dont_filter=True)
            else:
                print('p0: duplicate visit!!!')
                break
    
    def parse1(self, response):
        detailurls = response.selector.xpath('//div[@id="searchresults"]/table/tr[@class="srrowns"]/td[@nowrap]/a/@href').getall()
        for detailurl in detailurls:
            durl = "https://www.cvedetails.com" + detailurl
            if durl not in self.goturls:
                self.goturls.add(durl)
                yield scrapy.Request(url=durl, callback=self.parse2, dont_filter=True)
            else:
                print('p1: duplicate visit!!!')
                break

    def parse2(self, response):
        # CVE ID, severity, vulnerability type, vendor, model, device type, firmware version
        cveid = response.selector.xpath('//h1/a/text()').get()
        score = response.selector.xpath('//div[@class="cvssbox"]/text()').get()
        vulntype =  re.findall(r'">(.*?)</span>', response.selector.xpath('//table[@id="cvssscorestable"]/tr').getall()[-2])
        vulntype = '' if vulntype == [] else vulntype[0]
        makes = response.selector.xpath('//table[@id="vulnprodstable"]/tr').getall()[1:]   
        rule1 = re.compile(r'<a .*>(.*)</a>')
        rule2 = re.compile(r'<td>\s+(.*?)\s+</td>')
        for make in makes:
            if 'No vulnerable product found' in make:
                continue
            vendor,product,_ = rule1.findall(make)
            producttype,_,_,version,_,_,_,_ = rule2.findall(make)
            item = CveDetailsItem()
            item['cveid'],item['score'],item['vulntype'],item['vendor'],item['product'],item['producttype'],item['version'] = cveid,score,vulntype,vendor,product,producttype,version
            yield item

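The crawl stopped at 210,000 rows when the power went out. I then ran the database deduplication procedure from above. On the site, each device row under a vulnerability has further parameters (one of them seems to be called update), and several consecutive rows can be identical except for that one field. We did not filter on it before inserting into the database, so deduplication is still needed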

CREATE DEFINER=`root`@`localhost` PROCEDURE `mydel`()
BEGIN
	while exists (select max(id) as ID,count(*) as c from cve_details group by cveid,vulntype,vendor,product,version having c>1)  do
	delete from cve_details where id in (select id from (select max(id) as ID,count(*) as c from cve_details group by cveid,vulntype,vendor,product,version having c>1) as t);
	end while;
END
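The duplicates could also be filtered before they ever reach the database. The sketch below keeps a set of the five fields used in the GROUP BY above and lets only the first item per key through; the function names and sample items are made up. In a real Scrapy pipeline the same check would live in process_item(), where a duplicate is normally rejected by raising DropItem.

```python
def dedup_key(item):
    """Fields that define "the same row" for the purposes of this article."""
    return (item["cveid"], item["vulntype"], item["vendor"],
            item["product"], item["version"])


def unique_items(items):
    """Yield only the first item seen for each key, preserving order."""
    seen = set()
    for item in items:
        key = dedup_key(item)
        if key in seen:
            continue
        seen.add(key)
        yield item


# Made-up rows: the first two differ only in "update", the third in "version".
base = {"cveid": "CVE-0000-0001", "vulntype": "", "vendor": "acme", "product": "router"}
items = [
    dict(base, version="1.0", update="a"),
    dict(base, version="1.0", update="b"),
    dict(base, version="2.0", update="a"),
]
kept = list(unique_items(items))
print(len(kept))
```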

From 210,000 rows down to 167,000! Success!

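Device crawler (Suning)

Crawl "device type", "device brand", "device model" and similar fields, and build a device fingerprint database from them.

At this point I was dumbfounded. The CVE database is entirely in English, while device types crawled from Taobao or JD.com will be entirely in Chinese. Take the recent Xiaomi 9 vulnerability: on Taobao the device type is "小米9" and the device model is also "小米9"... How is that supposed to be mapped to CVEs? For a moment I even considered semantic similarity analysis, but that was overthinking it.

After looking at the various sites: Taobao keeps popping up login prompts and would need a simulated login, so forget it. JD.com and Amazon are both fine. For reasons of my own I chose Suning, but the sites are all similar and the approach carries over.

First, the idea. Devices that have CVEs almost always fall under keywords such as "network equipment" and "smart home", and they are only a small part of what the site sells. So instead of a mindless crawler that follows page URLs recursively, the scope can be controlled with search keywords, which seems a good approach.

Go to the Suning home page https://www.suning.com/ and search for anything: the site switches to a different subdomain, https://search.suning.com/, so the search feature evidently lives there. Take a look at its robots.txt, https://search.suning.com/robots.txt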

User-agent: EtaoSpider
Disallow: /
User-Agent: *
Disallow: /emall/
Disallow:/*.do
Disallow:/*cityId*
Disallow:/*%E4%BB%A3%E5%BC%80%E5%8F%91%E7%A5%A8*/
Disallow:/*iy*
Disallow:/*sc*

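Only these disallowed paths are listed. Incidentally, the URL-encoded string %E4%BB%A3%E5%BC%80%E5%8F%91%E7%A5%A8 above decodes to 代开发票 ("issuing invoices on someone's behalf"), ha. So let's not crawl these; they have presumably been blocked anyway.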
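The rules can be checked programmatically with Python's built-in robots.txt parser. The sketch below feeds it the first five lines quoted above. One caveat: as far as I know the standard-library parser only does plain prefix matching and does not interpret the * wildcards inside paths, so it is only used here for the two prefix rules.

```python
from urllib.robotparser import RobotFileParser

# First lines of the robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: EtaoSpider
Disallow: /
User-Agent: *
Disallow: /emall/
Disallow:/*.do
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Any crawler asking for a path under /emall/.
emall_allowed = parser.can_fetch("*", "https://search.suning.com/emall/searchV1Product.do")
# EtaoSpider is shut out of the whole site.
etao_allowed = parser.can_fetch("EtaoSpider", "https://search.suning.com/")

print(emall_allowed, etao_allowed)
```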

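Searching for "network equipment" (網路設備) leads to the page https://search.suning.com/網路設備/, which shows a great many products. However far you scroll there are always more, so the page appears to be loaded dynamically. To check, open cmd and run scrapy shell https://search.suning.com/%E7%BD%91%E7%BB%9C%E8%AE%BE%E5%A4%87/ (the percent-encoded part is the URL-encoded form of the search term, not an encryption of it, and can be copied straight from the address bar). Every product is an li tag, so print the number of li tags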

>>> len(response.selector.xpath('//ul[@class="general clearfix"]/li').getall())       
30

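Open the page in a browser, press F12, and use the element picker to find the products: there are far more than 30. To rule out caching, run a fresh search for "網路的設備" without touching the scroll wheel; counting in F12 now gives exactly 30. So scrolling down really does load more products. In the F12 Network tab select XHR, scroll down the page, and something odd shows up among the crowd of gif requests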

https://search.suning.com/emall/searchV1Product.do?keyword=%E7%BD%91%E7%BB%9C%E8%AE%BE%E5%A4%87&ci=157162&pg=01&cp=0&il=0&st=0&iy=0&isDoufu=1&isNoResult=0&n=1&sesab=BCAABBABCAAA&id=IDENTIFYING&cc=351&paging=1&sub=0&jzq=470

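Experiments show that paging is the key to switching pages. This is the point where one would take the URL into scrapy shell, but look closely: the path /emall/ is disallowed by robots.txt, so I do not recommend going any further. If the supermarket says no pets allowed, there is no need to insist on bringing your hamster in.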

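If, and I do mean if, you still want to obtain that public data, you can drive a real browser with selenium, or set ROBOTSTXT_OBEY = False in Scrapy's settings. Lower the concurrency, raise the delay, and never hit the pages so often that it costs the e-commerce servers anything; technology is only blameless as long as it does not disturb others.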

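So for this /emall/ endpoint, paging is the parameter to iterate over. Each page carries 30 products; an XPath selector picks the title under each <li> tag, whose <a> holds the product link. Then visit the link, select the three fields with XPath, and return them as items.

Since the previous crawler is still running in the background, I will create a new project: scrapy startproject devices creates the project, cd ./devices/spiders enters the spider directory, and scrapy genspider searchdevice search.suning.com creates the spider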

The code is explained directly below:

settings.py

BOT_NAME = 'devices'

SPIDER_MODULES = ['devices.spiders']
NEWSPIDER_MODULE = 'devices.spiders'

ROBOTSTXT_OBEY = False

CONCURRENT_REQUESTS = 50

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
  'Referer':'https://search.suning.com/',
}

ITEM_PIPELINES = {
   'devices.pipelines.DevicesPipeline': 300,
}

AUTOTHROTTLE_DEBUG = False
# Enable retries
RETRY_ENABLED = True
# Number of retries
RETRY_TIMES = 5

Everything in the settings has been covered already; the earlier explanation was a bit scattered, so Ctrl+F is your friend

items.py

import scrapy

class DevicesItem(scrapy.Item):
    brand = scrapy.Field()
    modlenumber = scrapy.Field()
    producttype = scrapy.Field()

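The item object declares the three fields the pipeline needs: brand, model number, and type. Ideally the names should line up with the column names of the earlier cve table, but if you can tell them apart yourself it does not matter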

searchdevice.py

import scrapy
from devices.items import DevicesItem

class SearchdeviceSpider(scrapy.Spider):
    name = 'searchdevice'
    allowed_domains = ['suning.com']   # bare domain name, not a URL

    def start_requests(self):
        keys = [
            '手機',
            '電腦',
            '電視',
            '空調',
            '影音',
            '外設',
            '數碼',
            '網路設備',
            '智能家居',
        ]
        for key in keys:
            url = 'https://search.suning.com/emall/searchV1Product.do?keyword={}&ci=157162&pg=01&cp=0&il=0&st=0&iy=0&isDoufu=1&isNoResult=0&n=1&sesab=ACAABBABCCAA&id=IDENTIFYING&cc=351&sub=0&jzq=319&paging='.format(key)
            # Keep the page number in meta, starting from 0
            yield scrapy.Request(url, dont_filter=True, meta={'i' : 0, 'url' : url})
    
    def parse(self, response):
        # Stop once we are past the last page (an empty result page has a very short body)
        if len(response.body) <= 500:
            return None
        urls = response.selector.xpath('//div[@class="title-selling-point"]/a/@href').getall()
        for url in urls:
            yield scrapy.Request("https:"+url, callback=self.parse1, dont_filter=True, meta={'i' : response.meta['i']})
        yield scrapy.Request(response.meta['url']+str(response.meta['i']+1), dont_filter=True, meta={'i' : response.meta['i']+1, 'url' : response.meta['url']}, callback=self.parse)

    def parse1(self, response):
        brand = response.selector.xpath('//div[@id="kernelParmeter"]/ul/li/span/a/text()').get()
        producttype, modlenumber = response.selector.xpath('//div[@id="kernelParmeter"]/ul/li/@title').getall()[1:3]
        # print('品牌：{}，型別：{}，型號：{}'.format(brand, producttype, modlenumber))
        item = DevicesItem()
        item['brand'], item['producttype'], item['modlenumber'] = brand, producttype, modlenumber
        return item

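When the spider starts, Scrapy reads the start_urls list or the start_requests() function by default; the first URLs all come from there. I substitute the search keyword into the URL to produce a link that can be requested directly. Note that I moved the important &paging= parameter to the end of the URL so that the page number can simply be appended in the next step; I did check that URLs in this form still work.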
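The "keyword in front, paging at the very end" construction looks like this in isolation. The endpoint is abbreviated (most query parameters from the article's URL are left out), the helper name paged_urls is invented, page numbering from 0 is an assumption, and the keyword is written in simplified Chinese because that is what the percent-encoded string in the URLs above decodes to.

```python
from urllib.parse import quote

# Abbreviated endpoint: most query parameters from the article's URL are left out.
BASE = "https://search.suning.com/emall/searchV1Product.do?keyword={}&paging="


def paged_urls(keyword, pages):
    """Return one URL per page number, with the keyword percent-encoded."""
    prefix = BASE.format(quote(keyword))
    return [prefix + str(page) for page in range(pages)]


urls = paged_urls("网络设备", 2)
for url in urls:
    print(url)
```

Keeping paging last means the next page is just the same prefix with a different number appended, which is what parse() in the spider relies on.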

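start_requests() calls scrapy.Request(), whose default callback is parse().

In parse(), a body of 500 bytes or fewer is taken to mean the page is empty, and the crawl for that keyword stops. Otherwise every product link is collected and yielded as a Request with parse1() as its callback, which extracts the data. yield does not wait for parse1() to finish: each Request is handed to the engine and scheduled, and parse() resumes inside the for loop until the loop ends. A final yield then drives the recursion: it appends the next page number to the URL built in start_requests() and requests that page.

parse1() puts the three extracted fields into an item object and returns it to the pipeline for storage.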

pipelines.py

import pymysql

class DevicesPipeline:
    tb = 'device'
    number = 0

    def open_spider(self, spider):
        print("Crawl started!")
        db = spider.settings.get('MYSQL_DB_NAME','cve_db')
        host = spider.settings.get('MYSQL_HOST','127.0.0.1')
        port = spider.settings.get('MYSQL_PORT', 3306)
        user = spider.settings.get('MYSQL_USER','root')
        passwd = spider.settings.get('MYSQL_PASSWORD','root')

        self.db_conn =pymysql.connect(host=host, port=port, db=db, user=user, passwd=passwd, charset='utf8')
        self.db_cur = self.db_conn.cursor()

        # The next three statements drop and recreate the table; comment them out when appending to existing data
        self.db_cur.execute("DROP TABLE IF EXISTS %s"%self.tb)
        sql = """CREATE TABLE IF NOT EXISTS %s (
            id int PRIMARY KEY AUTO_INCREMENT, 
            brand varchar(56),
            modlenumber varchar(32),
            producttype varchar(128)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
        """
        self.db_cur.execute(sql%self.tb)
        print('Table created!')

    def process_item(self, item, spider):
        if item is not None:
            # Parameterized query: the driver quotes the values, so product names
            # containing quotes cannot break the statement
            sql = 'INSERT INTO {}(brand,modlenumber,producttype) VALUES(%s,%s,%s)'.format(self.tb)
            self.db_cur.execute(sql, (item['brand'], item['modlenumber'], item['producttype']))
            self.number += 1
            # Commit in batches of 10 rows
            if self.number >= 10:
                self.db_conn.commit()
                self.number = 0
        return item

    def close_spider(self, spider):
        print("Crawl finished!")
        self.db_conn.commit()
        self.db_conn.close()

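This pipeline is simple: when the spider starts it connects to the database and drops and recreates the table; while the crawl runs, the items passed on from parse1() are stored straight into the database.

However, just now... all this got my IP banned by the site's anti-crawling mechanism. So I lowered the concurrency further, increased the delay, switched IP and carried on, finishing at a snail's pace with 1,300 rows. You read that right: for my keywords this site has only 1,300 entries.

Fine, I give up. Let's switch to JD.com and use selenium to simulate clicks.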

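Device crawler (JD.com)

After trying it, selenium does not feel well suited to crawling; it is really meant for test automation. It used to be handy for mouse-based CAPTCHAs, but with machine learning around, who still uses it for that? It even froze my machine. Judging by the speed, 100,000 rows would take about 9 hours.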

# -*- coding: utf-8 -*-
import scrapy
from devices_jd.items import DevicesJdItem
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  # for running a headless browser
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")

class DeviceJdSpider(scrapy.Spider):
    name = 'device_jd'
    # The search pages live on search.jd.com, so allow the whole jd.com domain
    allowed_domains = ['jd.com']
    url = 'https://search.jd.com/Search?keyword={keyword}&enc=utf-8&wq=%E6%89%8B%E6%9C%BA&pvid=24e8b53d9f164ee092d0bcabf99212d9'
    keyword = iter([
        '手機',
        '智能設備',
        '電腦',
        '游戲設備',
        '外設產品',
        '網路產品',
        '辦公設備',
        '智能家居',
        '生活電器',
        '電視',
        '空調',
        '洗衣機',
        '冰箱',
        '廚衛'
    ])

    # Create a single headless browser instance and use it everywhere
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome(options=chrome_options)

    def start_requests(self):
        # Stop once every keyword has been used
        keyword = next(self.keyword, None)
        if keyword is None:
            return
        urls = self.url.format(keyword=keyword)
        self.driver.get(urls)
        time.sleep(3)
        yield scrapy.Request(url=urls, dont_filter=True)

    def parse(self, response):
        while True:
            for _ in range(0, 170):
                # Simulate the Down arrow key to scroll and trigger lazy loading
                self.driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.ARROW_DOWN)
                time.sleep(0.01)
            time.sleep(0.5)
            try:
                producturls = self.driver.find_elements(By.XPATH, '//div[@class="p-name p-name-type-2"]/a')
                for producturl in producturls:
                    yield scrapy.Request(url=producturl.get_attribute('href'), callback=self.senditem, dont_filter=True)
            except Exception as e:
                print(e)
            # Compare by value; `is not ""` tests identity and is unreliable
            if self.driver.find_element(By.CLASS_NAME, 'pn-next').get_attribute("title") != "":
                # Simulate the Right arrow key to go to the next page
                self.driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.ARROW_RIGHT)
                time.sleep(1)
            else:
                break
        # Move on to the next keyword
        yield from self.start_requests()

    def senditem(self, response):
        producttype = response.xpath('//div[@class="item"]/a/text()').extract()[0]
        brand = response.xpath('//ul[@id="parameter-brand"]/li/a/text()').extract()[0]
        product = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li/text()').extract()
        # Drop the first 5 characters (the field label) from each entry
        productname = product[0][5:]
        productid = product[1][5:]
        item = DevicesJdItem()
        item['producttype'], item['brand'], item['productname'], item['productid'] = producttype, brand, productname, productid
        # print('Type: ' + producttype)
        # print('Brand: ' + brand)
        # print('Name: ' + productname)
        # print('ID: ' + productid)
        try:
            modlenumber = response.xpath('//dl[@class="clearfix"]/dd/text()').extract()[3]
            item['modlenumber'] = modlenumber
            # print('Model number: ' + modlenumber)
        except Exception as e:
            print(e)
            # Keep the field present so the pipeline does not hit a KeyError
            item['modlenumber'] = ''
        yield item

    def closed(self, reason):
        # Called by Scrapy when the spider closes
        self.driver.quit()

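The general idea: search by keyword, at which point the page shows 30 products. Simulate pressing the Down key until the page scrolls to the bottom, which triggers an ajax call that loads another 30. Then grab all 60 product links on the page in one go and send them to the next function, which extracts the details and stores them. Finally simulate the Right arrow key to turn the page, and repeat.

Too slow, far too slow. So I studied the page again. When scrolling triggers the dynamic load, selecting XHR in the Network tab of the F12 developer tools shows a new request, and the request URL can be found in its headers. It looks like the one below (I removed the parameters that make no difference).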

https://search.jd.com/Search/s_new.php?keyword=%E6%89%8B%E6%9C%BA%E5%8D%8E%E4%B8%BA&qrst=1&suggest=1.def.0.base&wq=%E6%89%8B%E6%9C%BA%E5%8D%8E%E4%B8%BA&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&page=2&s=27&scrolling=y&log_id=1609148178669.7303&tpl=3_M&isList=0

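After a lot of trial and error I worked out how it fits together. keyword and wq are the URL-encoded search text. One screen page corresponds to two values of page: the upper half is odd and the lower half is even. s controls which item number the listing starts from. Because each full page shows 4 to 6 ads, I add 28 to s each time page goes up by 1, leaving room for two ads; any duplicates that result can be removed in the database afterwards. log_id is the current timestamp in milliseconds with four decimal places.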
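The parameter scheme described above can be sketched as a small helper. This is only an illustration of that scheme: the function name search_params and its now argument are hypothetical, and the step of 28 is the author's own choice rather than anything documented by JD.

```python
import time
from urllib.parse import quote


def search_params(keyword, page, now=None):
    """Hypothetical helper: build the varying s_new.php parameters for one half-page."""
    now = time.time() if now is None else now
    encoded = quote(keyword)               # keyword and wq carry the same URL-encoded text
    return {
        "keyword": encoded,
        "wq": encoded,
        "page": page,                      # odd = upper half, even = lower half
        "s": 1 + 28 * (page - 1),          # s starts at 1 and grows by 28 per page step
        "log_id": "%.4f" % (now * 1000),   # milliseconds with four decimal places
    }


params = search_params("手机", 3, now=1.5)
print(params)
```

Passing now explicitly is only there to make the output reproducible; in a crawl it would be left out so that each request gets a fresh log_id.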

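The code is not hard to follow. First declare url and the search keywords in keyword. The url leaves placeholders where the parameters get inserted. I turned keyword into an iterator so that next() can drive the loop later (a for loop can also walk an iterator, it just isn't convenient here).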

class DeviceJdSpider(scrapy.Spider):
    name = 'device_jd'
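    # The search endpoint is on search.jd.com, so allow the whole jd.com domain
    allowed_domains = ['jd.com']
    # Note: the ev=exbrand_... parameter was captured with the request and decodes to 华为（HUAWEI）;
    # it appears to be a brand filter, so remove it if results should not be limited to one brand
    url = 'https://search.jd.com/Search/s_new.php?keyword={keyword}&qrst=1&suggest=1.def.0.base&wq={keyword}&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&page={page}&s={s}&scrolling=y&log_id={time}&tpl=3_M&isList=0'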
    keyword = iter([
        '手機',
        '智能設備',
        '電腦',
        '游戲設備',
        '外設產品',
        '網路產品',
        '辦公設備',
        '智能家居',
        '生活電器',
        '電視',
        '空調',
        '洗衣機',
        '冰箱',
        '廚衛'
    ])
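As a small illustration of driving an iterator with next(), the sketch below uses a shortened keyword list. Passing a default as the second argument is an addition of mine: it returns that default instead of raising StopIteration once the keywords run out, which matters inside a generator such as start_requests, where an escaping StopIteration is turned into a RuntimeError on Python 3.7 and later.

```python
keywords = iter(["手機", "電腦", "電視"])

first = next(keywords)         # takes the first keyword
second = next(keywords)        # takes the second keyword
third = next(keywords)         # takes the third keyword
done = next(keywords, None)    # the iterator is exhausted, so the default is returned

print(first, second, third, done)
```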

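Next come the start URLs. time.time() produces the timestamp, and '%.4f' % (time.time() * 1000) multiplies by 1000 to convert seconds to milliseconds, matching the format of log_id. The request calls back into parse and carries keyword, page, and s in meta. At first I had to put import time inside the function or it raised an error; the cause was that I stored the timestamp in a local variable also called time, which shadows the module. With the variable named log_id the import works in either place.

    def start_requests(self):
        import time
        # Take the next keyword once and reuse it for both the URL and meta;
        # stop when every keyword has been used
        keyword = next(self.keyword, None)
        if keyword is None:
            return
        page, s = 1, 1
        log_id = '%.4f' % (time.time() * 1000)
        urls = self.url.format(keyword=keyword, page=page, s=s, time=log_id)
        yield scrapy.Request(url=urls, dont_filter=True, meta={'keyword': keyword, 'page': page, 's': s})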
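The import error mentioned above most likely comes from Python's scoping rules rather than from Scrapy. A name that is assigned anywhere in a function is local to the whole function, so a variable called time hides the time module even on lines before the assignment. The two functions below are a hypothetical reduction of that situation.

```python
import time


def shadowed():
    ti = time.time()               # fails: `time` is a local name that has no value yet
    time = "%.4f" % (ti * 1000)
    return time


def renamed():
    ti = time.time()               # works: `time` still refers to the module
    log_id = "%.4f" % (ti * 1000)
    return log_id


try:
    shadowed()
    error = None
except UnboundLocalError as exc:
    error = exc

print(type(error).__name__)
print(renamed())
```

Importing inside the function only appears to help because the import statement itself assigns the local name before it is used; picking a different variable name removes the conflict altogether.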

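Each product URL on the page is collected and passed to the next callback, senditem. Then page is incremented and checked against the limit of 100 screen pages (page 200, since each screen is two page values). Below the limit, parse is scheduled again with the next URL; past it, start_requests runs again to build the URL for the next keyword and begin a new round.

    def parse(self, response):
        import time
        # Collect the URL of every product on the current page
        producturls = response.xpath('//div[@class="p-name p-name-type-2"]/a/@href').getall()
        for producturl in producturls:
            producturl = response.urljoin(producturl)
            yield scrapy.Request(url=producturl, callback=self.senditem, dont_filter=True)
        page = response.meta['page'] + 1
        if page <= 200:
            # Same keyword, next half-page: advance s by 28 and refresh log_id
            keyword, s = response.meta['keyword'], response.meta['s'] + 28
            log_id = '%.4f' % (time.time() * 1000)
            urls = self.url.format(keyword=keyword, page=page, s=s, time=log_id)
            yield scrapy.Request(url=urls, callback=self.parse, dont_filter=True, meta={'keyword': keyword, 'page': page, 's': s})
        else:
            # This keyword is finished: start over with the next one
            yield from self.start_requests()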
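One detail worth spelling out is how a callback hands over the requests produced by another generator such as start_requests(). A plain yield passes along the generator object itself, which Scrapy does not treat as a request, while yield from passes along each value the generator produces. The sketch below shows the difference with plain strings standing in for requests; all function names in it are hypothetical.

```python
def next_keyword_requests():
    # Stand-in for start_requests(): yields one request-like value
    yield "request for next keyword"


def parse_with_yield():
    yield "request for product page"
    yield next_keyword_requests()         # yields the generator object itself


def parse_with_yield_from():
    yield "request for product page"
    yield from next_keyword_requests()    # yields each value the generator produces


nested = list(parse_with_yield())
flat = list(parse_with_yield_from())

print(nested)
print(flat)
```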

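The last callback simply uses selectors to pick out the required fields and hands the item to the pipeline, which stores it in the database. The screenshots that accompanied this step showed where each field sits on the product page:

Type

Brand, name, ID

Model number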

    def senditem(self, response):
        producttype = response.xpath('//div[@class="item"]/a/text()').extract()[0]
        brand = response.xpath('//ul[@id="parameter-brand"]/li/a/text()').extract()[0]
        product = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li/text()').extract()
        # Drop the first 5 characters (the field label) from each entry
        productname = product[0][5:]
        productid = product[1][5:]
        item = DevicesJdItem()
        item['producttype'], item['brand'], item['productname'], item['productid'] = producttype, brand, productname, productid
        # print('Type: ' + producttype)
        # print('Brand: ' + brand)
        # print('Name: ' + productname)
        # print('ID: ' + productid)
        try:
            modlenumber = response.xpath('//dl[@class="clearfix"]/dd/text()').extract()[3]
            item['modlenumber'] = modlenumber
            # print('Model number: ' + modlenumber)
        except Exception as e:
            print(e)
            # No model number on this page: store an empty string
            item['modlenumber'] = ''
        yield item

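After too many requests, JD started requiring a login...

To keep the anti-crawling system off my back I added cookies and an IP pool here. The proxy pool project is below; it has quite a lot of stars.

proxy project on GitHub: https://github.com/jhao104/proxy_pool

Redis database: d:\redis-x64-5.0.10

After running redis-server.exe redis.windows.conf, leave that window open; do not close it.
Then open a new cmd window and run redis-cli.exe -h 127.0.0.1 -p 6379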

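It threw quite a few bugs when I ran it and I couldn't be bothered to fix them, so I removed the IP pool and dropped proxies altogether. If I get banned, I just switch IP, cookie, and keywords and keep crawling.

The first round finished without hitting the anti-crawling measures I had expected, but all the keywords together produced only about 16,000 rows. The plan was 14 keywords x 100 pages x 60 products = 84,000, so where did the rest go?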

'downloader/request_bytes': 37494632,                             # total bytes sent in requests
'downloader/request_count': 17296,                                # number of requests
'downloader/request_method_count/GET': 17296,                     # number of GET requests
'downloader/response_bytes': 546113423,                           # total bytes received in responses
'downloader/response_count': 17296,                               # number of responses
'downloader/response_status_count/200': 17296,                    # responses with status code 200
'elapsed_time_seconds': 10582.425024,                             # total run time in seconds
'finish_reason': 'finished',                                      # why the spider stopped
'finish_time': datetime.datetime(2021, 1, 3, 5, 29, 35, 462412),  # when the spider finished
'item_scraped_count': 15879,                                      # number of items scraped
'log_count/DEBUG': 33175,                                         # DEBUG-level log entries
'log_count/ERROR': 10,                                            # ERROR-level log entries
'log_count/INFO': 174,                                            # INFO-level log entries
'request_depth_max': 201,                                         # maximum request depth
'response_received_count': 17296,                                 # responses received
'scheduler/dequeued': 17296,
'scheduler/dequeued/memory': 17296,
'scheduler/enqueued': 17296,
'scheduler/enqueued/memory': 17296,
'start_time': datetime.datetime(2021, 1, 3, 2, 33, 13, 37388)     # when the spider started
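The figures in the stats dump and the planned row count can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers that appear above; the variable names are mine, and it explains nothing about why rows are missing, it only quantifies the gap.

```python
from datetime import datetime

start = datetime(2021, 1, 3, 2, 33, 13, 37388)
finish = datetime(2021, 1, 3, 5, 29, 35, 462412)
elapsed = (finish - start).total_seconds()   # should agree with elapsed_time_seconds

items = 15879                                # item_scraped_count
planned = 14 * 100 * 60                      # keywords x pages x products per page

items_per_second = items / elapsed
share_of_plan = items / planned

print(f"elapsed: {elapsed:.0f} s, {items_per_second:.2f} items/s, "
      f"{share_of_plan:.1%} of the planned {planned} rows")
```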

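Hmm, I'll come back and fill this in once I understand it. Time is tight, so this will do for now.

A final deduplication pass over the scraped data finishes the job.

I just want the code! (Sure thing, bro)

Both are Scrapy spiders; install the Scrapy framework and set up a MySQL database and they will run.

All configuration lives in settings.py. The device information crawler needs a cookie copied from the JD site, preferably from a logged-in session. To run another round after one finishes, change the IP, the cookie, and the keywords in spiders/device_jd.py.

GitHub links:

This is the CVE vulnerability crawler

This is the device information crawler

Deduplication procedure for the CVE vulnerability database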

CREATE DEFINER=`root`@`localhost` PROCEDURE `mydel`()
BEGIN
	while exists (select max(id) as ID,count(*) as c from cve_details group by cveid,vulntype,vendor,product,version having c>1)  do
	delete from cve_details where id in (select id from (select max(id) as ID,count(*) as c from cve_details group by cveid,vulntype,vendor,product,version having c>1) as t);
	end while;
END

Deduplication procedure for the device fingerprint database

CREATE DEFINER=`root`@`localhost` PROCEDURE `shebei`()
BEGIN
	while exists (select max(id) as ID,count(*) as c from device_jd group by productid,productname,modlenumber	having c>1)  do
	delete from device_jd where id in (select id from (select max(id) as ID,count(*) as c from device_jd group by productid,productname,modlenumber having c>1) as t);
	end while;
END
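The same deduplication idea can be tried without a MySQL server. The sketch below uses Python's built-in sqlite3 with a few made-up sample rows and a single DELETE that keeps the lowest id in each group, which is also what the looping procedures above end up keeping. It is an illustration only, not a drop-in replacement: in MySQL the subquery on the table being deleted from has to be wrapped in a derived table, as the procedures above do with "as t".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device_jd("
    "id INTEGER PRIMARY KEY, productid TEXT, productname TEXT, modlenumber TEXT)"
)
# Made-up sample rows: three copies of one product and one unique product
rows = [
    ("100", "Phone A", "A1"),
    ("100", "Phone A", "A1"),
    ("100", "Phone A", "A1"),
    ("200", "Phone B", "B1"),
]
conn.executemany(
    "INSERT INTO device_jd(productid, productname, modlenumber) VALUES (?, ?, ?)", rows
)

# Keep the lowest id in each group and delete every other row in one statement
conn.execute(
    "DELETE FROM device_jd WHERE id NOT IN ("
    "SELECT MIN(id) FROM device_jd GROUP BY productid, productname, modlenumber)"
)
conn.commit()

remaining = conn.execute("SELECT id, productid FROM device_jd ORDER BY id").fetchall()
print(remaining)
```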

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/244707.html
