Install the required packages in advance:

pip install scrapy

[Screenshot: the command ran successfully]

pip install Twisted

[Screenshot: the command ran successfully]

Contents

Series index
Preface
1. Write the Tenxun.py spider file
2. Define the item fields in items.py
3. Configure the pipeline in pipelines.py
4. Configure the crawler in settings.py
5. Run the spider
Summary

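Preface

As we learn more about web scraping, we find that requests is enough to fetch a page and parse out the data we want. When a site holds a large amount of data, however, it makes sense to switch to Scrapy, an asynchronous crawling framework. Below I walk through a hands-on Scrapy example that scrapes job postings from Tencent's careers site.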

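1. Write the Tenxun.py spider file

[Figure 2]

This is the core file: the crawling logic for the target pages is designed here. I will share the source code and explain it. First, create a Scrapy project with the following command: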

scrapy startproject demoTenXun

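Figure 2 above shows the command running successfully. Figure 3 shows the page's network data as seen in the browser developer tools (F12): the job listings are loaded through AJAX, so the spider can request the JSON API directly. Next, create a file named Tenxun.py in the spiders folder under demoTenXun and write the spider there: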

import json

import scrapy

from demoTenXun.items import DemotenxunItem


class TenXunSpider(scrapy.Spider):
    name = 'Tenxun'  # spider name; this is all you need to run the spider
    allowed_domains = ['careers.tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1602982179339&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=python&pageIndex=1&pageSize=10&language=zh-cn&area=cn']
    offer = 1  # index of the page currently being requested

    def parse(self, response):
        # Parse the JSON response body
        datas = json.loads(response.text)
        for data in datas['Data']['Posts']:
            # Create an item object for each job posting
            item = DemotenxunItem()
            item['RecruitPostName'] = data['RecruitPostName']
            item['Responsibility'] = data['Responsibility']
            item['LastUpdateTime'] = data['LastUpdateTime']
            item['LocationName'] = data['LocationName']
            yield item
        self.offer += 1
        # Stop paging after page 109
        if self.offer <= 109:
            # URL of the next page
            next_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1602982179339&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=python&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(self.offer)

            yield scrapy.Request(next_url, callback=self.parse)
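
The two things the spider does on each page, building the paged URL and pulling four fields out of the JSON body, can be tried without Scrapy. The following is a minimal sketch, not part of the original project: the helper names build_page_url and extract_posts are hypothetical, and SAMPLE is made-up data that only mimics the Data/Posts layout the spider reads.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_page_url(page_index, keyword="python", page_size=10):
    """Hypothetical helper: rebuild the query URL used by the spider for one page."""
    params = {
        "timestamp": "1602982179339",
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "",
        "parentCategoryId": "",
        "attrId": "",
        "keyword": keyword,
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    # urlencode keeps the dict order, so the parameters appear as in the article
    return API_ENDPOINT + "?" + urlencode(params)


def extract_posts(response_text):
    """Hypothetical helper: return the four fields of every posting in a JSON body."""
    payload = json.loads(response_text)
    # Tolerate a missing or null Data/Posts entry instead of raising
    posts = (payload.get("Data") or {}).get("Posts") or []
    fields = ("RecruitPostName", "Responsibility", "LastUpdateTime", "LocationName")
    return [{name: post.get(name) for name in fields} for post in posts]


# Made-up sample payload, shaped like the structure the spider reads
SAMPLE = json.dumps({
    "Data": {
        "Posts": [{
            "RecruitPostName": "Example Engineer",
            "Responsibility": "Example duties",
            "LastUpdateTime": "2020-10-18",
            "LocationName": "Shenzhen",
            "Extra": 1,
        }]
    }
})

if __name__ == "__main__":
    print(build_page_url(2))
    print(extract_posts(SAMPLE))
```

Keeping these steps in plain functions makes it easy to check the paging and field selection before running a real crawl.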

2. Define the item fields in items.py

The code is as follows (example):

import scrapy


class DemotenxunItem(scrapy.Item):
    # define the fields for your item here like:
    RecruitPostName = scrapy.Field()  # job title
    Responsibility = scrapy.Field()   # job responsibilities
    LastUpdateTime = scrapy.Field()   # date posted
    LocationName = scrapy.Field()     # job location

3. Configure the pipeline in pipelines.py

The code is as follows (example):

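import json
import codecs


class DemotenxunPipeline:
    def __init__(self):
        # Open the output file in append mode. Each item is written as one
        # JSON object per line, even though the file name ends in .csv.
        self.file = codecs.open('tensun.csv', 'a', encoding='GBK')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider(spider) when the spider finishes
        self.file.close()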
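
The pipeline above writes one JSON object per line into a file named with a .csv extension. If an actual CSV file with a header row is wanted, one possible variant uses the standard csv module. This is a sketch under assumptions, not the author's pipeline: the class name CsvExportPipeline, the output file name tencent_jobs.csv and the utf-8-sig encoding are all choices made here for illustration.

```python
import csv


class CsvExportPipeline:
    """Hypothetical alternative pipeline that writes real CSV rows."""

    # Column order, matching the fields defined on the item
    FIELDS = ["RecruitPostName", "Responsibility", "LastUpdateTime", "LocationName"]

    def __init__(self, path="tencent_jobs.csv"):
        self.path = path  # assumed output file name
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings;
        # utf-8-sig writes a BOM so spreadsheet tools detect the encoding
        self.file = open(self.path, "w", newline="", encoding="utf-8-sig")
        self.writer = csv.DictWriter(
            self.file, fieldnames=self.FIELDS, extrasaction="ignore"
        )
        self.writer.writeheader()

    def process_item(self, item, spider):
        # dict(item) works for both Scrapy items and plain dicts
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        if self.file is not None:
            self.file.close()
```

To try it, the class would be registered in ITEM_PIPELINES in place of (or next to) DemotenxunPipeline. The csv module quotes fields that contain commas or line breaks, which job descriptions often do.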

4. Configure the crawler in settings.py

A few settings need to be changed:

# Uncomment this block
ITEM_PIPELINES = {
   'demoTenXun.pipelines.DemotenxunPipeline': 300,
}
# Add your request headers here
DEFAULT_REQUEST_HEADERS = {
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
    'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0'
}
# Change this to False
ROBOTSTXT_OBEY = False
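
Since the spider requests over a hundred pages from the same host, it may also be worth slowing it down. The lines below are an optional sketch of additions to settings.py, not something the original project sets. The setting names are standard Scrapy settings, but the values are illustrative guesses rather than tuned numbers.

```python
# Optional, hypothetical additions to settings.py (values are illustrative)
DOWNLOAD_DELAY = 1                   # wait about one second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests to the same host
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server latency
```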

5. Run the spider

The command format is:

scrapy crawl <your spider name>  (the value of name = '...' in Tenxun.py)

For this project, the command is:

scrapy crawl Tenxun
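
The same crawl can also be started from a Python script instead of the command line, which is convenient when running it from an IDE. The sketch below assumes the script is placed in the project root (next to scrapy.cfg) so that the project settings can be found. The function name run_spider is hypothetical.

```python
def run_spider(spider_name="Tenxun"):
    """Hypothetical helper: start the named spider from a script."""
    # Imported inside the function so that loading this file does not
    # require Scrapy until the crawl is actually started
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load settings.py of the surrounding Scrapy project
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)  # look the spider up by its name attribute
    process.start()             # blocks until the crawl has finished
```

Calling run_spider() from the project root should behave like scrapy crawl Tenxun. Note that process.start() can only be called once per Python process.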

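Summary

Note: that is all for today. This article only briefly introduced how to use Scrapy, which provides many more methods for scraping data quickly and conveniently.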

Life is short, I use Python.

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/180708.html

Advanced introduction to Python web scraping: a beginner-level hands-on case with the Scrapy framework!