Python mobile app crawling with Fiddler: opening a new era of data collection

# Introduction to mobile crawling
1. The approach to mobile crawling, i.e. how to scrape the content inside an app:

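a. The phone and the computer communicate through Fiddler, which acts as a relay station for the traffic;
b. The data is then crawled in the same way as a web page, by requesting the URLs seen in the capture;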
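
Once Fiddler has revealed which URL the app requests, step b amounts to issuing the same HTTP request from Python. The following is a minimal sketch using only the standard library. The helper name `build_app_request` is hypothetical, the URL is the one used as `start_urls` later in this post, and the User-Agent is the value from the Scrapy settings further down, which the app does not necessarily require.

```python
from urllib.request import Request

# URL copied from a Fiddler capture (the same one used as start_urls later in this post).
CAPTURED_URL = (
    "https://a3-ipv6.pstatp.com/article/content/25/1/"
    "6819945008301343243/6819945008301343243/1/0/0"
)

# Assumption: a browser-like User-Agent is enough; copy the real headers
# from the Fiddler capture if the server expects more.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)


def build_app_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a GET request that mimics the captured one."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_app_request(CAPTURED_URL)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```

The request is only built here, not sent, so the sketch can be run without network access.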

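2. What needs to be configured in Fiddler and on the phone:
a. Download and install Fiddler; the computer and the phone must be on the same network;
b. Computer-side configuration (shown in a screenshot in the original post): run cmd -> ipconfig to get the computer's IP address, which is needed later for the phone-side configuration:
[Figure: computer-side Fiddler configuration]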
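
As an alternative to reading the address from ipconfig, the computer's LAN address can usually be found from Python. This is a rough sketch, not part of the original setup: `get_lan_ip` is a hypothetical helper, the target address is arbitrary, and on a machine with several network interfaces the result may not be the one the phone can reach, so it is worth checking against ipconfig.

```python
import ipaddress
import socket


def get_lan_ip():
    """Return this machine's LAN IPv4 address, or 127.0.0.1 if it cannot be determined."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only makes the OS pick
        # the outgoing interface. The target address is an arbitrary assumption.
        sock.connect(("10.255.255.255", 1))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


lan_ip = get_lan_ip()
ipaddress.IPv4Address(lan_ip)  # raises ValueError if the result is not a valid IPv4 address
print("Use this as the proxy host on the phone:", lan_ip)
```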
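c. Phone-side configuration (Douyin and Kuaishou apply anti-scraping measures: once the proxy is configured and you try to capture their traffic, they block your network access. The only workaround is to install a phone emulator on the computer, which gets around the anti-scraping for now, though it may be tightened again later):
#1. Set the network proxy. Host name: the computer's IP address, which is not fixed and changes with the network. Port: the Fiddler port, which can be changed. (The setup screens differ between phones, but as long as these two values are set, things should mostly work.)
#2. Download the certificate on the phone (this grants capture permission, i.e. lets Fiddler decrypt HTTPS traffic): enter http://<IP address>:<port> in the phone's browser. If the phone's browser cannot open the page, download the certificate on the computer and transfer it to the phone manually.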

[Figure: phone-side configuration]
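
The two values entered on the phone, and the certificate page address built from them, can be written down as a small helper. This is only an illustrative sketch: `phone_proxy_settings` is a hypothetical name, 8888 is assumed as the port because it is Fiddler's usual default (the post notes it can be changed), and 192.168.1.10 is a made-up address.

```python
def phone_proxy_settings(computer_ip, fiddler_port=8888):
    """Return the values to type into the phone's Wi-Fi proxy screen and browser."""
    if not 0 < fiddler_port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return {
        "proxy_host": computer_ip,
        "proxy_port": fiddler_port,
        # Address to open in the phone's browser to download the certificate.
        "certificate_page": f"http://{computer_ip}:{fiddler_port}",
    }


# 192.168.1.10 is a made-up example address; use the one ipconfig reports.
settings = phone_proxy_settings("192.168.1.10")
for key, value in settings.items():
    print(f"{key}: {value}")
```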
3. Crawler example: scraping images from the anime topic on Toutiao (今日頭條) with Scrapy:

Project directory:
[Figure: project directory structure]

settings:

# -*- coding: utf-8 -*-

# Scrapy settings for images project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'images'

SPIDER_MODULES = ['images.spiders']
NEWSPIDER_MODULE = 'images.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'images (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'images.middlewares.ImagesSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'images.middlewares.ImagesDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
# The line above is just a request header (User-Agent); enabling it lowers the risk of requests being rejected
ITEM_PIPELINES = {
   'images.pipelines.ImagesPipeline': 300,
}
IMAGES_STORE ='D:\\python\\Scrapy\\image\\test'


#IMAGES_EXPIRES = 90
#IMAGES_MIN_HEIGHT = 100
#IMAGES_MIN_WIDTH = 100
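# IMAGES_STORE is the directory where downloaded images are saved; IMAGES_EXPIRES is the number of days
# after which a stored image counts as expired; IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH set the minimum image size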

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
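
To make the effect of the commented-out IMAGES_MIN_WIDTH and IMAGES_MIN_HEIGHT settings concrete, the size rule can be sketched as a plain function. This is a stand-alone illustration of the rule, not Scrapy's own code: `passes_size_filter` is a hypothetical name and the image sizes are made up.

```python
def passes_size_filter(width, height, min_width=0, min_height=0):
    """An image is kept only if it meets both minimums; otherwise it is dropped."""
    return width >= min_width and height >= min_height


# Values taken from the commented-out settings above.
IMAGES_MIN_WIDTH = 100
IMAGES_MIN_HEIGHT = 100

# (width, height) pairs are made-up examples.
for size in [(640, 480), (99, 300), (100, 100)]:
    kept = passes_size_filter(*size, IMAGES_MIN_WIDTH, IMAGES_MIN_HEIGHT)
    print(size, "kept" if kept else "dropped")
```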

items:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImagesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
# image_urls and images are the field names ImagesPipeline expects by default; do not rename them

images_toutiao:

# -*- coding: utf-8 -*-
import scrapy
import re
from ..items import ImagesItem

class ImagesToutiaoSpider(scrapy.Spider):
    name = 'images_toutiao'
    allowed_domains = ['a3-ipv6.pstatp.com']
    start_urls = ['https://a3-ipv6.pstatp.com/article/content/25/1/6819945008301343243/6819945008301343243/1/0/0']  # URL to crawl, taken from the Fiddler capture

    # IDs to crawl. Three sample links from the capture; apart from the host prefix and the ID,
    # the addresses are essentially the same:
    # https://a3-ipv6.pstatp.com/article/content/25/1/6819945008301343243/6819945008301343243/1/0/0
    # https://a3-ipv6.pstatp.com/article/content/25/1/6848145029051974155/6848145029051974155/1/0/0
    # https://a6-ipv6.pstatp.com/article/content/25/1/6848145029051974155/6848145029051974155/1/0/0

    def parse(self, response):
        result = response.body.decode()  # decode the response fetched from start_urls
        contents = re.findall(r'},{"url":"(.*?)"}', result)  # image URLs embedded in the body

        for url in contents:
            # Keep only matches no longer than the first one; longer matches are skipped.
            if len(url) <= len(contents[0]):
                item = ImagesItem()
                # ImagesPipeline expects a list of URLs, so wrap the single URL in a list.
                item['image_urls'] = [url]
                yield item
        # Pagination: crawl the images of several pages
        # self.page = ['6819945008301343243/6819945008301343243/1/0/0', '6819945008301343243/6819945008301343243/1/0/0']
        # for page in self.page:  # crawl only the first 5 pages
        #     url = "https://a3-ipv6.pstatp.com/article/content/25/1/" + page
        #     yield scrapy.Request(url, callback=self.parse)
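
The URL pattern and the regex extraction can be tried outside Scrapy. The sketch below is an illustration under assumptions: `build_article_url` and `extract_image_urls` are hypothetical helpers, the simpler pattern `"url":"(.*?)"` is a swapped-in alternative to the spider's pattern, and `sample_body` is a made-up fragment, not a real Toutiao response.

```python
import re

BASE_URL = "https://a3-ipv6.pstatp.com/article/content/25/1/"


def build_article_url(article_id):
    """Build a content URL following the pattern of the sample links above."""
    return f"{BASE_URL}{article_id}/{article_id}/1/0/0"


def extract_image_urls(body):
    """Return every "url" value in the body, de-duplicated, in order of appearance."""
    urls = re.findall(r'"url":"(.*?)"', body)
    return list(dict.fromkeys(urls))


# Hypothetical response fragment, for illustration only.
sample_body = (
    '{"url_list":[{"url":"http://img.example.com/1.jpg"},'
    '{"url":"http://img.example.com/2.jpg"},'
    '{"url":"http://img.example.com/3.jpg"}]}'
)

print(build_article_url("6819945008301343243"))
print(re.findall(r'},{"url":"(.*?)"}', sample_body))  # pattern used in the spider
print(extract_image_urls(sample_body))                # simpler pattern
```

On this made-up sample the spider's pattern returns only the middle URL, because each match consumes the `}` that the next match would need. Whether that matters depends on the layout of the real response.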

pipelines:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

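# get_media_requests and item_completed are methods built into Scrapy's ImagesPipeline;
# override them here if you want custom behaviour such as renaming the files.
# The code below can be copied and used as is.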

class ImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            raise DropItem('Item contains no images')
        #item['image_paths'] = image_path
        return item

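#     def file_path(self, request, response=None, info=None):  # needs "import re" at the top of this file
#         name = request.meta['name']    # image name passed in through request.meta
#         name = re.sub(r'[？\\*|“<>:/]', '', name)    # strip characters that are invalid in Windows file names; without this step you get garbled names or failed downloads
#         filename = name + '.jpg'          # add the image file extension
#         return filename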
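
The file-name cleaning step in the commented-out `file_path` can be tested on its own. A minimal sketch, assuming the goal is simply to drop characters Windows rejects: `safe_image_filename` is a hypothetical helper, and its character set is slightly wider than the one in the commented code.

```python
import re

# Characters Windows does not allow in file names, plus the full-width
# variants the commented-out code above also strips.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|？“”]')


def safe_image_filename(name, extension=".jpg"):
    """Strip forbidden characters from name and append the image extension."""
    cleaned = _FORBIDDEN.sub("", name).strip()
    if not cleaned:
        raise ValueError("name is empty after removing forbidden characters")
    return cleaned + extension


print(safe_image_filename('anime: "episode 1/2"?'))
```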

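That completes a crawl of the Toutiao app. It may seem hard when you first start, and you will run into some problems, but once you really understand it you find that, compared with crawling web pages, it is mostly a matter of configuration, and the configuration is not very complicated.
Recently I have been learning web-development templates in order to build a blog site. Progress was extremely slow, simply because I could not write static pages, but I solved that recently: I found a template online and adapted it. This taught me something: when we put a task off, its difficulty slowly builds up in our minds, possibly to the point where we give up, yet once we actually break through we find it was not such a big deal. The same attitude applies to difficulties in everyday life.
A simple example: most people have been through learning to drive. While we are learning, it feels like our whole world, and failing the test feels like failing at life. But once the licence is in hand and we look back, it was nothing much, and we wonder what was wrong with us back then. That is all for now; feel free to get in touch with any questions.
This is the seventh post in the series, with more to come. I really have been working quite hard lately.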

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/112871.html
