#移動端爬蟲介紹
1.移動端爬蟲的思路,怎么爬取APP里面的內容:
很多人學習python,不知道從何學起,
很多人學習python,掌握了基本語法過后,不知道在哪里尋找案例上手,
很多已經做案例的人,卻不知道如何去學習更加高深的知識,
那么針對這三類人,我給大家提供一個好的學習平臺,免費領取視頻教程,電子書籍,以及課程的源代碼!
QQ群:961562169
a.手機和電腦要通信,依靠 fiddler(相當于建立一個資料中轉站);
b.訪問網頁的方式進行資料爬取;
2.fiddler及手機需要配置的東西:
a.下載并安裝fiddler,電腦與手機在 同一網路下 ;
b.電腦端配置見下圖:cmd->ipconfig可獲得ip地址,用于后面手機端的配置:
?

?
c.手機端配置(抖音及快手抓取的時候會有反扒,配置完成后如果你想抓取他的網站,他會禁止你的網路,解決辦法只能是電腦端下載手機模擬器,可以解決反爬:可能過一陣子又優化了):
#1.設定網路代理: 主機名: 電腦ip地址,不固定,隨網路變化而變化;埠是fidder埠: 可修改(根據手機不同設定方式可能有區別,但記住只要這兩個改了,就問題不大);
#2.手機下載證書(開放爬取權限): 瀏覽器輸入網址:http://ip地址:埠號,手機瀏覽器打不開,電腦下載然后手動傳到手機即可;

?

?
3.爬蟲實體:今日頭條動漫詞條圖片爬取scrapy:?

?
目錄:?
settings:
# -*- coding: utf-8 -*-
# Scrapy settings for images project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'images'
SPIDER_MODULES = ['images.spiders']
NEWSPIDER_MODULE = 'images.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'images (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'images.middlewares.ImagesSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'images.middlewares.ImagesDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
# #上面只是個訪問header,加個降低被拒絕的保險
ITEM_PIPELINES = {
'images.pipelines.ImagesPipeline': 300,
}
IMAGES_STORE ='D:\\python\\Scrapy\\image\\test'
#IMAGES_EXPIRES = 90
#IMAGES_MIN_HEIGHT = 100
#IMAGES_MIN_WIDTH = 100
#其中IMAGES_STORE是設定的是圖片保存的路徑,IMAGES_EXPIRES是設定的專案保存的最長時間,
# IMAGES_MIN_HEIGHT和IMAGES_MIN_WIDTH是設定的圖片尺寸大小
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
items:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class ImagesItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
image_urls = scrapy.Field()
images = scrapy.Field()
# image_urls和images是固定的,不能改名字
images_toutiao:
# -*- coding: utf-8 -*-
import scrapy
import re
from ..items import ImagesItem
class ImagesToutiaoSpider(scrapy.Spider):
name = 'images_toutiao'
allowed_domains = ['a3-ipv6.pstatp.com']
start_urls = ['https://a3-ipv6.pstatp.com/article/content/25/1/6819945008301343243/6819945008301343243/1/0/0'] # 構造爬取的URL
# 爬取圖片ID:
#:https://a3-ipv6.pstatp.com/article/content/25/1/6819945008301343243/6819945008301343243/1/0/0
#https://a3-ipv6.pstatp.com/article/content/25/1/6848145029051974155/6848145029051974155/1/0/0
#https://a6-ipv6.pstatp.com/article/content/25/1/6848145029051974155/6848145029051974155/1/0/0
#https://a3-ipv6.pstatp.com/article/content/25/1/6848145029051974155/6848145029051974155/1/0/0 #找了三個鏈接,是基本相同的地址
def parse(self, response):
result = response.body.decode() # 對start_urls獲取的回應進行解碼
contents = re.findall(r'},{"url":"(.*?)"}', result)
for i in range(0, len(contents)):
if len(contents[i]) <= len(contents[0]):
item = ImagesItem()
list = []
list.append(contents[i])
item['image_urls'] = contents
print(list)
yield item
else:
pass
#翻頁-爬取多個頁面的圖片
# self.page = [6819945008301343243/6819945008301343243/1/0/0,6819945008301343243/6819945008301343243/1/0/0,]
# for i in self.page #只爬前5頁
# url="https://a3-ipv6.pstatp.com/article/content/25/1/"+str(self.page)
# yield scrapy.Request(url,callback=self.parse)
pipelines:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
#這里的兩個函式get_media_requests和item_completed都是scrapy的內置函式,想重命名的就這這里操作
#可以直接復制這里的代碼就可以用了
class ImagesPipeline(ImagesPipeline):
def get_media_requests(self, item, info):
for image_url in item['image_urls']:
yield Request(image_url)
def item_completed(self, results, item, info):
image_path = [x['path'] for ok, x in results if ok]
if not image_path:
raise DropItem('Item contains no images')
#item['image_paths'] = image_path
return item
# def file_path(self, request, response=None, info=None):
# name = request.meta['name'] # 接收上面meta傳遞過來的圖片名稱
# name = re.sub(r'[?\\*|“<>:/]', '', name) # 過濾windows字串,不經過這么一個步驟,你會發現有亂碼或無法下載
# filename= name +'.jpg' #添加圖片后綴名
# return filename
上就完成了一個今日頭條APP的爬取,我們剛開始接觸也許會覺得難,會遇到一些問題,但是真的了解學會之后,會發現相對于網頁端爬取就是一個配置的問題,配置也不是很復雜,
最近在學習網頁開發的模板,做一個博客網站的開發,進度及其緩慢,只是因為自己不會寫靜態網頁,但是最近解決了,網上找了一個,自己修改了一下;這也讓我明白了一個道理,事情不做他的困難程度就會在我們心里慢慢累積,可能會累積到使我們放棄,但是你真正突破的時候發現,不過如此,此心態放在我們生活中面對困難同樣適用;
我舉個簡單的例子:學車大部分人都經歷過,我們學的時候感覺就是我們的全部,過不了就感覺人生特別失敗,但是當駕駛證到手的時候,回頭一看不過如此,懷疑當初的自己是怎么了,就這樣吧,有疑問隨時溝通,
第七篇分享,持續更新中,,
,,最近真的還挺努力的,
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/112871.html
標籤:其他
