
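Step 1: Use a different User-Agent for every request
First install fake_useragent with pip. The command is: pip install fake_useragent
Once it is installed, check that the install worked and try out the basic usage: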
import fake_useragent


def UserAgent():
    user = fake_useragent.UserAgent()
    # user.random returns a randomly chosen User-Agent string
    headers = {"User-Agent": user.random}
    return headers


print(UserAgent())
# e.g. {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
# The value is random, so yours will differ. If a dict like this prints, the install is OK.
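If fake_useragent is not installed or cannot load its data, a small hand-maintained pool can stand in. The sketch below is only a fallback idea and is not from the original post: UA_POOL and random_headers are hypothetical names, and the User-Agent strings are examples.

```python
import random

# Hypothetical fallback pool; the strings are examples only.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0",
]


def random_headers(pool=UA_POOL, rng=random):
    """Return a headers dict whose User-Agent is picked at random from pool."""
    return {"User-Agent": rng.choice(pool)}
```

It would be passed to requests in the same way as the dict above, for example requests.get(url, headers=random_headers()).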
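Step 2: Randomize the delay between requests (within a range)
Put time.sleep(random.randint(1,3)) inside the request loop. randint includes both endpoints, so (1,3) sleeps for 1, 2 or 3 seconds before the program continues: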
import time, random
import requests


def RequestSleep(url):
    for i in range(30):
        # pause 1, 2 or 3 seconds before each request
        time.sleep(random.randint(1, 3))
        html_file = requests.get(url)
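Whole-second pauses are still fairly regular, so a fractional delay may look a little more natural. The sketch below swaps random.randint for random.uniform and takes the fetch and sleep functions as parameters so it can be tried without touching the network. fetch_with_delay and its parameters are hypothetical names, not part of the original code.

```python
import random
import time


def fetch_with_delay(urls, fetch, low=1.0, high=3.0, sleep=time.sleep, rng=random):
    """Call fetch(url) for each url, sleeping a random low..high seconds before each call."""
    results = []
    for url in urls:
        # uniform() returns a float between low and high
        sleep(rng.uniform(low, high))
        results.append(fetch(url))
    return results
```

With requests installed, fetch_with_delay(url_list, requests.get) would behave like the loop above but with fractional pauses.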
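Step 3: Send each request through a proxy IP
Sending a large number of requests to a site in a short time can get your IP temporarily banned, and a proxy IP is a reasonable way around that. Searching Baidu for proxy IPs turns up many sites that list them. You can scrape those lists into a text file for later use, but free proxies are often slow or already dead. The alternative is to pay for a rented proxy pool, which tends to work better.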
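As a rough illustration of reading saved proxies back in, the sketch below turns text with one ip:port per line into the proxies mappings that requests accepts. The one-per-line format and the names build_proxies and load_proxies are assumptions made for this example, as is the choice to reach the proxy over plain HTTP for both schemes.

```python
def build_proxies(ip, port):
    """Build a proxies mapping in the shape requests expects."""
    # Assumption: the proxy itself is reached over plain HTTP for both schemes.
    address = "http://{}:{}".format(ip, port)
    return {"http": address, "https": address}


def load_proxies(text):
    """Parse 'ip:port' lines (blank lines ignored) into a list of proxies mappings."""
    pool = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        ip, port = line.rsplit(":", 1)
        pool.append(build_proxies(ip, port))
    return pool
```

Each item in the returned list can be passed as requests.get(url, proxies=item).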
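Proxies fall into three levels according to how much they reveal about the request:
Transparent proxy: the target server knows you are using a proxy and also knows your real IP
Anonymous proxy: the target server knows you are using a proxy but does not know your real IP
High-anonymity (elite) proxy: the target server does not know you are using a proxy and does not know your real IP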
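One common way to tell the three levels apart is to look at the headers a proxy may add, such as Via and X-Forwarded-For. The sketch below is a heuristic only, since proxies can use other headers: classify_proxy is a hypothetical name, and seen_headers is assumed to be the set of headers reported back by an echo service you control.

```python
def classify_proxy(seen_headers, real_ip):
    """Guess the anonymity level from the headers the target server received."""
    # Header names are compared case-insensitively.
    headers = {k.lower(): v for k, v in seen_headers.items()}
    forwarded = headers.get("x-forwarded-for", "")
    # Real IP leaked in X-Forwarded-For: transparent.
    if real_ip in [part.strip() for part in forwarded.split(",")]:
        return "transparent"
    # Proxy headers present but the real IP is hidden: anonymous.
    if "via" in headers or forwarded:
        return "anonymous"
    # No sign of a proxy in these headers: treated as high-anonymity.
    return "high-anonymity"
```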
Given the above, my crawler uses high-anonymity proxies for better disguise. Here is the code I wrote:
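# _*_ coding:utf-8 _*_
# Author : Renio
# TimeLog : 2020/2/2 0002 15:08
# FileName: Verfication.py
# SoftWare: PyCharm
"""
1. Generate random request headers
2. Fetch the Xici proxy list page and pick out the IP address, port, anonymity level and protocol (HTTP/HTTPS)
3. Screen again: build a proxies dict from each entry, keep the ones that work, and store them in a list
4. The first half can loop over a number of pages; the second half uses each proxies dict to visit Baidu and keeps it if status_code == 200. For now the result is saved as a txt file
"""
from random import randint, choice
from fake_useragent import UserAgent
from lxml import etree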
import requests


def RandomRequestHeader():
    """request headers random
    Direct Call
    return random headers
    """
    usa = UserAgent()
    header = {"User-Agent": usa.random}
    return header


def RequestWebFile(XICIurl):
    """
    select XICI url's ip in list
    need to call url
    The return list include HTTP(S), IPaddr and Ports
    """
    headers = RandomRequestHeader()
    web_url = requests.get(XICIurl, headers=headers, timeout=randint(1, 3))
    file = web_url.text
    html = etree.HTML(file)
    RoughScreen = html.xpath("//tr[@class='odd' or @class='']/td/text()")
    FirstArrangement = []
    for strs in RoughScreen:
        if strs.isdigit():
            # port number
            FirstArrangement.append(strs)
        elif strs.isalpha():
            # anonymity label or protocol (HTTP / HTTPS)
            FirstArrangement.append(strs)
        elif "." in strs:
            # IP address: keep any remaining cell that contains a dot
            FirstArrangement.append(strs)
    return FirstArrangement  # type is list


def ScreeningTest(*ListTable):
    """
    Call print(ScreeningTest(*RequestWebFile("https://www.xicidaili.com/nn/")))
    """
    MayUseIp = []
    headers = RandomRequestHeader()
    FileList = ListTable
    # each proxy takes four items: IP, port, anonymity label, protocol
    for x in range(0, len(FileList) - 3, 4):
        # requests matches proxies keys against the lowercase URL scheme
        scheme = FileList[x+3].lower()
        Proxies = {scheme: "{2}://{0}:{1}".format(FileList[x], FileList[x+1], scheme)}
        try:
            # the timeout value is an arbitrary choice so dead proxies do not hang the loop
            requests.get('http://wenshu.court.gov.cn/', headers=headers,
                         proxies=Proxies, timeout=5)
        except requests.RequestException:
            pass
        else:
            MayUseIp.append(Proxies)
    return MayUseIp


def VerficationProxies():
    """
    save available proxies in the list
    Direct Call
    """
    headers = RandomRequestHeader()
    ProxiesList = []
    for page in range(1, 5):
        # build the URL for each list page: .../nn/1, .../nn/2, ...
        ProxiesL = ScreeningTest(*RequestWebFile("https://www.xicidaili.com/nn/{}".format(page)))
        for proxies in ProxiesL:
            try:
                # note: requests only applies a proxy whose key matches the URL scheme,
                # so an "http" entry is not exercised by this https URL
                web_url = requests.get("https://www.baidu.com/", headers=headers,
                                       proxies=proxies, timeout=5)
            except requests.RequestException:
                continue  # skip proxies that fail
            web_url.encoding = "utf-8"
            if web_url.status_code == 200:
                print(proxies)
                ProxiesList.append(proxies)
    with open("proxies.txt", "w+", encoding="utf-8") as file:
        file.write(str(ProxiesList))
    print(len(ProxiesList))
    return ProxiesList


VerficationProxies()
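The script stores the list with str(), which has to be parsed back by hand. One possible alternative, sketched here under the assumption that every entry is a plain dict of strings, is to store the list as JSON so it can be reloaded directly. save_proxies and load_saved_proxies are hypothetical names.

```python
import json


def save_proxies(proxies_list, path):
    """Write a list of proxies dicts to path as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(proxies_list, f)


def load_saved_proxies(path):
    """Read back a list of proxies dicts written by save_proxies."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```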
Step End: Combining the three steps above gives the best results
I'll leave the integration to you (I won't show it here).
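As a starting point, here is one way the three steps might be combined. It is a sketch under stated assumptions rather than the author's integration: polite_get is a hypothetical name, get is expected to accept the same keyword arguments as requests.get, and the 10-second timeout is an arbitrary choice.

```python
import random
import time


def polite_get(url, get, user_agents, proxies_pool, low=1.0, high=3.0,
               sleep=time.sleep, rng=random):
    """Make one request with a random delay, a random User-Agent and a random proxy."""
    # Step 2: random pause before the request
    sleep(rng.uniform(low, high))
    # Step 1: random User-Agent
    headers = {"User-Agent": rng.choice(user_agents)}
    # Step 3: random proxy, or a direct connection if the pool is empty
    proxies = rng.choice(proxies_pool) if proxies_pool else None
    return get(url, headers=headers, proxies=proxies, timeout=10)
```

With requests installed, a call might look like polite_get(url, requests.get, ua_strings, ProxiesList), where ua_strings is a list of User-Agent strings and ProxiesList is the list returned by VerficationProxies().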

