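A possible solution
This approach is also based on ADSL dial-up. The difference is that it needs two servers capable of ADSL dial-up, and the crawler uses those two servers as its proxies.
Suppose A and B are two servers that can both perform ADSL dial-up. The crawler runs on server C and uses A as its proxy to reach the Internet. If access is denied during crawling, it immediately switches the proxy to B and then has A redial. If access is denied again, it switches back to A as the proxy while B redials, and so on. See the diagrams below:
Using A as the proxy while B redials: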
<img data-rawheight="327" data-rawwidth="721" src="https://pic1.zhimg.com/50/9196e28cd8621a06cd0f0339f1fa765b_hd.jpg" class="origin_image zh-lightbox-thumb" width="721" data-original="https://pic1.zhimg.com/9196e28cd8621a06cd0f0339f1fa765b_r.jpg">
Using B as the proxy while A redials:
<img data-rawheight="327" data-rawwidth="721" src="https://pic2.zhimg.com/50/7afaf540be23920733bc466ae3f6f651_hd.jpg" class="origin_image zh-lightbox-thumb" width="721" data-original="https://pic2.zhimg.com/7afaf540be23920733bc466ae3f6f651_r.jpg">
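The A/B switching described above could be sketched roughly as follows. This is only a minimal illustration, not the poster's actual code: the class name `DualProxySwitcher`, the `redial` callback, the `get` callable and the set of "blocked" status codes are all assumptions made for the example. How a redial is really triggered on A or B (for instance by a remote command) is left to the caller.

```python
from typing import Callable, List, Tuple

# Assumption: the target site signals a ban with one of these HTTP status codes.
BLOCKED_STATUS = {403, 429}


class DualProxySwitcher:
    """Alternate between two dial-up proxies, redialing the one that got blocked."""

    def __init__(self, proxies: List[str], redial: Callable[[str], None]):
        if len(proxies) != 2:
            raise ValueError("exactly two proxies (A and B) are expected")
        self.proxies = proxies
        self.redial = redial  # hypothetical hook that makes a proxy server redial
        self.active = 0       # start with A

    def current(self) -> str:
        return self.proxies[self.active]

    def switch(self) -> None:
        """Make the other proxy active, then ask the blocked one to redial."""
        blocked = self.current()
        self.active = 1 - self.active
        self.redial(blocked)

    def fetch(self, url: str,
              get: Callable[[str, str], Tuple[int, str]],
              max_attempts: int = 5) -> Tuple[int, str]:
        """Call get(url, proxy), switching proxies whenever the response looks blocked."""
        for _ in range(max_attempts):
            status, body = get(url, self.current())
            if status not in BLOCKED_STATUS:
                return status, body
            self.switch()
        raise RuntimeError("still blocked after %d attempts" % max_attempts)


def requests_get(url: str, proxy: str) -> Tuple[int, str]:
    """One possible `get` callable, built on requests (imported lazily)."""
    import requests
    r = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    return r.status_code, r.text
```

In this sketch `redial` is called synchronously, so the next request waits until it returns. In practice the redial would more likely be started in the background so that the crawler can keep working through the other proxy in the meantime.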
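Crawler code (web):
import random

import requests

# Proxy addresses as listed in the original post (no port was given)
pro = ['122.152.196.126', '114.215.174.227', '119.185.30.75']
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
# This page echoes the IP address the request came from
url = 'http://www.whatismyip.com.tw/'
# Pick one proxy at random for this request
proxy = 'http://' + random.choice(pro)
r = requests.get(url, proxies={'http': proxy}, headers=head, timeout=10)
r.encoding = r.apparent_encoding
print(r.status_code)
print(r.text)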
Another example:
# coding=utf-8
import requests
from lxml import etree


def getUrl():
    # Walk through list pages 1 to 33 and crawl each of them
    for i in range(33):
        url = 'http://task.zbj.com/t-ppsj/p{}s5.html'.format(i + 1)
        spiderPage(url)


def spiderPage(url):
    if not url:
        return None
    try:
        # The list page is fetched through a fixed HTTP proxy
        proxies = {
            'http': 'http://221.202.248.52:80',
        }
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
        headers = {'User-Agent': user_agent}
        htmlText = requests.get(url, headers=headers, proxies=proxies).text
        selector = etree.HTML(htmlText)
        # Each table row holds one task
        tds = selector.xpath('//*[@class="tab-switch tab-progress"]/table/tr')
        for td in tds:
            price = td.xpath('./td/p/em/text()')
            href = td.xpath('./td/p/a/@href')
            title = td.xpath('./td/p/a/text()')
            subTitle = td.xpath('./td/p/text()')
            deadline = td.xpath('./td/span/text()')
            # xpath() returns a list; take the first match or fall back to ''
            price = price[0] if len(price) > 0 else ''
            title = title[0] if len(title) > 0 else ''
            href = href[0] if len(href) > 0 else ''
            subTitle = subTitle[0] if len(subTitle) > 0 else ''
            deadline = deadline[0] if len(deadline) > 0 else ''
            print(price, title, href, subTitle, deadline)
            print('---------------------------------------------------------------------------------------')
            spiderDetail(href)
    except Exception as e:
        print('Error', e)


def spiderDetail(url):
    # Rows without a link yield an empty href, so skip them
    if not url:
        return None
    try:
        htmlText = requests.get(url).text
        selector = etree.HTML(htmlText)
        aboutHref = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/a/@href')
        price = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/text()')
        title = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/h2/text()')
        contentDetail = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/div[1]/text()')
        publishDate = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/p/text()')
        aboutHref = aboutHref[0] if len(aboutHref) > 0 else ''  # Python's conditional expression: value_if_true if condition else value_if_false
        price = price[0] if len(price) > 0 else ''
        title = title[0] if len(title) > 0 else ''
        contentDetail = contentDetail[0] if len(contentDetail) > 0 else ''
        publishDate = publishDate[0] if len(publishDate) > 0 else ''
        print(aboutHref, price, title, contentDetail, publishDate)
    except Exception as e:
        print('Error', e)


if __name__ == '__main__':
    getUrl()
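Reply from a helpful uj5u.com user:
For this kind of requirement you should simply use HTTP proxies and switch to another proxy the moment you get banned. Your approach rests on the premise that every redial is guaranteed to obtain a different IP. What is more, if you are behind a carrier-level LAN, then no matter how often you redial, the target server still sees the same IP.
Reply from a helpful uj5u.com user:
666666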
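The caveat raised in the first reply (a redial is only useful if the exit IP really changes) could be checked with something like the sketch below. It is an illustration under assumptions: `get_ip` and `redial` are hypothetical callables supplied by the caller. `get_ip` would need to report the IP as seen from outside, for example by requesting an IP-echo page such as the one used in the first snippet through the proxy in question.

```python
from typing import Callable, Optional


def redial_until_new_ip(get_ip: Callable[[], str],
                        redial: Callable[[], None],
                        max_tries: int = 3) -> Optional[str]:
    """Redial up to max_tries times and return the new exit IP.

    Returns None if the externally visible IP never changes, which is what
    one would expect to see behind carrier-level NAT.
    """
    old_ip = get_ip()
    for _ in range(max_tries):
        redial()
        new_ip = get_ip()
        if new_ip != old_ip:
            return new_ip
    return None
```

If this returns None for a given server, redialing that server will not help against a ban, and falling back to a pool of HTTP proxies as the reply suggests is probably the more practical route.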
