Preface
Continuing with the Yuanrenxue (猿人學) challenges.
Analysis
Open the site:

Flip through the pages to find the API:

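Going by the analysis of the earlier challenges, the data API is surely the requests named 3 and 3?page=xx again. Looking at the request for this API, it is a plain GET with no request parameters, just a cookie: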

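The cookie is a sessionid. Anyone with some experience knows this one is generated server-side; some sites require it and some do not. Let's try without it first:
Whoa, it returned a chunk of JS. That's not right, the browser never gets this. So let's send the cookie and see; uncomment the cookie: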

Same result. Now it gets interesting, so let's see what this JS actually is. Formatted, it looks like this:
var x = "div@Expires@@captcha@while@length@@reverse@0xEDB88320@substr@fromCharCode@234@@0@@@11@1500@@cookie@@36@createElement@JgSe0upZ@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@eval@@window@href@GMT@String@attachEvent@false@toLowerCase@@2@Array@@@@Path@@@@f@if@@@26@@addEventListener@@@try@return@location@toString@@@@@@pathname@@@@setTimeout@@replace@a@innerHTML@@@@1589175086@else@@document@3@@@@https@join@for@@DOMContentLoaded@06@e@@@@@new@catch@var@@May@@split@@function@1@charAt@@__jsl_clearance@0xFF@firstChild@search@31@chars@charCodeAt@20@parseInt@8@@match@RegExp@Mon@challenge@@g@onreadystatechange@@d@".replace(/@*$/, "").split("@"),
y = "1L N=22(){1i('17.v=17.1e+17.29.1k(/[\\?|&]4-2k/,\\'\\')',i);1t.k='26=1q.c|e|'+(22(){1L t=[22(N){16 s('x.b('+N+')')},(22(){1L N=1t.n('1');N.1m='<1l v=\\'/\\'>1H</1l>';N=N.28.v;1L t=N.2h(/1y?:\\/\\//)[e];N=N.a(t.6).A();16 22(t){1A(1L 1H=e;1H<t.6;1H++){t[1H]=N.24(t[1H])};16 t.1z('')}})()],1H=[[[-~[-~(-~((-~{}|-~[]-~[])))]]+[-~[-~(-~((-~{}|-~[]-~[])))]],[((+!~~{})<<-~[-~-~{}])]+[((+!~~{})<<-~[-~-~{}])],[-~[-~(-~((-~{}|-~[]-~[])))]]+[((+!~~{})<<-~[-~-~{}])],[-~[]-~[]-~!/!/+(-~[]-~[])*[-~[]-~[]]]+[(+!![[][[]]][23])],[-~[]-~[]-~!/!/+(-~[]-~[])*[-~[]-~[]]]+(C-~[-~-~{}]+[]+[[]][e]),(C-~[-~-~{}]+[]+[[]][e])+(C-~[-~-~{}]+[]+[[]][e]),[-~[]-~[]-~!/!/+(-~[]-~[])*[-~[]-~[]]]+(-~[]+[]+[[]][e]),(-~[]+[]+[[]][e])+(-~[]+[]+[[]][e])+(-~[-~-~{}]+[[]][e]),(-~[]+[]+[[]][e])+(-~[]+[]+[[]][e])+[(-~~~{}<<-~~~{})+(-~~~{}<<-~~~{})],[-~[]-~[]-~!/!/+(-~[]-~[])*[-~[]-~[]]]+[-~-~{}],[((+!~~{})<<-~[-~-~{}])]+[-~-~{}],(-~[]+[]+[[]][e])+[(+!![[][[]]][23])]+[(+!![[][[]]][23])],[-~[]-~[]-~!/!/+(-~[]-~[])*[-~[]-~[]]]+[-~[]-~[]-~!/!/+(-~[]-~[])*[-~[]-~[]]],(-~[]+[]+[[]][e])+[(+!![[][[]]][23])]+[(+!![[][[]]][23])]],[[-~[-~(-~((-~{}|-~[]-~[])))]]],[[(-~~~{}<<-~~~{})+(-~~~{}<<-~~~{})]+[((+!~~{})<<-~[-~-~{}])],[-~[]-~[]-~!/!/+(-~[]-~[])*[-~[]-~[]]]+[(+!![[][[]]][23])],[((+!~~{})<<-~[-~-~{}])]+(C-~[-~-~{}]+[]+[[]][e]),(-~[]+[]+[[]][e])+(-~[]+[]+[[]][e])+(-~[-~-~{}]+[[]][e]),[((+!~~{})<<-~[-~-~{}])]+[((+!~~{})<<-~[-~-~{}])],(C-~[-~-~{}]+[]+[[]][e])+[(-~~~{}<<-~~~{})+(-~~~{}<<-~~~{})],[-~[-~(-~((-~{}|-~[]-~[])))]]+[-~[-~(-~((-~{}|-~[]-~[])))]],(-~[]+[]+[[]][e])+(-~[]+[]+[[]][e])+[-~[-~(-~((-~{}|-~[]-~[])))]],(C-~[-~-~{}]+[]+[[]][e])+[(-~~~{}<<-~~~{})+(-~~~{}<<-~~~{})],(-~[]+[]+[[]][e])+(-~[]+[]+[[]][e])+(-~[-~-~{}]+[[]][e]),[[1u]*(1u)]+[((+!~~{})<<-~[-~-~{}])]],[[[1u]*(1u)]],[(-~[-~-~{}]+[[]][e])+[-~[]-~[]-~!/!/+(-~[]-~[])*[-~[]-~[]]],(C-~[-~-~{}]+[]+[[]][e])+(-~[]+[]+[[]][e]),[-~[-~(-~((-~{}|-~[]-~[])))]]+[((+!~~{})<<-~[-~-~{}])]]];1A(1L N=e;N<1H.6;N++){1H[N]=t.8()[(-~[]+[]+[[]][e])](1H[N])};16 
1H.1z('')})()+';2=2j, h-1N-2d 1D:2a:10 w;H=/;'};M((22(){15{16 !!u.12;}1K(1E){16 z;}})()){1t.12('1C',N,z)}1r{1t.y('2n',N)}",
f = function (x, y) {
    var a = 0, b = 0, c = 0;
    x = x.split("");
    y = y || 99;
    while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
    return c
}, z = f(y.match(/\w/g).sort(function (x, y) {
    return f(x) - f(y)
}).pop());
while (z++) try {
    debugger;
    eval(y.replace(/\b\w+\b/g, function (y) {
        return x[f(y, z) - 1] || ("_" + y)
    }));
    break
} catch (_) {
}
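For anyone curious what the helper f in that script does: it reads a token as a number in base y, where the digits 0-9 and a-z are worth 0 to 35 (that is the parseInt(a, 36) branch) and A-Z are worth 36 to 61, and the result minus one is used as an index into the x word list. Below is a minimal Python sketch of just that decoding step, assuming ASCII alphanumeric tokens only. The name decode_token is made up for this illustration and is not part of the page's script.

```python
def decode_token(token: str, base: int = 99) -> int:
    """Python port of the JS helper f(x, y); assumes ASCII letters and digits only."""
    value = 0
    for ch in token:
        if "A" <= ch <= "Z":
            # JS: Math.abs(code - 77.5) < 13 is true exactly for 'A'..'Z',
            # and (code - 77.5) + 48.5 == code - 29, so 'A' -> 36 ... 'Z' -> 61
            digit = ord(ch) - 29
        else:
            # JS: parseInt(a, 36), so '0'..'9' -> 0..9 and 'a'..'z' -> 10..35
            digit = int(ch, 36)
        value = digit + base * value
    return value


# Single-character tokens do not depend on the base
print(decode_token("k"))  # 1-based index of the word that replaces token "k"
print(decode_token("H"))  # 1-based index of the word that replaces token "H"
```

In the word list above, index 20 is "cookie" and index 43 is "Path", which matches the fragments 1t.k and H=/ in the packed string. The base used for multi-character tokens is whatever value of z first makes the eval succeed, so it is left as a parameter here.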
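Well, this is kind of interesting. It does contain cookie-related words, but after a closer look it turned out to be useless here: I ran it in the browser console and nothing came of it.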

Nothing useful after hitting Enter:

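The variables it defines are still just those same variables, and before long my laptop fan went wild and CPU usage shot straight up, so I quickly closed that window.
OK, back to the network requests. After paging a few more times, I noticed that every page turn first sends a request to jssm: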

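Those of you who solved this challenge before would have seen logo here, not jssm. It has been revised, folks. If you test your old code, you'll find the solution you cracked back then no longer works either, hahaha.
Keep following my train of thought.
Look at what jssm returns: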

Then the response body:

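And although it is a POST, the request body is empty.
OK, so this is the key point. Let's request it from code and then fetch the data. Checking with code: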

What? Why is the cookie empty?
That's odd. Fine, let's fire up a packet capture tool and take a look:

Let's put the request headers from the capture tool into the code:
import requests

# Request headers copied from the packet capture
headers = {
    'content-length': '0',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
    'sec-ch-ua-platform': '"macOS"',
    'accept': '*/*',
    'origin': 'https://match.yuanrenxue.com',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://match.yuanrenxue.com/match/3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cookie': 'Hm_lvt_c99546cf032aaa5a679230de9a95c7db=1631721220',
    'cookie': 'Hm_lvt_9bcbda9cbf86757998a2339a0437208e=1631721221',
    'cookie': 'no-alert3=true',
    'cookie': 'tk=9019357195599414472',
    'cookie': 'Hm_lpvt_9bcbda9cbf86757998a2339a0437208e=1631767279',
    'cookie': 'Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1631777973',
}
req = requests.post('https://match.yuanrenxue.com/jssm', headers=headers, verify=False)
print(req.cookies)
print(req.text)
Run it:

Still empty.
The request sent by the code, as seen in the capture tool:

Still nothing. The request headers do look a bit different, though. Checking the browser again, the API is served over HTTP/2 (h2):

So how about trying httpx:

Still empty. In the capture, though, the request now does line up with the browser's:

So the HTTP protocol version is not the problem.
Where on earth is the problem, then?
Let me compare the browser's request headers with the code's:
# Browser
:method: POST
:authority: match.yuanrenxue.com
:scheme: https
:path: /jssm
content-length: 0
pragma: no-cache
cache-control: no-cache
sec-ch-ua: "Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"
sec-ch-ua-mobile: ?0
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36
sec-ch-ua-platform: "macOS"
accept: */*
origin: https://match.yuanrenxue.com
sec-fetch-site: same-origin
sec-fetch-mode: cors
sec-fetch-dest: empty
referer: https://match.yuanrenxue.com/match/3
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9
cookie: Hm_lvt_c99546cf032aaa5a679230de9a95c7db=1631721220
cookie: Hm_lvt_9bcbda9cbf86757998a2339a0437208e=1631721221
cookie: no-alert3=true
cookie: tk=9019357195599414472
cookie: Hm_lpvt_9bcbda9cbf86757998a2339a0437208e=1631767279
cookie: Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1631777973
# Code
:method: POST
:authority: match.yuanrenxue.com
:scheme: https
:path: /jssm
accept: */*
accept-encoding: gzip, deflate, br
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36
content-length: 0
pragma: no-cache
cache-control: no-cache
sec-ch-ua: "Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
origin: https://match.yuanrenxue.com
sec-fetch-site: same-origin
sec-fetch-mode: cors
sec-fetch-dest: empty
referer: https://match.yuanrenxue.com/match/3
accept-language: zh-CN,zh;q=0.9
cookie: Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1631777973
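I stared at it for a good five minutes and really couldn't find anything missing; only the cookie part differs a bit. As mentioned before, the cookies starting with Hm are generated by Baidu and mean nothing for scraping.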
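Comparing two header dumps by eye is error-prone, so here is a minimal sketch of doing it mechanically. It assumes the dumps are pasted as "name: value" lines like the ones above; the helper names parse_header_dump and compare_headers and the shortened sample values are made up for this illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def parse_header_dump(dump: str) -> List[Pair]:
    """Turn 'name: value' lines into ordered (name, value) pairs."""
    pairs = []
    for line in dump.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # Search from index 1 so HTTP/2 pseudo-headers such as ':path' keep their leading colon
        cut = line.index(": ", 1)
        pairs.append((line[:cut], line[cut + 2:]))
    return pairs


def compare_headers(browser: List[Pair], code: List[Pair]) -> Dict[str, object]:
    """Report pairs the code did not send and whether the name order matches."""
    return {
        "missing": [pair for pair in browser if pair not in code],
        "same_order": [name for name, _ in browser] == [name for name, _ in code],
    }


browser_dump = """
:method: POST
content-length: 0
user-agent: UA
accept: */*
cookie: no-alert3=true
cookie: tk=9019357195599414472
"""
code_dump = """
:method: POST
accept: */*
user-agent: UA
content-length: 0
cookie: tk=9019357195599414472
"""
report = compare_headers(parse_header_dump(browser_dump), parse_header_dump(code_dump))
print(report)
```

On this shortened sample the report flags both differences at once: one cookie pair that never left the client, and a different header order.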
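For a moment I was lost and felt that my years of scraping had been wasted: no encryption, the data shows up in the capture, yet the code just can't get it. What a blow to my pride.
Key point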
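Staring at those repeated header names in the browser capture, I fell into thought. A few seconds later something occurred to me: some sites validate submitted parameters that share one key name but carry different values. That is a countermeasure aimed at Python scrapers, because a Python dict cannot hold the same key twice. That in turn reminded me that some sites check the order of the headers, which calls for an ordered dict. Python dicts used to be unordered (insertion order has only been guaranteed by the language since Python 3.7), and as I recall, when you pass headers=headers, requests rearranges the header fields to some extent on its own.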
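The duplicate-key point is easy to check in isolation. A small sketch with shortened, illustrative header values: building a dict from repeated names keeps only the last value for each name, while a list or tuple of pairs keeps every entry in order.

```python
# Repeated header names, as a browser sends cookies over HTTP/2
browser_headers = [
    ("accept", "*/*"),
    ("cookie", "no-alert3=true"),
    ("cookie", "tk=9019357195599414472"),
]

# A dict keeps one entry per key; the later value overwrites the earlier one
as_dict = dict(browser_headers)

print(len(browser_headers), "pairs in the list")
print(len(as_dict), "entries in the dict")
print(as_dict["cookie"])
```

The same collapse happens with a dict literal that repeats a key, which is why the dict-based attempt earlier only ever sent the last cookie line.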
Let's test whether that is the problem. The code:
import requests


class Headers(object):
    # requests reads headers through .items(), so returning a tuple of
    # pairs here fixes the order in which the headers are handed over
    def items(self):
        return (
            ('content-length', '0'),
            ('pragma', 'no-cache'),
            ('cache-control', 'no-cache'),
            ('sec-ch-ua', '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"'),
            ('sec-ch-ua-mobile', '?0'),
            ('user-agent',
             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'),
            ('sec-ch-ua-platform', '"macOS"'),
            ('accept', '*/*'),
            ('origin', 'https://match.yuanrenxue.com'),
            ('sec-fetch-site', 'same-origin'),
            ('sec-fetch-mode', 'cors'),
            ('sec-fetch-dest', 'empty'),
            ('referer', 'https://match.yuanrenxue.com/match/3'),
            ('accept-encoding', 'gzip, deflate, br'),
            ('accept-language', 'zh-CN,zh;q=0.9'),
            ('cookie', 'Hm_lvt_c99546cf032aaa5a679230de9a95c7db=1631721220'),
            ('cookie', 'Hm_lvt_9bcbda9cbf86757998a2339a0437208e=1631721221'),
            ('cookie', 'no-alert3=true'),
            ('cookie', 'tk=9019357195599414472'),
            ('cookie', 'Hm_lpvt_9bcbda9cbf86757998a2339a0437208e=1631767279'),
            ('cookie', 'Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1631777973'),
        )


req = requests.post('https://match.yuanrenxue.com/jssm', headers=Headers(), verify=False)
print(req.cookies)
print(req.text)
Run it:

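It worked! Damn, finally a cookie. Why define a class whose items() returns a tuple? Because a tuple is immutable, so the order of the headers cannot be changed.
As seen in the capture tool: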

The request headers indeed were not altered.
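One plausible reading of why the class trick works, offered as an assumption that the article does not verify: a client that only ever calls headers.items() will accept any object with that method, and an object that is not a real Mapping gives the library no dict to merge its default headers into. The stdlib-only sketch below illustrates the duck-typing part. OrderedHeaders and consume are hypothetical names, and consume is a stand-in for a client, not requests itself.

```python
from collections.abc import Mapping


class OrderedHeaders:
    """Holds header pairs in a fixed order and exposes only items()."""

    def __init__(self, pairs):
        self._pairs = tuple(pairs)  # a tuple cannot be reordered in place

    def items(self):
        return self._pairs


def consume(headers):
    """Hypothetical stand-in for a client that only iterates headers.items()."""
    return [f"{name}: {value}" for name, value in headers.items()]


headers = OrderedHeaders([
    ("content-length", "0"),
    ("pragma", "no-cache"),
    ("accept", "*/*"),
])

print(isinstance(headers, Mapping))  # the object is not a dict-like Mapping
for line in consume(headers):
    print(line)
```

Whether a given requests version treats such an object this way is best confirmed with a packet capture, as done above.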
Now take this cookie and request the data API:

The data API request still fails. Let's try a session:

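Damn, still the same. What's wrong now? For a moment I was stumped again. The cookie is definitely fine, so look once more at the request headers of the data API: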

These headers differ slightly from the jssm ones, so let's give the data API its own set of headers:

Still the same. Most likely the data API checks header order too. Fine, rewrite those as a tuple as well:

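Sure enough, the data finally came through. What's left is to page through, collect all the data, and take the value that occurs most often.
Python implementation
Code: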
import requests
from collections import Counter

session = requests.session()


class Headers(object):
    # Headers for /jssm, in the order the browser sends them
    def items(self):
        return (
            ('content-length', '0'),
            ('pragma', 'no-cache'),
            ('cache-control', 'no-cache'),
            ('sec-ch-ua', '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"'),
            ('sec-ch-ua-mobile', '?0'),
            ('user-agent', 'yuanrenxue.project'),
            ('sec-ch-ua-platform', '"macOS"'),
            ('accept', '*/*'),
            ('origin', 'https://match.yuanrenxue.com'),
            ('sec-fetch-site', 'same-origin'),
            ('sec-fetch-mode', 'cors'),
            ('sec-fetch-dest', 'empty'),
            ('referer', 'https://match.yuanrenxue.com/match/3'),
            ('accept-encoding', 'gzip, deflate, br'),
            ('accept-language', 'zh-CN,zh;q=0.9'),
            ('cookie', 'Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1631762183'),
        )


class DataHeaders(object):
    # Headers for the data API, again in browser order
    def items(self):
        return (
            ('pragma', 'no-cache'),
            ('cache-control', 'no-cache'),
            ('sec-ch-ua', '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"'),
            ('accept', 'application/json, text/javascript, */*; q=0.01'),
            ('x-requested-with', 'XMLHttpRequest'),
            ('sec-ch-ua-mobile', '?0'),
            ('user-agent', 'yuanrenxue.project'),
            ('sec-ch-ua-platform', '"macOS"'),
            ('sec-fetch-site', 'same-origin'),
            ('sec-fetch-mode', 'cors'),
            ('sec-fetch-dest', 'empty'),
            ('referer', 'https://match.yuanrenxue.com/match/3'),
            ('accept-encoding', 'gzip, deflate, br'),
            ('accept-language', 'zh-CN,zh;q=0.9'),
            # ('cookie', 'sessionid=sessafwoijf1412e4'),
        )


def get_cookie():
    # POST to /jssm so that the server hands back the session cookie
    url = 'https://match.yuanrenxue.com/jssm'
    req = session.post(url, headers=Headers())
    return req


def fetch(page):
    # Every page needs a fresh /jssm request before the data request
    session_req = get_cookie()
    url = f'https://match.yuanrenxue.com/api/match/3?page={page}'
    req = requests.get(url, headers=DataHeaders(), cookies=session_req.cookies.get_dict())
    res = req.json()
    data = res.get('data')
    data = [temp.get('value') for temp in data]
    print('temp', data)
    return data


def get_answer():
    sum_list = []
    for i in range(1, 6):
        cont = fetch(i)
        if cont:
            sum_list.extend(cont)
    # most_common(1) returns [(value, count)] for the most frequent value
    top = Counter(sum_list).most_common(1)[0]
    print("Most frequent application number:", top[0])


get_answer()
Result:

Fill in the answer:

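Conclusion
Sites that validate the order of request headers are really uncommon. There is no need to agonize over how it checks the order; just follow its flow.
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/300651.html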
