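A faint light in the distance, walking beside me

Long time no see, everyone! Wow, Spring Festival went by fast! Happy New Year to you all!
"Diligence is the path up the mountain of books; hard work is the boat across the endless sea of learning." Time to get back to happily studying!
Today 小夜斗 is sharing some solid, practical material. Let's take off!
JS reverse engineering: scraping NetEase Cloud Music comments and running sentiment analysis with snownlp
Part 1: Reverse-engineering NetEase Cloud Music's request parameters to scrape comments
NetEase Cloud Music PC web URL: https://music.163.com/#/song?id=1817702136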

The comments we want to scrape are shown in the figure below:

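As usual, inspect the page and find the request URL that carries the comment data!
Filter the network panel by XHR and you will find it right away; see the screenshot below: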

Requesting this URL directly does not return the comments above, because the request carries two dynamically encrypted parameters: params and encSecKey.

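To request the URL that holds the comments, we have to build a form from these two parameters and send it as a POST request. So the job now is to work out how the two parameters are generated, obtain them, build our own form, and send the POST request to get the response!
Approach 1: Analyze the page source, find the functions that generate the parameters, and run the page's own code to get both values. (This usually means analyzing JavaScript, commonly called JS reverse engineering.)
Approach 2: Understand how the page's JS generates the two parameters, re-implement that logic in Python, and construct both parameters ourselves!
First, among the pile of requests, find the file that contains params and encSecKey.
Press Ctrl + F and search for params to find the two files marked by the arrows!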

Click through to the source of the JS file and pretty-print it; the formatted result is shown in Figure 2.
Figure 1:

Figure 2: after formatting, search for params to locate it, then analyze step by step how it is generated!

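Now for the crucial part: reverse-engineering how these two parameters are encrypted!
Step 1: find the JS code that generates the two parameters, as shown below:

Here is the extracted JS code: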
e5j.data = j5o.cr6l({
params: bWv4z.encText,
encSecKey: bWv4z.encSecKey
})
}
It looks like params and encSecKey are read from the encText and encSecKey properties of the bWv4z object.
Step 2: find out how the bWv4z object is created
var bWv4z = window.asrsea(JSON.stringify(i5n), bsK6E(["流淚", "強"]), bsK6E(XR1x.md), bsK6E(["愛心", "女孩", "驚恐", "大笑"]));
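window.asrsea is called with four arguments and returns the bWv4z object. At this point we could either analyze the four arguments first or find out how window.asrsea itself is defined; here we look at the latter first.
Step 3: how window.asrsea is defined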

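From the figure above we can see that window.asrsea is assigned from d. At first 小夜斗 had no idea what d was; a search shows that d is a function, as shown below:

So now we know: window.asrsea is just the function d assigned to another name (小夜斗 does not know JS well, but this is how he understands it), and the four arguments passed to window.asrsea are exactly the four arguments that d expects!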
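A small Python analogy may make this concrete. It is only an illustration: the names mirror the JS, but the function body is a placeholder and not NetEase's code. Assigning a function to another name neither copies nor calls it, so calling the new name runs the same function with the same arguments.

```python
def d(d_arg, e, f, g):
    # Placeholder body: echo the arguments so the call is visible.
    return {"received": (d_arg, e, f, g)}

# Equivalent in spirit to the JS assignment `window.asrsea = d`.
asrsea = d

result = asrsea("payload", "010001", "modulus-hex", "0CoJUm6Qyw8W8jud")
print(asrsea is d)         # both names refer to one function object
print(result["received"])  # the four arguments arrived in d unchanged
```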
Let's set a breakpoint, press F5 to reload the page, and look at the four arguments passed into d!

The four arguments of d are as follows:
d: "{"csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
Now look at the window.asrsea call again; its four arguments correspond to d, e, f and g above:
var bWv4z = window.asrsea(JSON.stringify(i5n), bsK6E(["流淚", "強"]), bsK6E(XR1x.md), bsK6E(["愛心", "女孩", "驚恐", "大笑"]));
The first argument, JSON.stringify(i5n), corresponds to d. It serializes i5n into a JSON string; let's set a breakpoint and see what i5n ends up being!

i5n = {csrf_token: "d4339865ec133c9a7d77a25389bc0265"}
// Comparing the two lines shows that JSON.stringify(i5n) serializes i5n into a JSON string
d: "{"csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
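If the parameters are rebuilt in Python (the second approach), the same compact string can be produced with the standard json module. This is a minimal sketch: JSON.stringify emits no spaces after separators, so json.dumps needs explicit separators to match the captured value byte for byte.

```python
import json

i5n = {"csrf_token": "d4339865ec133c9a7d77a25389bc0265"}

# JSON.stringify writes no space after ':' or ','; json.dumps adds one
# by default, so compact separators are passed explicitly.
d = json.dumps(i5n, separators=(",", ":"))
print(d)
```

For payloads containing non-ASCII text, passing ensure_ascii=False as well keeps the characters unescaped, which is what JSON.stringify does.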
小夜斗 switched to another song and found that all four arguments stay the same:
url: https://music.163.com/#/song?id=1820887593
d: "{"csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
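Set the breakpoint again, select page 2, and something new shows up!
Note: to break on page 2, set the breakpoint first. Pressing F5 reloads and jumps back to page 1; only then select page 2, and the arguments will be populated!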

d: "{"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
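Let's look at what the fields inside d mean:
"rid": "R_SO_4_1817702136"  the trailing number is the id from the page URL (changes with the song id)
"threadId": "R_SO_4_1817702136"  same as above (changes with the song id)
"pageNo": "2"  page number (variable)
"pageSize": "20"  number of comments per page (constant)
"cursor": "1613900247044"  apparently a 13-digit millisecond timestamp (variable)
"offset": "40"  offset (page number * 20) (variable)
"orderType": "1"  probably some kind of ordering type (constant)
"csrf_token": "d4339865ec133c9a7d77a25389bc0265"  also constant (it matches the __csrf value in the request cookie)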
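The field list above can be turned into a small helper that builds d for any page. This is a sketch under the observations made at the breakpoint; the function name and its parameters are made up for illustration, and the offset rule (page number * 20) is the one inferred above, not a documented API contract.

```python
import json
import time

def build_comment_payload(song_id, page_no, csrf_token, page_size=20, cursor=None):
    """Build the JSON string passed as `d` for one page of comments."""
    if cursor is None:
        # 13-digit millisecond timestamp, as seen at the breakpoint
        cursor = int(time.time() * 1000)
    thread = f"R_SO_4_{song_id}"
    payload = {
        "rid": thread,
        "threadId": thread,
        "pageNo": str(page_no),
        "pageSize": str(page_size),
        "cursor": str(cursor),
        "offset": str(page_no * page_size),  # page 2 -> "40", as captured
        "orderType": "1",
        "csrf_token": csrf_token,
    }
    # Compact separators reproduce the JSON.stringify layout.
    return json.dumps(payload, separators=(",", ":"))

print(build_comment_payload(1817702136, 2, "d4339865ec133c9a7d77a25389bc0265",
                            cursor=1613900247044))
```

With the captured cursor value, the printed string is identical to the d value shown for page 2.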
Right, now that we understand the four arguments, let's see what d actually does inside!
Step 4: what happens inside d

The extracted JS code is as follows:
function d(d, e, f, g) {
var h = {}
, i = a(16);
return h.encText = b(d, g),
h.encText = b(h.encText, i),
h.encSecKey = c(i, e, f),
h
}
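It defines an object h (a dictionary, in Python terms), and the variable i holds the value of a(16).
On closer inspection, a, b, c and d are all functions.

First, look inside function a: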
function a(a) {
var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
for (d = 0; a > d; d += 1)
e = Math.random() * b.length,
e = Math.floor(e),
c += b.charAt(e);
return c
}
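At first glance this looks baffling, but no rush: run the JS from PyCharm to see what the returned c is, and from that infer what the function does and how i is obtained.
JS copied straight from the page source fails in PyCharm because of some stray line-break characters, so 小夜斗 first pasted the code into the tool below, which seems to be a front-end editor: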

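The code for running the JS file from PyCharm:
pip install PyExecJS # install the library that runs JS code
# -*- coding: utf-8 -*-
# TODO: use clean JS (code copied from the web page contains stray line-break characters)
# Workaround: paste the code into HBUILD first, then run the JS
import execjs
import requests
js = open('./analysis_3.js', 'r', encoding='utf8').read()
aim = execjs.compile(js) # compile the JS into a callable context
data = aim.call('example') # call the JS function by name
print(data) # print the result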
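The result is shown below: every run produces a different 16-character string! So function a picks 16 characters at random from that long alphabet string and concatenates them, which matches what its loop does.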
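For the second approach (re-implementing the logic in Python), function a is easy to mirror with the standard library. A minimal sketch; the name random_key is chosen here for illustration.

```python
import random
import string

# Same 62-character alphabet as the string b inside the JS function a.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits

def random_key(length=16):
    """Return `length` characters drawn uniformly at random from ALPHABET."""
    return "".join(random.choice(ALPHABET) for _ in range(length))

print(random_key())  # different on every run
```

Like Math.random in the original, random.choice is not a cryptographic generator; it simply reproduces the behaviour observed on the page.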

The second run gives the result below:

Below is the JS extracted in the first round, to get the value of i (analysis_3.js):
/*
d: "{"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
*/
function d(d, e, f, g) {
var h = {}
, i = a(16);
return h.encText = b(d, g), // by this point the value of i has already been obtained
h.encText = b(h.encText, i),
h.encSecKey = c(i, e, f),
h
}
function a(a) {
var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
for (d = 0; a > d; d += 1)
e = Math.random() * b.length,
e = Math.floor(e),
c += b.charAt(e);
return c
}
function example(){
i = a(16);
return i;
}
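Back inside d, the next statement is h.encText = b(d, g), so we need function b, called with the arguments d and g. We can construct both of them ourselves, so no problem! Keep extracting JS!
First, pull out function b and take a look: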
/*
d: "{"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
*/
function d(d, e, f, g) {
var h = {}
, i = a(16);
return h.encText = b(d, g), // by this point the value of i has already been obtained
h.encText = b(h.encText, i),
h.encSecKey = c(i, e, f),
h
}
function a(a) {
var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
for (d = 0; a > d; d += 1)
e = Math.random() * b.length,
e = Math.floor(e),
c += b.charAt(e);
return c
}
function b(a, b) {
var c = CryptoJS.enc.Utf8.parse(b)
, d = CryptoJS.enc.Utf8.parse("0102030405060708")
, e = CryptoJS.enc.Utf8.parse(a)
, f = CryptoJS.AES.encrypt(e, c, {
iv: d,
mode: CryptoJS.mode.CBC
});
return f.toString()
}
function example(){
i = a(16);
h = {}
h.encText = b({"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}, "0CoJUm6Qyw8W8jud")
return h;
}
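Running this JS file from PyCharm raises an error: CryptoJS is not defined

The JS code is missing the CryptoJS library. No problem, just extract it from the page's JS source: search for the name, find the definition that looks right, and copy it over. Not hard!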
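As an alternative to copying CryptoJS out of the page, function b can be re-implemented in Python (the second approach from the beginning). The sketch below assumes the third-party pycryptodome package for AES; the helper names are made up here. It relies on two CryptoJS defaults: PKCS#7 padding, and toString() on the encryption result producing Base64.

```python
import base64

IV = b"0102030405060708"  # the same IV string as in the JS function b

def pkcs7_pad(data: bytes, block_size: int = 16) -> bytes:
    """Pad to a multiple of block_size (PKCS#7, CryptoJS's default padding)."""
    n = block_size - len(data) % block_size
    return data + bytes([n]) * n

def aes_cbc_b64(text: str, key: str) -> str:
    """AES-CBC encrypt `text` under a UTF-8 key and return Base64 text."""
    # Third-party dependency (pip install pycryptodome), imported lazily so
    # the padding helper above can be used without it.
    from Crypto.Cipher import AES
    cipher = AES.new(key.encode("utf-8"), AES.MODE_CBC, IV)
    encrypted = cipher.encrypt(pkcs7_pad(text.encode("utf-8")))
    return base64.b64encode(encrypted).decode("ascii")

print(pkcs7_pad(b"abc"))
```

Under these assumptions, the two calls inside d would become aes_cbc_b64(aes_cbc_b64(d, g), i), where d is the JSON string and i is the random 16-character key.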

The JS after the third round of extraction:
/*
d: "{"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
*/
function d(d, e, f, g) {
var h = {}
, i = a(16);
return h.encText = b(d, g), // by this point the value of i has already been obtained
h.encText = b(h.encText, i),
h.encSecKey = c(i, e, f),
h
}
function a(a) {
var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
for (d = 0; a > d; d += 1)
e = Math.random() * b.length,
e = Math.floor(e),
c += b.charAt(e);
return c
}
function b(a, b) {
var c = CryptoJS.enc.Utf8.parse(b)
, d = CryptoJS.enc.Utf8.parse("0102030405060708")
, e = CryptoJS.enc.Utf8.parse(a)
, f = CryptoJS.AES.encrypt(e, c, {
iv: d,
mode: CryptoJS.mode.CBC
});
return f.toString()
}
var CryptoJS = CryptoJS || function(u, p) {
var d = {}
, l = d.lib = {}
, s = function() {}
, t = l.Base = {
extend: function(a) {
s.prototype = this;
var c = new s;
a && c.mixIn(a);
c.hasOwnProperty("init") || (c.init = function() {
c.$super.init.apply(this, arguments)
}
);
c.init.prototype = c;
c.$super = this;
return c
},
create: function() {
var a = this.extend();
a.init.apply(a, arguments);
return a
},
init: function() {},
mixIn: function(a) {
for (var c in a)
a.hasOwnProperty(c) && (this[c] = a[c]);
a.hasOwnProperty("toString") && (this.toString = a.toString)
},
clone: function() {
return this.init.prototype.extend(this)
}
}
, r = l.WordArray = t.extend({
init: function(a, c) {
a = this.words = a || [];
this.sigBytes = c != p ? c : 4 * a.length
},
toString: function(a) {
return (a || v).stringify(this)
},
concat: function(a) {
var c = this.words
, e = a.words
, j = this.sigBytes;
a = a.sigBytes;
this.clamp();
if (j % 4)
for (var k = 0; k < a; k++)
c[j + k >>> 2] |= (e[k >>> 2] >>> 24 - 8 * (k % 4) & 255) << 24 - 8 * ((j + k) % 4);
else if (65535 < e.length)
for (k = 0; k < a; k += 4)
c[j + k >>> 2] = e[k >>> 2];
else
c.push.apply(c, e);
this.sigBytes += a;
return this
},
clamp: function() {
var a = this.words
, c = this.sigBytes;
a[c >>> 2] &= 4294967295 << 32 - 8 * (c % 4);
a.length = u.ceil(c / 4)
},
clone: function() {
var a = t.clone.call(this);
a.words = this.words.slice(0);
return a
},
random: function(a) {
for (var c = [], e = 0; e < a; e += 4)
c.push(4294967296 * u.random() | 0);
return new r.init(c,a)
}
})
, w = d.enc = {}
, v = w.Hex = {
stringify: function(a) {
var c = a.words;
a = a.sigBytes;
for (var e = [], j = 0; j < a; j++) {
var k = c[j >>> 2] >>> 24 - 8 * (j % 4) & 255;
e.push((k >>> 4).toString(16));
e.push((k & 15).toString(16))
}
return e.join("")
},
parse: function(a) {
for (var c = a.length, e = [], j = 0; j < c; j += 2)
e[j >>> 3] |= parseInt(a.substr(j, 2), 16) << 24 - 4 * (j % 8);
return new r.init(e,c / 2)
}
}
, b = w.Latin1 = {
stringify: function(a) {
var c = a.words;
a = a.sigBytes;
for (var e = [], j = 0; j < a; j++)
e.push(String.fromCharCode(c[j >>> 2] >>> 24 - 8 * (j % 4) & 255));
return e.join("")
},
parse: function(a) {
for (var c = a.length, e = [], j = 0; j < c; j++)
e[j >>> 2] |= (a.charCodeAt(j) & 255) << 24 - 8 * (j % 4);
return new r.init(e,c)
}
}
, x = w.Utf8 = {
stringify: function(a) {
try {
return decodeURIComponent(escape(b.stringify(a)))
} catch (c) {
throw Error("Malformed UTF-8 data")
}
},
parse: function(a) {
return b.parse(unescape(encodeURIComponent(a)))
}
}
, q = l.BufferedBlockAlgorithm = t.extend({
reset: function() {
this.i5n = new r.init;
this.sN2x = 0
},
vE3x: function(a) {
"string" == typeof a && (a = x.parse(a));
this.i5n.concat(a);
this.sN2x += a.sigBytes
},
lg9X: function(a) {
var c = this.i5n
, e = c.words
, j = c.sigBytes
, k = this.blockSize
, b = j / (4 * k)
, b = a ? u.ceil(b) : u.max((b | 0) - this.GX6R, 0);
a = b * k;
j = u.min(4 * a, j);
if (a) {
for (var q = 0; q < a; q += k)
this.qB1x(e, q);
q = e.splice(0, a);
c.sigBytes -= j
}
return new r.init(q,j)
},
clone: function() {
var a = t.clone.call(this);
a.i5n = this.i5n.clone();
return a
},
GX6R: 0
});
l.Hasher = q.extend({
cfg: t.extend(),
init: function(a) {
this.cfg = this.cfg.extend(a);
this.reset()
},
reset: function() {
q.reset.call(this);
this.lJ9A()
},
update: function(a) {
this.vE3x(a);
this.lg9X();
return this
},
finalize: function(a) {
a && this.vE3x(a);
return this.mG0x()
},
blockSize: 16,
lV0x: function(a) {
return function(b, e) {
return (new a.init(e)).finalize(b)
}
},
vC3x: function(a) {
return function(b, e) {
return (new n.HMAC.init(a,e)).finalize(b)
}
}
});
var n = d.algo = {};
return d
}(Math);
function example(){
i = a(16);
h = {}
h.encText = b({"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}, "0CoJUm6Qyw8W8jud")
return h;
}
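Argh, another error. The code is still incomplete, and once again something is missing:

Cannot read property 'encrypt' of undefined
Here CryptoJS.AES is undefined, because only the core of CryptoJS has been copied so far. Nothing for it but to keep hunting and copy-pasting code; reverse engineering takes this kind of patience…
Same routine: search for encrypt. This one is odd, though: there is no single function that defines it, and several pieces seem to have to run together, so 小夜斗 copied them all over!


The code is far too long to paste here; you can get it via the note at the end of the post!
Here is just the main function, since it is hard to explain otherwise: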
function example(){
i = a(16);
h = {}
h.encText = b({"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}, "0CoJUm6Qyw8W8jud")
return h;
}
/*
The h returned in the end: {'encText': 'ef07aa03b0c145b18ff7093a3ebe8428'}
*/
At this point, inside d:
function d(d, e, f, g) {
var h = {}
, i = a(16);
return h.encText = b(d, g), // by this point the value of i has already been obtained
h.encText = b(h.encText, i), // now executing this step
h.encSecKey = c(i, e, f),
h
}
The main function in the JS:
function example(){
i = a(16);
h = {}
h.encText = b({"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}, "0CoJUm6Qyw8W8jud")
h.encText = b(h.encText, i)
return h;
}
/*
The h returned in the end: {'encText': '09e1281f3bc506817b18eb63b2249c35dd295be25fe3de348d06930e670f81a9aa312bb4033e4f372964914b95d5f686'}
It is different on every run, because the random key i changes each time!
*/
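At this point, inside d:
function d(d, e, f, g) {
var h = {}
, i = a(16);
return h.encText = b(d, g), // by this point the value of i has already been obtained
h.encText = b(h.encText, i), // now executing this step
h.encSecKey = c(i, e, f), // now executing this step: call function c
h
}
The main function in the JS: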
function example(){
i = a(16);
h = {}
h.encText = b({"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}, "0CoJUm6Qyw8W8jud")
h.encText = b(h.encText, i)
h.encSecKey = c(i, "010001" , "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7")
return h;
}
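Running it raised an error because of yet another undefined function, followed by a whole series of similar errors. It took 小夜斗 more than an hour to fill in all the missing functions, so here are screenshots of the result!


In the end it turned out to be simple: copy everything from that ! up above all the way down to the function that the first error asked for!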
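The body of c is not shown in this post. The values passed to it, e = "010001" (65537 in hexadecimal) and the long hexadecimal f, look like an RSA public exponent and modulus, so a plausible reading is that c RSA-encrypts the random key i. The sketch below rests entirely on that assumption and is not confirmed NetEase code: it treats the key bytes as one big integer, computes m^e mod n with no padding, and zero-pads the hexadecimal result to the size of the modulus. The optional byte reversal is a further assumption about the page's RSA helper.

```python
def rsa_encrypt_hex(text: str, exponent_hex: str, modulus_hex: str,
                    reverse: bool = True) -> str:
    """Textbook RSA without padding: hex(m^e mod n), zero-padded to modulus size."""
    e = int(exponent_hex, 16)
    n = int(modulus_hex, 16)
    data = text.encode("utf-8")
    if reverse:
        data = data[::-1]  # assumption: the JS RSA helper reverses its input
    m = int.from_bytes(data, "big")
    width = (n.bit_length() + 7) // 8 * 2  # hex digits needed for the modulus
    return format(pow(m, e, n), "x").zfill(width)

# Toy numbers (n = 61 * 53 = 0xca1, e = 17 = 0x11) to show the arithmetic only;
# this is not a real key.
print(rsa_encrypt_hex("A", "11", "ca1", reverse=False))
```

If the assumption holds, the encSecKey step would be rsa_encrypt_hex(i, e, f) with the e and f captured at the breakpoint; comparing its output against the value produced by the page's own JS for the same i is the way to check it.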
In summary:
function d(d, e, f, g) {
var h = {}
, i = a(16);
return h.encText = b(d, g), // by this point the value of i has already been obtained
h.encText = b(h.encText, i), // now executing this step
h.encSecKey = c(i, e, f), // now executing this step
h
}
asrsea = d;
var bWv4z = asrsea(JSON.stringify(d), e, f, g);
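Core idea: in the end we just call the function d with four arguments, and the object (dictionary) it returns is assigned to bWv4z. The functions it calls, such as b, depend on other functions in turn; add whatever is missing!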
var data = {
params: bWv4z.encText,
encSecKey: bWv4z.encSecKey
}
The encText and encSecKey values in this object are the two encrypted parameters, params and encSecKey:
{'params': 'GeZ2hGQu0LGQlB4VQebjp6n74Oq4/32rvafzEjRm9YSwMU7MBR9hC8f4riioTrVZien4zLXoPv+AVMUy5YV0Z/57uz6MbnX6pcyS99OSzJcvbBzgM5oTFpS2faYdUCieyRYIWmna8c9SwS/yE+/EsaA3GMRpXoMhnV1ibdUY0/NUuDT5QpXjlNirryMJN0N66FvDT3yPS1aVEuCiEE9h3833g107ljF8vEkguSOBxi7eRMgT2W1nz9HQNJU5pniYsc8ntMeQESk4NblkNnEx6307E3uxMeAST2uJPchaTc4tb+TcDlZN/PLpz2OV62hJic9dNEfaxic7Jybvtn+I6lyyrD11x4xe4b7s915g5eo=', 'encSecKey': '8e8659fcff20f47c9823685b6b86cf976f7d7bfa9db447e3a8437839c0ed7837d529c9c7c245c9807f3277c85c6141f2621ad916c81d5db964eb56282d016142e4058db17aafb7bca8869b3fa537ba7422b347731526cc86e3c117277e3a569348ed51da09e5331ce3fad1c381c17fc0bef001a43cae46a22a48329c554c0f56'}
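Take these two parameters, build the form, and send the POST request,
and you get the comment data shown in the figure!

The Python code is as follows: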
# coding: utf8
# TODO: use clean JS (code copied from the web page contains stray line-break characters)
# Workaround: paste the code into HBUILD first, then run the JS
import execjs
import requests
import json
import os,sys,io
# TODO: works around "UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f44c' in position 3151: illegal multibyte sequence"
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')
js = open('./網易云.js', 'r', encoding='utf8').read()
aim = execjs.compile(js) # compile the JS into a callable context
data = aim.call('start') # call the JS entry function; returns params and encSecKey
print(data) # print the result
# comment API endpoint
url = "https://music.163.com/weapi/comment/resource/comments/get?csrf_token=d4339865ec133c9a7d77a25389bc0265"
# request headers
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74',
'cookie': 'WM_TID=6S%2BVxiZNM4xBURVBQUYqChbHQqRPtWJo; _iuqxldmzr_=32; ntes_kaola_ad=1; _ntes_nnid=4a37e82e4fab3d88933e0d5c379d440e,1593854146443; _ntes_nuid=4a37e82e4fab3d88933e0d5c379d440e; vinfo_n_f_l_n3=65b5de1ae241f0fb.1.2.1596075019898.1596077656335.1596082873052; UM_distinctid=1776c974d24410-0bb2c863a112f2-50391c40-1fa400-1776c974d25bd1; NMTID=00OlBs1TG5OHV33AkargjKWlPeNAmEAAAF3i9nSKg; __csrf=d4339865ec133c9a7d77a25389bc0265; MUSIC_U=bed21edcbf5c4808fefc260d37faef01b6d633afc3be6807d7f44dcfc98c3d5433a649814e309366; WM_NI=FH%2Bdu5qU8nodD0Tz91k%2F%2BSZTQaN1uVEUrJ6lgO3CNyrzxn2qHc4gqOO5ytsdhxlqnw%2Fr02fjvENiMGXplNN0Y%2BehIKexHjWU%2FJN05AfaCWE%2F9GO4sfosi0qw38qX3M8KaWs%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6ee92fc6292f5ff89ca448cb08fa3d54a839e8e84f56ff58b8491b673b2b98ab3b42af0fea7c3b92a8698e585f83fb38bf88fee7088f09c9bbc3ba9f1e5b4d96093bf82d2f85e878bbeabb568a591fda4cd73b2a6b88fb17b8cbdfc88f4458debb795f270aca8a492ed7dedae00d3e562b69b828ad04d9cb0a4afcd3c9bf5ab89cc41a7a8888bd65fed9682d1b84ba6bf8db8c825fb9981d5f26eadada7d1f47df1e8bfb9b247fc95afb8f237e2a3; JSESSIONID-WYYY=1bOvTxGdlne5T%2B%5CvUNf761crAP4mCs4kXFSp25NsXXnrNWkrO5Tk8p5ykpnsb9X%2B%2F1ofrjxduvuZfC2kPNgz40bpqqlCgYlf1f2cXl%5C1yRO8aZ0IlubEo7n7xs0AX%2FffBDdG5t12CuiOPHI%5CZPleGyhEmbbOJt%2Bkt6XxSohZCQiQmPWr%3A1613963805180; WEVNSM=1.0.0; WNMCID=iuuttf.1613962005402.01.0',
'referer': 'https://music.163.com/song?id=1817702136',
}
# POST request with the two encrypted parameters as form data
r = requests.post(url=url, headers=headers, data=data)
# status code 200: request succeeded
if r.status_code == 200:
    print("NetEase Cloud Music API request succeeded")
    text = r.text.encode('utf8', "ignore").decode('utf8', "ignore")
    # parse the JSON response
    content = json.loads(text)
    # print(content)
    # extract the comments (one copy each for csv and txt)
    user_list = content['data']['comments']
    # print(user_list)
    for user in user_list:
        # parent comment (the one being replied to)
        if user['beReplied'] is not None:
            for item in user['beReplied']:
                print(item['content'])
        # the reply itself
        # print(user['content'])
else:
    print("error")
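P.S.: reverse engineering basically means either extracting the source code or re-implementing the needed functionality in Python (or another language) to produce the encrypted parameters. Whatever function is missing, extract it into the JS file, one step at a time. In short, it is tedious!
The code above scrapes only one page. To fetch several pages, rewrite the JS entry function so it takes arguments; the fields inside d were described above!
小夜斗 tried this and hopes an expert can explain why scraping stops working at page 10! Comments are welcome.
小夜斗's Python code: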
# coding: utf8
# TODO: use clean JS (code copied from the web page contains stray line-break characters)
# Workaround: paste the code into HBUILD first, then run the JS
import execjs
import requests
import json
import os,sys,io
import time
# TODO: works around "UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f44c' in position 3151: illegal multibyte sequence"
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
page = int(input("Number of pages to fetch: "))
for i in range(1, page+1):
    print(f'i:{i}')
    # offset
    offset = str(i * 20)
    # offset = str(0)
    # 13-digit millisecond timestamp
    cursor = str(int(time.time() * 1000))
    js = open('./網易云.js', 'r', encoding='utf8').read()
    aim = execjs.compile(js) # compile the JS into a callable context
    data = aim.call('start', offset, str(i), cursor) # arguments: offset, current page number, cursor
    # print(data) # print the result
    # comment API endpoint
    url = "https://music.163.com/weapi/comment/resource/comments/get?csrf_token=d4339865ec133c9a7d77a25389bc0265"
    # request headers
    headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74',
'cookie': 'WM_TID=6S%2BVxiZNM4xBURVBQUYqChbHQqRPtWJo; _iuqxldmzr_=32; ntes_kaola_ad=1; _ntes_nnid=4a37e82e4fab3d88933e0d5c379d440e,1593854146443; _ntes_nuid=4a37e82e4fab3d88933e0d5c379d440e; vinfo_n_f_l_n3=65b5de1ae241f0fb.1.2.1596075019898.1596077656335.1596082873052; UM_distinctid=1776c974d24410-0bb2c863a112f2-50391c40-1fa400-1776c974d25bd1; NMTID=00OlBs1TG5OHV33AkargjKWlPeNAmEAAAF3i9nSKg; __csrf=d4339865ec133c9a7d77a25389bc0265; MUSIC_U=bed21edcbf5c4808fefc260d37faef01b6d633afc3be6807d7f44dcfc98c3d5433a649814e309366; WM_NI=d1%2F2ZoLk6b6YJnwdLJEo2E4vkr7u2MMvjXLw34zWu7E15cNm%2BUQL7G14j36Nw4Y1JMsQZzutuQr%2FBeI1EiTJIp7tOnFsp%2F63a6a8MFDWIwVCaeh7P9%2FjfqQJTf28V7XVY3I%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6eeb5b769b097faa2f64eae868eb2c54b979e9e85b645f58c8ab5c67a8fb197b5aa2af0fea7c3b92aa8efe1d3d96590b5fa83b5538ab2b985cf4287b1fea6b77ea9aef996b1478faaa6b9cf7df1ecb6b8f83d87effb84cd67e98b9fadb64df3eca6d4b546b5ec97d2c947b18f969bf880b78fbed0c16f8bef8b8ad7428bb0f995ce6686adc0d0b539a3eaafd9d04e88898798d549a3bba8d8d333edb6aa92d44298bafed9d53d9c8a97d4cc37e2a3; __root_domain_v=.163.com; _qddaz=QD.uerirb.s0cddl.klgiphn9; hb_MA-9F44-2FC2BD04228F_source=www.baidu.com; JSESSIONID-WYYY=Dnl9Q%5CsA%5CgFxr35Z6Yop8S4cgUw2XbSi63P5%5CylO1H9i%5CiJfKGDN7wYoc6nyGftQ6UtwmY5A6PTypX2mR147jRrZkF2zDqwuyQA4%2F%2F3CQnOhFKXz57z8WCCHeX9%5Cz%2BtVzo%2Byu%2B5un%5CR4af37PI%2Bxj7FNIbe9dCG9pRuMZ%2Bv4mGhyJG90%3A1613996422319; WEVNSM=1.0.0; WNMCID=sqowiw.1613994622490.01.0',
'referer': 'https://music.163.com/song?id=1817702136',
}
    # POST request with the two encrypted parameters as form data
    r = requests.post(url=url, headers=headers, data=data)
    # status code 200: request succeeded
    if r.status_code == 200:
        # print("NetEase Cloud Music API request succeeded")
        text = r.text.encode('utf8', "ignore").decode('utf8', "ignore")
        # parse the JSON response
        content = json.loads(text)
        # print(content)
        # extract the comments (one copy each for csv and txt)
        user_list = content['data']['comments']
        # print(user_list)
        for user in user_list:
            # parent comment (the one being replied to)
            if user['beReplied'] is not None:
                for item in user['beReplied']:
                    print(item['content'])
                    with open('星辰大海.txt', 'a', encoding="utf8") as f:
                        f.write(item['content'])
                        f.write('\n')
                        f.write(user['content'])
                        f.write('\n')
            # the reply itself
            print(user['content'])
        # pause between pages
        time.sleep(0.5)
    else:
        print("error")

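Part 2: Sentiment analysis with snownlp
If you have not installed this library yet, install it first!
pip install snownlp
小夜斗's impression is that this library is built specifically for Chinese text, including sentiment scoring. Since this post focuses on JS reverse engineering, the sentiment analysis is kept simple!
First, the basic usage of the library: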
from snownlp import SnowNLP
# text to segment
sentence = u'歡迎大家訂閱小夜斗的博客'
# build a SnowNLP object for the sentence
word_list = SnowNLP(sentence)
print(word_list) # <snownlp.SnowNLP object at 0x000001C49B6E8D68>
print(word_list.words) # ['歡迎', '大家', '訂閱', '小夜', '斗', '的', '博', '客']
# join the tokens with spaces
print(' '.join(word_list.words)) # 歡迎 大家 訂閱 小夜 斗 的 博 客
# sentiment score of the sentence (0 = negative, 1 = positive)
emotion_score = word_list.sentiments
print(emotion_score) # 0.2263967191890076
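The sentiment score for this sentence is not very high; perhaps that is because nothing was tuned. No big deal, today 小夜斗 is only introducing the basic usage of the library!

Let's look at another sentence:
# a classic confession line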
sentence_2 = "我喜歡你,我想和你在一起!"
word_list_2 = SnowNLP(sentence_2)
emotion_score_2 = word_list_2.sentiments
print(emotion_score_2) # 0.7144047112193267
Looks decent: a score of about 0.7, where the maximum is 1!
# a classic heartbreak line
sentence_3 = "我討厭你,我不想和你在一起!"
word_list_3 = SnowNLP(sentence_3)
emotion_score_3 = word_list_3.sentiments
print(emotion_score_3) # 0.3736912010075014
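小夜斗 is now crying in the bathroom: this hateful sentence scored higher than 小夜斗's welcoming one. Hmm! It must be because the model was not trained for this kind of text (snownlp's bundled sentiment model was trained mainly on product reviews, so it can score other text poorly). Yes, that must be it!
No big deal; next, let's generate a simple word cloud!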
# -*- coding: utf-8 -*-
import jieba
import sys
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# open the local TXT file
text = open('星辰大海.txt', encoding='utf8').read()
print(type(text))
# jieba segmentation; cut_all=False selects precise mode
wordlist = jieba.cut(text, cut_all=False)
# join the tokens with spaces so that WordCloud can split them
wl_space_split = " ".join(wordlist)
print(wl_space_split)
# generate the word cloud from the segmented text
wc = WordCloud(
    background_color="white", # background color
    max_words=200, # maximum number of words shown
    font_path=r'C:/Windows/Fonts/STXINGKA.TTF', # font (must contain Chinese glyphs)
    width=400, # width
    height=200, # height
    scale=10).generate(wl_space_split)
# display the word cloud
plt.imshow(wc)
# hide the x and y axes
plt.axis("off")
plt.show()
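星辰大海 is a song about two people running toward each other, a love song. Across 9 pages of comments, the words that appear most often are 喜歡 (like), 驚喜 (surprise) and 花 (flower). That's love!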
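A word cloud shows frequency only visually. To back a claim like "these words appear most often" with numbers, the tokens can be counted with collections.Counter. A minimal sketch; the function name and the sample token list are invented for illustration and stand in for jieba's output on the comment file.

```python
from collections import Counter

def top_words(tokens, n=3, min_len=1):
    """Return the n most common tokens, skipping empty or too-short ones."""
    stripped = (tok.strip() for tok in tokens)
    cleaned = [t for t in stripped if len(t) >= min_len]
    return Counter(cleaned).most_common(n)

# Hypothetical tokens; not the real comment data.
sample = ["喜歡", "花", "喜歡", "驚喜", "喜歡", "花", " "]
print(top_words(sample, n=2))
```

With the real data, the generator returned by jieba.cut(text, cut_all=False) can be passed in directly; setting min_len=2 drops single-character particles such as 的.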

Open the comment file 星辰大海.txt, run sentiment analysis on every comment in it, and plot a histogram to see which way the song's overall sentiment leans!
# -*- coding: utf-8 -*-
# 1. frequency of each sentiment score range
from snownlp import SnowNLP
import codecs
import os
source = open("星辰大海.txt", "r", encoding="utf8")
line = source.readlines()
sentimentslist = []
for i in line:
    comment = i.strip()
    # skip blank lines, which carry no text to score
    if not comment:
        continue
    s = SnowNLP(comment)
    print(s.sentiments)
    sentimentslist.append(s.sentiments)
import matplotlib.pyplot as plt
import numpy as np
# set a default font that contains Chinese glyphs
plt.rcParams['font.sans-serif'] = ['SimHei']
# render minus signs correctly with that font
plt.rcParams['axes.unicode_minus'] = False
# 100 equal-width bins covering the full score range [0, 1]
plt.hist(sentimentslist, bins=np.linspace(0, 1, 101), facecolor='g')
plt.xlabel('Sentiment score', size=12)
plt.ylabel('Number of comments with that score', size=12)
plt.title('星辰大海: overall sentiment', color="red", size=12)
plt.show()

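星辰大海 stresses that love is two people running toward each other. Judged subjectively, the song leans positive, though there is no shortage of people saddened by love. In the histogram above, most scores fall in [0.5, 1], with [0.8, 1] holding a large share; negative comments in [0, 0.2] number only one or two, perhaps from people whose hearts were broken by love!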
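The shares read off the histogram can also be computed directly. A minimal sketch; the helper name and the sample scores are made up for illustration, and in practice the sentimentslist built by the script above would be passed in.

```python
def share_in_range(scores, low, high):
    """Fraction of scores with low <= score <= high (0.0 for an empty list)."""
    if not scores:
        return 0.0
    hits = sum(1 for s in scores if low <= s <= high)
    return hits / len(scores)

# Made-up scores for illustration only; not the real comment data.
scores = [0.95, 0.85, 0.6, 0.55, 0.1]
for low, high in [(0.0, 0.2), (0.5, 1.0), (0.8, 1.0)]:
    print(f"[{low}, {high}]: {share_in_range(scores, low, high):.0%}")
```

The intervals overlap on purpose, mirroring the ones quoted above, so the shares do not add up to 100%.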
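For more on data mining, here is a real gem of a blogger:
https://blog.csdn.net/eastmount/article/details/52577215
That's it for this post. If you found it interesting, feel free to like and bookmark it!
Source code and materials: follow the WeChat official account "夜斗小神社" and reply "007網易資料" in its chat
- On this planet, you matter; please cherish what makes you precious! ~~~夜斗小神社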

