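The text and images in this article come from the internet and are for study and discussion only, with no commercial purpose. Copyright belongs to the original author; if there is any problem, please contact us promptly so we can deal with it.

The following article comes from 青燈編程, author: 清風

Python crawler, advanced: a hands-on anti-anti-scraping case, scraping Pear Video (梨視頻). The video walkthrough is at: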

https://www.bilibili.com/video/BV1mK4y1E75Y/

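Introduction

There are plenty of tutorials online about scraping Pear Video, but none of the older ones work any more, because the site was updated long ago. Many readers have also been asking how to scrape this site, hence this walkthrough.

Development environment

Python 3.6
PyCharm

Target page analysis

Getting a single video URL

Open the detail page of the first video; the developer tools show the video address: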

https://video.pearvideo.com/mp4/adshort/20201221/1608712845841-15540331_adpkg-ad_hd.mp4

The contId parameter in the data-interface URL (videoStatus.jsp) is the video's ID.

def get_video_url(video_id):
    data_url = 'https://www.pearvideo.com/videoStatus.jsp?contId={}&mrd=0.5606814943122209'.format(video_id)
    headers_1 = {
        'Referer': 'https://www.pearvideo.com/video_{}'.format(video_id),
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=data_url, headers=headers_1)
    html_data = response.json()
    video_url = html_data['videoInfo']['videos']['srcUrl']
    suffix = video_url.split('-')[1]
    date = video_url.split('/')[-2]
    new_url = 'https://video.pearvideo.com/mp4/adshort/{}/cont-{}-{}-ad_hd.mp4'.format(date, video_id, suffix)
    return new_url

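Notes:

1. The request headers must include a Referer (the site's hotlink protection); otherwise the interface returns no data.

2. The video URL returned here is not the real one; opening it directly gives a 404.

# Video URL returned by the data interface
'''
https://video.pearvideo.com/mp4/adshort/20201222/1608723660585-15542059_adpkg-ad_hd.mp4
'''
# Real video playback URL
'''
https://video.pearvideo.com/mp4/adshort/20201222/cont-1712976-15542059_adpkg-ad_hd.mp4
'''

So the URL has to be reassembled: the leading numeric segment of the file name is replaced with cont-<video ID>.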
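
The reassembly can be isolated into a small pure function, which makes it easy to check against the two example URLs above without touching the network. This is a minimal sketch: the name build_real_url is hypothetical, and it assumes srcUrl always has the form <prefix>/<digits>-<rest>, which is only what the examples in this article show.

```python
def build_real_url(src_url, video_id):
    """Swap the leading numeric segment of the file name for 'cont-<video_id>'."""
    # Split off the file name, e.g. '1608723660585-15542059_adpkg-ad_hd.mp4'
    prefix, _, filename = src_url.rpartition('/')
    # Separate the leading numeric segment from the rest of the name
    stamp, dash, rest = filename.partition('-')
    if not dash or not stamp.isdigit():
        # Assumption violated: fail loudly instead of building a bad URL
        raise ValueError('unexpected srcUrl format: {}'.format(src_url))
    return '{}/cont-{}-{}'.format(prefix, video_id, rest)


src = 'https://video.pearvideo.com/mp4/adshort/20201222/1608723660585-15542059_adpkg-ad_hd.mp4'
print(build_real_url(src, '1712976'))
# https://video.pearvideo.com/mp4/adshort/20201222/cont-1712976-15542059_adpkg-ad_hd.mp4
```

Raising on an unexpected format is a design choice: if the site changes its naming scheme again, the script stops instead of silently downloading 404 pages.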

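Getting the video IDs

From the data-interface URL we know that once we have each video's ID, we can download every video.

The IDs have to be looked up in the ranking list.

The ranking page contains the video IDs we need, but it only shows 10 videos; more are loaded as the page is scrolled down. So, the usual trick: clear the captured requests in the developer tools, then scroll the page.

A number of related requests then show up. Comparing them,

two parameters turn out to change between requests:

start: an arithmetic sequence, increasing by 10 each time

sort: 9, 14, 21, 29, 38, 46, 54, 62, 69, 74, changing with no obvious pattern

So the only option is to hold the irregular parameter constant at 9, vary only start, and check whether data still comes back.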

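With start=0 in the URL

the last video returned is <2020回聲：50段現場聲音回顧這一年>, which should be at position 9 in the ranking, although the number shown here is 18.

With start=10 in the URL

the first video is <被害者家屬談勞榮枝案首日庭審>, which should be at position 10 in the ranking, directly following the previous batch.

With start=20 in the URL

the results again continue from where the previous batch ended.

So changing only the start parameter (with sort held at 9) is enough to get the IDs and titles of every video in the ranking.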
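
The paging scheme described above (start stepping by 10, sort held at 9) can be sketched as a URL generator. The endpoint and query parameters are the ones used in the complete listing later in this article; the function name ranking_page_urls is hypothetical, the default of 74 entries is the ranking size reported at the end of the article, and reusing the same mrd value for every request is an assumption.

```python
import math
from urllib.parse import urlencode

BASE_URL = 'https://www.pearvideo.com/popular_loading.jsp'


def ranking_page_urls(total=74, page_size=10, sort=9, mrd='0.9278261602425337'):
    """Build one ranking URL per page, varying only the start offset."""
    pages = math.ceil(total / page_size)  # 74 entries -> 8 pages
    urls = []
    for page in range(pages):
        query = urlencode({
            'reqType': 1,
            'categoryId': '',            # left empty, as in the captured request
            'start': page * page_size,   # 0, 10, 20, ...
            'sort': sort,                # held constant
            'mrd': mrd,
        })
        urls.append('{}?{}'.format(BASE_URL, query))
    return urls


for url in ranking_page_urls():
    print(url)
```

With 74 entries this yields 8 URLs, from start=0 to start=70, so no request is spent on pages past the end of the ranking.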

def get_video_id(page_url):
    html_data = get_response(page_url).text
    # Each ranking entry links to "video_<id>"; capture the numeric ID
    video_ids = re.findall(r'<a href="video_(\d+)" class="actplay">', html_data)
    title = re.findall(r'<h2 class="popularem-title">(.*?)</h2>', html_data)
    video_info = zip(video_ids, title)
    for i in video_info:
        video_title = i[1]
        video_id = i[0]
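
The extraction step can be tried offline on a small HTML fragment. This is only a sketch: the patterns assume the ranking markup uses relative links of the form href="video_<id>" with class actplay and titles inside h2 elements with class popularem-title, and the sample HTML and the name parse_ranking are invented for illustration.

```python
import re

ID_PATTERN = re.compile(r'<a href="video_(\d+)" class="actplay">')
TITLE_PATTERN = re.compile(r'<h2 class="popularem-title">(.*?)</h2>')


def parse_ranking(html):
    """Return a list of (video_id, title) pairs found in a ranking fragment."""
    ids = ID_PATTERN.findall(html)
    titles = TITLE_PATTERN.findall(html)
    if len(ids) != len(titles):
        # zip() would silently drop or misalign entries, so stop here
        raise ValueError('found {} ids but {} titles'.format(len(ids), len(titles)))
    return list(zip(ids, titles))


# Made-up fragment shaped like the markup the patterns expect
SAMPLE_HTML = '''
<li><a href="video_1712976" class="actplay"></a>
<h2 class="popularem-title">Sample title A</h2></li>
<li><a href="video_1712977" class="actplay"></a>
<h2 class="popularem-title">Sample title B</h2></li>
'''

print(parse_ranking(SAMPLE_HTML))
# [('1712976', 'Sample title A'), ('1712977', 'Sample title B')]
```

The length check matters because the IDs and titles are matched by two independent patterns and paired purely by position.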

Saving the data

def save(video_url, video_title):
    video_content = get_response(video_url).content
    filename = 'video\\' + video_title + '.mp4'
    with open(filename, mode='wb') as f:
        f.write(video_content)
        print('Saving:', video_title)

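Saving raised an error,

because the title contained special characters and the file could not be created.

As mentioned before, a file cannot be created when its name contains certain special characters, so a regular expression is used to replace those characters in the title.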

def change_title(title):
    pattern = re.compile(r"[\/\\\:\*\?\"\<\>\|]")  # '/ \ : * ? " < > |'
    new_title = re.sub(pattern, "_", title)  # replace with an underscore
    return new_title
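
The save function above writes to 'video\\' + title, which assumes Windows path separators and an existing video folder. A possible variant, sketched below, combines the title cleanup with portable path building. The names safe_filename and video_path and the 'untitled' fallback are hypothetical additions, not part of the original script.

```python
import os
import re

# Same character set as change_title: / \ : * ? " < > |
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement='_'):
    """Replace characters that are not allowed in file names."""
    cleaned = INVALID_CHARS.sub(replacement, title).strip()
    # Fall back to a placeholder if nothing usable is left
    return cleaned or 'untitled'


def video_path(title, directory='video'):
    """Create the target folder if needed and return the .mp4 path."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, safe_filename(title) + '.mp4')


print(safe_filename('a/b:c?'))
# a_b_c_
```

In save, the filename line could then become filename = video_path(video_title), which works on any operating system and does not fail when the folder is missing.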

Complete implementation

import requests
import re
import threading

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}


def get_response(html_url):
    response = requests.get(url=html_url, headers=headers)
    return response


def change_title(title):
    pattern = re.compile(r"[\/\\\:\*\?\"\<\>\|]")  # '/ \ : * ? " < > |'
    new_title = re.sub(pattern, "_", title)  # replace with an underscore
    return new_title


def save(video_url, video_title):
    video_content = get_response(video_url).content
    filename = 'video\\' + video_title + '.mp4'
    with open(filename, mode='wb') as f:
        f.write(video_content)
        print('Saving:', video_title)
        print(video_url)


def get_video_url(video_id):
    data_url = 'https://www.pearvideo.com/videoStatus.jsp?contId={}&mrd=0.5606814943122209'.format(video_id)
    headers_1 = {
        'Referer': 'https://www.pearvideo.com/video_{}'.format(video_id),
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=data_url, headers=headers_1)
    html_data = response.json()
    video_url = html_data['videoInfo']['videos']['srcUrl']
    suffix = video_url.split('-')[1]
    date = video_url.split('/')[-2]
    new_url = 'https://video.pearvideo.com/mp4/adshort/{}/cont-{}-{}-ad_hd.mp4'.format(date, video_id, suffix)
    return new_url


def main(page_url):
    html_data = get_response(page_url).text
    # Each ranking entry links to "video_<id>"; capture the numeric ID
    video_ids = re.findall(r'<a href="video_(\d+)" class="actplay">', html_data)
    title = re.findall(r'<h2 class="popularem-title">(.*?)</h2>', html_data)
    video_info = zip(video_ids, title)
    for i in video_info:
        video_title = i[1]
        video_id = i[0]
        video_url = get_video_url(video_id)
        new_title = change_title(video_title)
        save(video_url, new_title)


if __name__ == '__main__':
    for page in range(0, 101, 10):
        url = 'https://www.pearvideo.com/popular_loading.jsp?reqType=1&categoryId=&start={}&sort=9&mrd=0.9278261602425337'.format(page)
        main_thread = threading.Thread(target=main, args=(url,))
        main_thread.start()

The ranking contains 74 entries in total.
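
The script above starts one thread per ranking page. A bounded thread pool is a common alternative that caps how many downloads run at once. The sketch below uses the standard library's ThreadPoolExecutor; run_pages and fake_worker are hypothetical names, and fake_worker merely stands in for main so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pages(page_urls, worker, max_workers=4):
    """Call worker(url) for every page URL, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order; list() also re-raises
        # any exception that occurred inside a worker
        return list(pool.map(worker, page_urls))


def fake_worker(url):
    # Stand-in for main(page_url) from the script above
    return 'done: ' + url


print(run_pages(['page-0', 'page-10'], fake_worker))
# ['done: page-0', 'done: page-10']
```

In the real script, run_pages(urls, main) would replace the loop that creates the threading.Thread objects.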

Please credit the source when reposting. Article link: https://www.uj5u.com/houduan/239892.html
