One post, seven crawler case studies, enough to keep you studying for days: review article 4 of "100 Crawler Examples"

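Contents

- Case 13: Scraping memes from Doutula (斗圖啦)
- Case 14: Downloading PDF e-books
- Case 15: Collecting government-citizen interaction data
- Case 16: The 500px photographer community
- Case 17: Scraping CSDN blog data
- Case 18: Jandan (煎蛋網) XXOO
- Case 19: Scraping course data from 51CTO Academy
- Today's review summary
- Time to bookmark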

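Case 13: Scraping memes from Doutula (斗圖啦)

Original article: https://dream.blog.csdn.net/article/details/83020175

I first scraped this site back in 2018, and to my surprise the address still opens today.

I re-tested the code and it runs without problems, but I uploaded a copy to codechina anyway.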

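Case 14: Downloading PDF e-books

Original article: https://dream.blog.csdn.net/article/details/83151879

When I first scraped this site, I (Eraser) joked that it was a refreshingly clean little site without a single ad. Three years later, the site is gone. Revenue really is what keeps a site alive.

With no other choice, and with great excitement, I went and found another refreshingly clean site.

It offers free technical books, which makes it even more interesting: everything on it is written for technical readers.

https://www.freetechbooks.com/topics. Because the server is hosted overseas, downloading the PDFs ran into some problems when I scraped the site, so this review does not extend the case any further.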
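The download problems mentioned above are typical of slow or unstable links to an overseas server. The following is only a minimal sketch of one common mitigation, not the code from the original case: stream the file in chunks, write to a temporary `.part` file, and retry a few times. The function name `download_pdf` and the parameters `retries`, `timeout`, and `backoff` are hypothetical names chosen for illustration.

```python
import os
import time


def download_pdf(url, dest_path, session=None, retries=3, timeout=30, backoff=2.0):
    """Stream a PDF to disk, retrying when the transfer fails.

    `session` is any object with a requests-style get(); when omitted,
    a requests.Session is created (requests is imported lazily).
    """
    if session is None:
        import requests  # third-party: pip install requests
        session = requests.Session()

    tmp_path = dest_path + ".part"
    for attempt in range(1, retries + 1):
        try:
            # stream=True avoids loading the whole file into memory
            with session.get(url, stream=True, timeout=timeout) as resp:
                resp.raise_for_status()
                with open(tmp_path, "wb") as fh:
                    for chunk in resp.iter_content(chunk_size=64 * 1024):
                        if chunk:
                            fh.write(chunk)
            # Rename only after a complete download, so a half-written
            # file never appears under the final name
            os.replace(tmp_path, dest_path)
            return True
        except Exception as exc:
            print("attempt {} failed: {}".format(attempt, exc))
            if attempt < retries:
                time.sleep(backoff * attempt)  # wait a little longer each time

    # Every attempt failed: clean up any partial file
    if os.path.exists(tmp_path):
        os.remove(tmp_path)
    return False
```

Calling `download_pdf(pdf_url, "book.pdf")` returns `True` on success and `False` once all attempts are used up. Whether three attempts and a 30-second timeout are enough for this particular site is an assumption that would need testing.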


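Case 15: Collecting government-citizen interaction data

Reviewing this case gave me a jolt: luckily things back then were not the way they are now. The site has since been restyled in a very red color.

Open the site at https://www.sjz.gov.cn/col/1597714516660/index.html. The core data is embedded through an iframe.

Choose "View frame source" to get to the real page.

Search the frame source for the real address and scrape that directly. This means the selenium used in the original case can be replaced with ordinary HTTP requests.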
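The iframe step can also be done in code rather than by hand in the browser. Below is a rough sketch, assuming the `<iframe>` tag is present in the static HTML of the outer page (if a script inserts it, this will find nothing). The names `IframeSrcParser`, `find_iframe_urls`, and `fetch_frame_html` are hypothetical helpers, not part of the original case.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_urls(page_html, page_url):
    """Return absolute URLs for all iframes found in page_html."""
    parser = IframeSrcParser()
    parser.feed(page_html)
    # src is often relative, so resolve it against the outer page URL
    return [urljoin(page_url, src) for src in parser.sources]


def fetch_frame_html(page_url):
    """Fetch the outer page, then the first iframe it embeds."""
    import requests  # third-party; imported here so the parser works without it

    outer = requests.get(page_url, timeout=10)
    frame_urls = find_iframe_urls(outer.text, page_url)
    if not frame_urls:
        return None  # no static iframe found
    inner = requests.get(frame_urls[0], timeout=10)
    inner.encoding = inner.apparent_encoding  # guess the page encoding
    return inner.text
```

`fetch_frame_html` takes the first iframe it finds; a page with several frames would need a check on which one actually carries the data.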


Case 16: The 500px photographer community

In a word: the API endpoints are all still there, and the site is doing fine.

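Case 17: Scraping CSDN blog data

This case, of all things, scrapes CSDN itself. Talk about turning on your own side.

I looked back at it, and the reason turned out to be simply that the day was 1024 (October 24, Programmers' Day).

Checking the API shows that the shown_offset parameter is no longer in use (it remains in the URL, fixed at 0). The endpoint now looks like this:

https://blog.csdn.net/api/articles?type=more&category=python&shown_offset=0

As for the key request parameter: testing shows that uuid_tt_dd is the only value in the cookie that affects the result. So when fetching data, either read that value dynamically from the cookie or enter it by hand.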

import time

import requests

START_URL = "https://blog.csdn.net/api/articles?type=more&category=home&shown_offset=0"
HEADERS = {
    "Accept": "application/json",
    "Host": "www.csdn.net",
    "Referer": "https://www.csdn.net/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    # Replace the placeholder with the uuid_tt_dd value from your own cookie
    "cookie": "uuid_tt_dd=10_<take this from your own cookie>;",
}


def get_url(url):
    # Loop instead of recursing, so a long run cannot hit Python's recursion limit
    while True:
        try:
            res = requests.get(url, headers=HEADERS, timeout=3)
            articles = res.json()
            if not articles["status"]:
                break
            need_data = articles["articles"]
            if not need_data:
                break
            # Print the title of the first record
            print(need_data[0]["title"])
            print("Fetched {} records".format(len(need_data)))
            # shown_offset is no longer read from the response;
            # the same start URL is simply requested again
            time.sleep(1)
            url = START_URL
        except Exception as e:
            print(e)
            print("Pausing for 60s; the failing URL was {}".format(url))
            time.sleep(60)  # after an error, wait 60s and then retry


if __name__ == "__main__":
    get_url(START_URL)
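The text above says uuid_tt_dd can either be read dynamically from the cookie or typed in by hand. One possible way to combine the two is sketched below. It rests on an untested assumption: that the site sets uuid_tt_dd on a plain first visit to its home page. The helper names `build_headers`, `fetch_uuid_tt_dd`, and `resolve_uuid` are hypothetical.

```python
def build_headers(uuid_tt_dd, base_headers=None):
    """Return a copy of base_headers carrying only the uuid_tt_dd cookie."""
    headers = dict(base_headers or {})
    headers["cookie"] = "uuid_tt_dd={};".format(uuid_tt_dd)
    return headers


def fetch_uuid_tt_dd(home_url="https://www.csdn.net/"):
    """Visit the home page and read uuid_tt_dd from the cookies it sets.

    Assumption: the cookie is set by the server on a first visit.
    Returns None when it is not.
    """
    import requests  # third-party; imported lazily

    session = requests.Session()
    session.get(home_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=5)
    return session.cookies.get("uuid_tt_dd")


def resolve_uuid(fetch=fetch_uuid_tt_dd, prompt=input):
    """Try the automatic lookup first, then fall back to manual input."""
    value = fetch()
    if value:
        return value
    return prompt("Paste your uuid_tt_dd value: ").strip()
```

With these helpers, `build_headers(resolve_uuid(), HEADERS)` would produce the header dictionary for the request. If the automatic lookup turns out not to work, copying the value from the browser's developer tools remains the reliable route.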

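Case 18: Jandan (煎蛋網) XXOO

This site has been renamed 隨手拍 (roughly "casual snapshots"), which is quite a change. The case still uses selenium; to learn it, see 《滾雪球學 Python 番外篇（完結）》.

So this case is not updated in this review. The site is still reachable, so the core techniques are essentially unchanged.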

Case 19: Scraping course data from 51CTO Academy

Open the address from the original case: the UI has changed, but the data is still there.

https://edu.51cto.com/courselist/index-p1.html?edunav=
https://edu.51cto.com/courselist/index-p2.html?edunav=
https://edu.51cto.com/courselist/index-p3.html?edunav=

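I have to say, when I originally scraped it, 51CTO had only a little over 10,000 courses. Three years later, that number has doubled.

With a small change to the code logic the case still works. To make testing easier, only the core part of the code is shown.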

from requests_html import AsyncHTMLSession   # import the async session class

asession = AsyncHTMLSession()

BASE_URL = "https://edu.51cto.com/courselist/index-p{}.html?edunav="

async def get_html():
    for i in range(1,3):
        r = await asession.get(BASE_URL.format(i))   # await the async request
        get_item(r.html)

def get_item(html):
    c_list = html.find('.Content-left',first=True)
    if c_list:

        items = c_list.find('li[class^=li_4n]')
        print(items)
        for item in items:
            title = item.find("div[class='title']",first=True).text # course title
            href = item.find('a',first=True).attrs["href"]  # course link
            # class_time = item.find("div.course_infos>p:eq(0)",first=True).text
            # study_nums = item.find("div.course_infos>p:eq(1)", first=True).text
            # stars = item.find("div.course_infos>div", first=True).text
            # course_target = item.find(".main>.course_target", first=True).text
            # price = item.find(".main>.course_payinfo h4", first=True).text
            # dict = {
            #     "title":title,
            #     "href":href,
            #     "class_time":class_time,
            #     "study_nums":study_nums,
            #     "stars":stars,
            #     "course_target":course_target,
            #     "price":price
            # }
            # print(dict)
            print(title,href)

    else:
        print("Failed to parse the data")

if __name__ == '__main__':
    result = asession.run(get_html)
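The code above awaits the pages one after another inside a single coroutine, so the requests still run in sequence. As a hedged sketch of a concurrent variant, the helper below starts all page requests together with a cap on how many are in flight. `page_urls`, `fetch_all`, and `limit` are hypothetical names, and `fetch` is any awaitable-returning function. Passing `asession.get` should work, since the original code already awaits it, but that combination is untested here.

```python
import asyncio

BASE_URL = "https://edu.51cto.com/courselist/index-p{}.html?edunav="


def page_urls(first, last):
    """Build the list-page URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(i) for i in range(first, last + 1)]


async def fetch_all(urls, fetch, limit=3):
    """Call fetch(url) for every URL with at most `limit` requests in flight.

    Results come back in the same order as `urls`.
    """
    sem = asyncio.Semaphore(limit)

    async def guarded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(guarded(u) for u in urls))


# Possible use with the session from the code above (not run here):
#
# async def get_html():
#     responses = await fetch_all(page_urls(1, 2), asession.get)
#     for r in responses:
#         get_item(r.html)
#
# asession.run(get_html)
```

A small `limit` keeps the load on the site modest. How much concurrency the site tolerates is not something the original case states.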


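Today's review summary

Seven cases reviewed today. Most of the sites are still online and lively, and the crawlers still work. Keep studying!

A conscientious blogger, still online after 3 years.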

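Time to bookmark

Once this post passes 400 bookmarks, I will publish the next one right away.

Today is day 193 of 200 of continuous writing.
Feel free to follow me, like, comment, and bookmark.

More good stuff

Python crawler 100 examples: tutorial index post (subscribe now)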

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/291689.html
