[Python crawler] Scraping Zhihu's public collections with Python

Preface

A look at how to scrape the public collections (favorites folders) on Zhihu with Python.

Content

Attempts

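First method
At first I made the page requests with Python's requests library. Requesting the overview page of your collections still returns data. Its URL is https://www.zhihu.com/people/xxx/collections, where the xxx part is whatever appears in your own Zhihu profile URL. For a specific collection page, however, such as https://www.zhihu.com/collection/3341994xx, requests does not return the content. This is probably because that page is loaded dynamically by JavaScript (getting the data directly would require reverse-engineering the JS), so a plain request to this link does not return the content you want.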
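One way to confirm that a page is rendered by JavaScript is to check whether a marker you expect in the rendered page appears in the raw HTML at all. The following is a minimal sketch, not the author's code: the helper names are hypothetical, the marker ContentItem-title is an assumption taken from the class name that appears in the script later in this post, and the User-Agent header is only a placeholder.

```python
def contains_marker(html: str, marker: str = "ContentItem-title") -> bool:
    """Return True if the marker string appears anywhere in the raw HTML."""
    return marker in html


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page with requests and return the response body as text."""
    import requests  # third-party; imported here so the helper above has no dependency

    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},  # placeholder header, an assumption
        timeout=timeout,
    )
    response.raise_for_status()
    return response.text


def check_collection_page(url: str) -> None:
    """Print whether the item titles are present in the static HTML."""
    html = fetch_html(url)
    if contains_marker(html):
        print("Titles are in the static HTML; requests may be enough.")
    else:
        print("Titles are missing; the page is probably rendered by JavaScript.")
```

Calling check_collection_page with the URL of one of your own collection pages runs the check. A missing marker is only a hint, since the site may also return a different page to unauthenticated clients.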
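Second method
Use selenium to drive a real browser. selenium is a Python library for controlling a browser (installed with pip). It can be used for browser automation and also for crawling. It has to be paired with a browser driver: Firefox (geckodriver), Chrome (chromedriver), and Edge (msedgedriver) each have their own, available from the respective vendor's download page.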

selenium

Logging in to Zhihu

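With the library and driver installed, I started writing the script and found that when selenium opens Zhihu and types the password to log in, a 10001 error appears. One blog post says this is because JavaScript on the page detects that a machine is operating the browser. There are several workarounds online. The one chosen here uses the browser's remote debugging mode: create a new user data folder, start the browser with it, and have the script attach to this already running browser each time.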

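How to create a new browser user profile
1. Find the directory that contains the browser's exe file and open a cmd command prompt there.
2. Enter the following command: chrome.exe --remote-debugging-port=9222 --user-data-dir="E:\data_info\selenium_data". Here chrome.exe is the browser's exe file (for Edge it is msedge.exe). 9222 is a port number of your choice, and the same number must be written into the code later. The last argument is the new user data directory, which you name yourself.
3. Create a new shortcut to the browser on the desktop, right-click it, open Properties, and put the command from step 2 in the Target field. Clicking the shortcut then opens a new browser. Log in to Zhihu in that browser first.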
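The steps above can also be scripted instead of using a shortcut. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and only the two command-line flags come from the post. Building the command and the debugger address from the same port value keeps the two from drifting apart.

```python
import subprocess
from typing import List


def build_debug_command(browser_exe: str, port: int, user_data_dir: str) -> List[str]:
    """Build the argument list that starts the browser in remote debugging mode."""
    return [
        browser_exe,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


def debugger_address(port: int, host: str = "127.0.0.1") -> str:
    """Return the host:port string to pass as selenium's debuggerAddress option."""
    return f"{host}:{port}"


def launch_debug_browser(browser_exe: str, port: int, user_data_dir: str) -> subprocess.Popen:
    """Start the browser process; the caller still has to log in to Zhihu manually."""
    return subprocess.Popen(build_debug_command(browser_exe, port, user_data_dir))
```

For example, launch_debug_browser("chrome.exe", 9222, r"E:\data_info\selenium_data") starts the browser, and debugger_address(9222) gives the value for options.add_experimental_option("debuggerAddress", ...).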

Full code

from selenium import webdriver  # drives the browser
import time
from selenium.webdriver.edge.options import Options
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
import os
os.startfile("C:\\Users\\Administrator\\Desktop\\Microsoft Edge (1).lnk") # open the prepared browser shortcut (Windows only)
options = Options() # Edge options
# The port here must match the --remote-debugging-port value in the shortcut's Target
options.add_experimental_option("debuggerAddress", "127.0.0.1:6001") # address of the running browser's debugging port
#options.add_experimental_option('excludeSwitches',['enable-automation'])
driver=webdriver.Edge(service =Service("D:\\BaiduNetdiskWorkspace\\Lite Code\\python腳本\\firefox_selenium\\msedgedriver.exe"),options=options) # location of the browser driver
time.sleep(3)
write = ""
write = ""

def get_all_folder(url):
    # Collect the link of every collection listed on one overview page
    driver.get(url)
    time.sleep(2)
    href = []
    # NOTE: the attribute predicate inside [@...] was lost from the source text;
    # fill it in from the page's markup before running
    title = driver.find_elements(By.XPATH,'(//a[@])')
    for i in title:
        href.append(i.get_attribute('href'))
    return href

def pythonpazhihu(url,write):
    # Append one "title , link" line per saved item on this page to write
    driver.get(url)
    time.sleep(3)
    #h2 = driver.find_elements(By.CLASS_NAME,"ContentItem-title")
    # NOTE: the attribute predicate inside [@...] was lost from the source text;
    # the commented-out line above suggests the class ContentItem-title
    title = driver.find_elements(By.XPATH,'(//h2[@]//a)')
    h3 = driver.find_elements(By.XPATH,'(//h2[@]//a[@href])')
    for t,link in zip(title,h3):
        content = str(t.text)+" , "+str(link.get_attribute('href'))
        write=write+content+"\n"
        print(t.text,link.get_attribute('href'))
    time.sleep(2)
    return write
try:
    url_all1 = "https://www.zhihu.com/collections/mine?page=1" # the overview also spans two pages; get each collection's link from both
    url_all2 = "https://www.zhihu.com/collections/mine?page=2"
    href1 = get_all_folder(url_all1)
    href2 = get_all_folder(url_all2)
    href2 = href1+href2
    #print(href2)
    for url_son in href2:
        for i in range(5):
            #url = 'https://www.zhihu.com/collection/7179314xx?page=%s'%(i+1)
            url = url_son+'?page=%s'%(i+1) # loop over 5 pages of each collection
            write = pythonpazhihu(url,write) # add the titles and links that were read to the write variable
finally:
    driver.close()
    with open("./zhihu.txt","w",encoding="utf-8") as fp:
        fp.write(write)
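
The script above always requests five pages per collection, whether or not they exist. The following is a minimal sketch of stopping at the first empty page instead. It assumes that a page past the last one simply yields no items, which has not been verified against Zhihu. fetch_page is a hypothetical callback that returns (title, href) pairs for one URL; in practice it could wrap the selenium lookup used in pythonpazhihu.

```python
from typing import Callable, Iterator, List, Tuple

Item = Tuple[str, str]  # (title, href)


def page_urls(collection_url: str, max_pages: int) -> Iterator[str]:
    """Yield the paginated URLs of one collection, starting at page 1."""
    for page in range(1, max_pages + 1):
        yield f"{collection_url}?page={page}"


def collect_items(
    collection_url: str,
    fetch_page: Callable[[str], List[Item]],
    max_pages: int = 5,
) -> List[Item]:
    """Fetch pages in order and stop at the first page that returns no items."""
    items: List[Item] = []
    for url in page_urls(collection_url, max_pages):
        page_items = fetch_page(url)
        if not page_items:
            break  # assumed to mean we ran past the last page
        items.extend(page_items)
    return items


def to_lines(items: List[Item]) -> str:
    """Format items as 'title , href' lines, the same layout as zhihu.txt."""
    return "".join(f"{title} , {href}\n" for title, href in items)
```

Keeping the page loop separate from the browser calls also makes it possible to try the logic with a fake fetch_page before pointing it at a real browser.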

Summary

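The idea of the code is: first open the browser shortcut, visit the collections overview pages to get the link of each collection, then visit each of those links to collect the titles and addresses of the saved items.
Notes: 1. The driver path is passed through a Service object, which is the newer style after selenium's update. 2. The newer locator call is find_elements with two arguments: By.xxx states the locating strategy, for example By.XPATH, followed by the expression itself. The code wraps each XPath expression in (). In XPath these outer parentheses are only grouping, so the expression also works without them.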
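
To illustrate the attribute-predicate form of XPath that find_elements(By.XPATH, ...) takes, here is a small sketch that runs without a browser. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and is used here purely so the snippet is runnable. The sample markup is made up, and the class name ContentItem-title is an assumption taken from the commented-out line in the script.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Made-up markup that imitates a list of saved items
SAMPLE = """
<div>
  <h2 class="ContentItem-title"><div><a href="https://example.com/q/1">First</a></div></h2>
  <h2 class="Other"><a href="https://example.com/q/2">Second</a></h2>
  <h2 class="ContentItem-title"><a href="https://example.com/q/3">Third</a></h2>
</div>
"""


def extract_links(markup: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (text, href) for every <a> nested under a matching <h2>."""
    root = ET.fromstring(markup.strip())
    # [@class='...'] keeps only h2 elements whose class attribute equals the value;
    # //a then selects <a> descendants at any depth below them
    anchors = root.findall(".//h2[@class='ContentItem-title']//a")
    return [(a.text, a.get("href")) for a in anchors]
```

With selenium, the corresponding call would pass the expression as the second argument, for example driver.find_elements(By.XPATH, "//h2[@class='ContentItem-title']//a").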

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/510980.html
