Downloading Zhihu Column Articles and Saving Them as Markdown

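Preface

On February 13 the author of Autojs announced that all code would be reviewed. The latest version removed features such as accessibility-based screenshots and notification listening, and every version now shows a forced-update prompt on launch. The WeChat official accounts I followed deleted their tutorial articles overnight. While searching, I found that one tutorial author's articles had not yet been deleted on other platforms, so to be safe I decided to back them up. Because he has written a lot of articles, they are fetched with a crawler and saved as markdown files.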

Reference articles:

https://www.jianshu.com/p/b8530e554782
https://www.dandelioncloud.cn/article/details/1487706447181107201

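Implementation approach

Download the Zhihu column articles and save them as HTML
Convert all HTML files to md
Extract the image links from the markdown with a regular expression
Download the images to local disk
Upload the images to a Gitee image host
Replace the original image links with the new image links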

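Download the Zhihu column articles and save them as HTML

Press F12 in the browser to open the developer tools, visit the column link, and look at the network requests. One of them returns exactly the content we want. Zhihu is fairly friendly here: the response is JSON, so it can be parsed directly as JSON.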
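The column endpoint used in the implementation below is paged with `limit` and `offset` query parameters and reports the article count in `paging.totals`. As a minimal sketch of the paging arithmetic (the helper name `page_offsets` is made up for illustration and is not part of the original code), ceiling division avoids requesting an empty trailing page when the total is an exact multiple of the page size:

```python
import math
from typing import List

def page_offsets(total: int, limit: int = 100) -> List[int]:
    """Return the offset of every page needed to cover `total` items."""
    pages = math.ceil(total / limit)  # 0 items -> 0 pages, 200 items -> 2 pages
    return [page * limit for page in range(pages)]

print(page_offsets(250))  # [0, 100, 200]
```

Each offset can then be placed into the `offset` parameter of one request.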

Implementation

import json
import os
import requests

# Assumption: the original does not show `headers`; a browser-like User-Agent is used here as a placeholder
headers={"User-Agent":"Mozilla/5.0"}

def downloadZhuanLanToLocalHtml(zhuanLan,htmlSavePath):
    """
    Download the articles of a Zhihu column and save each one as an HTML file.
    :param zhuanLan: column ID; for https://www.zhihu.com/column/c_1341718720926887936 the ID is c_1341718720926887936
    :param htmlSavePath: folder in which the HTML files are stored
    """
    # Get the total number of articles
    urlIndex=f"https://www.zhihu.com/api/v4/columns/{zhuanLan}/items"
    res=requests.get(urlIndex,headers=headers)
    # Zhihu is friendly here: the response is JSON
    totals=json.loads(res.text)["paging"]["totals"]
    totalPage=(totals+99)//100  # total number of pages (ceiling division)
    for i in range(totalPage):
        # limit is at most 100; larger values cause an error
        urlpage = 'https://www.zhihu.com/api/v4/columns/{}/items?limit={}&offset={}'.format(zhuanLan, 100, 100*i)
        respage = requests.get(urlpage, headers=headers)
        data=json.loads(respage.content)['data']
        for article in data:
            title=article["title"]
            content=article["content"]
            # Remove question marks from the title, otherwise creating the file fails
            safeTitle=title.replace("?","").replace("？","")
            with open(os.path.join(htmlSavePath,f"{safeTitle}.html"),"w",encoding="utf-8") as f:
                f.write(content)
    print("Download complete")
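Question marks are not the only characters that can make file creation fail. As a hedged sketch (the helper `safe_filename` and the fallback name `untitled` are illustrative assumptions, not part of the original script), a title can be cleaned of the characters Windows reserves in file names before it is used as a file name:

```python
import re

# Characters reserved in Windows file names, plus the full-width question mark
_ILLEGAL = re.compile(r'[\\/:*?"<>|？]')

def safe_filename(title: str, replacement: str = "") -> str:
    """Strip reserved characters from a title so it can be used as a file name."""
    cleaned = _ILLEGAL.sub(replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"

print(safe_filename("a/b: c?"))  # ab c
```

Swapping this in for the two `replace` calls would cover titles containing slashes, colons, or quotes as well.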

Convert all HTML files to md

To convert the saved HTML to markdown we use the third-party library html2text. Install it first with pip install html2text.

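import os
import html2text as ht

def convertHtml2Markdow(htmlSavePath,mdSavePath):
    '''
    Convert all HTML files to md
    :param htmlSavePath: path of the folder holding the HTML files
    :param mdSavePath: path of the folder in which the markdown files are stored
    '''
    for file in os.listdir(htmlSavePath):
        # Get the file name without its extension (dots inside the title are kept)
        filename=os.path.splitext(os.path.basename(file))[0]
        text_maker = ht.HTML2Text()
        # Read the HTML file
        with open(os.path.join(htmlSavePath,file), 'r', encoding='UTF-8') as f:
            htmlpage = f.read()
        # Convert the HTML content to markdown
        text = text_maker.handle(htmlpage)
        # Write the converted content
        with open(os.path.join(mdSavePath,filename+".md"), 'w', encoding='UTF-8') as f:
            f.write(text)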
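Splitting a file name at the first dot truncates titles that themselves contain dots. A small sketch of deriving the markdown path with `pathlib` instead (the helper name `md_path_for` and the sample file name are hypothetical):

```python
from pathlib import Path

def md_path_for(html_file: str, md_dir: str) -> Path:
    """Map an HTML file name to the matching .md path inside md_dir."""
    # Path.stem drops only the last suffix, so dots inside the title survive
    return Path(md_dir) / (Path(html_file).stem + ".md")

print(md_path_for("Auto.js 4.1.1 tutorial.html", "out").name)  # Auto.js 4.1.1 tutorial.md
```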

Extract the image links from the markdown files

The links in the md files are extracted with a regular expression.

import re

def lambdaToGetMarkdownPicturePosition(content):
    """
    Extract the image links from markdown text
    :param content: markdown text
    :return: list of image URLs
    """
    # This pattern only suits Zhihu articles; adjust it for other sites
    # (the original `[https|http]` was a character class, not an alternation)
    pattern = re.compile(r"!\[.*?\]\((https?://.*?source=.*?)\)")
    resultList = pattern.finditer(content)
    urlList = []
    for item in resultList:
        # Group 1 is the URL between the parentheses
        curStr = item.group(1)
        urlList.append(curStr)
        print(curStr)
    return urlList
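To see what a pattern of this kind extracts, here is a small self-contained check against a made-up piece of markdown. The pattern below is a looser, generic variant (it does not require `source=` in the URL), and the sample URLs and the name `extract_image_urls` are hypothetical:

```python
import re
from typing import List

# Image syntax: ![alt](url); the capture group holds the URL only
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\((https?://[^)\s]+)\)")

def extract_image_urls(markdown: str) -> List[str]:
    """Return every http(s) image URL found in the markdown text."""
    return IMAGE_LINK.findall(markdown)

sample = (
    "Intro text\n"
    "![](https://pic1.example.com/v2-abc_720w.jpg?source=d16d100b)\n"
    "[a normal link](https://example.com/page)\n"
)
print(extract_image_urls(sample))
# ['https://pic1.example.com/v2-abc_720w.jpg?source=d16d100b']
```

The ordinary link is skipped because it lacks the leading `!` that marks an image.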

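Download the images to local disk

Download the extracted image links to local disk first, and keep a mapping from each image link in the md file to its local path, so that the links can later be replaced with the new image-host links.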

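import requests

def downloadPic(urls,picSavePath):
    '''
    Download images to local disk
    :param urls: list of image URLs
    :param picSavePath: local folder in which the images are stored
    :return: dict mapping each image URL to its local file path
    '''
    picMap={}
    for url in urls:
        res=requests.get(url)
        if res.status_code==200:
            # Last path segment without the query string, e.g. "xxx.jpg?source=..." -> "xxx.jpg"
            savePicName=url.split("/")[-1].split("?")[0]
            localPath=f"{picSavePath}/{savePicName}"
            with open(localPath,"wb") as f:
                f.write(res.content)
            picMap[url]=localPath
        else:
            print(f"Image download failed: {url}")
    return picMap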
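Deriving the local file name by splitting the URL string works for these Zhihu links, but it can pick the wrong segment if the query string itself contains a slash. A sketch using the standard library's URL parser instead (the helper name `local_name_from_url` and the sample URLs are assumptions for illustration):

```python
import posixpath
from urllib.parse import urlsplit

def local_name_from_url(url: str) -> str:
    """Return the file name at the end of the URL path, ignoring the query string."""
    # urlsplit separates the path from the query, so "?source=..." is dropped
    return posixpath.basename(urlsplit(url).path)

print(local_name_from_url("https://pic1.example.com/v2-abc_720w.jpg?source=d16d100b"))
# v2-abc_720w.jpg
```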

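Upload the images to a Gitee image host

This step uploads the local images to a Gitee repository used as an image host. Gitee provides an open API through which images can be uploaded to a specified repository.

Open API address:
https://gitee.com/api/v5/swagger#/postV5ReposOwnerRepoContentsPath

Implementation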

import base64
import os
import requests
from urllib.parse import quote
from requests_toolbelt import MultipartEncoder

def uploadPicToGitee(picFullPath,access_token="your own token",owner="login user name",repo="repository name",branch="target branch",giteeRepoSavePath="a directory in the repository"):
    '''
    Upload a file to Gitee
    :param picFullPath: local image path
    :param giteeRepoSavePath: path inside the Gitee repository where the file is stored
    :return: raw URL of the uploaded image, or None if the upload failed
    '''
    headers = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Content-Type': 'application/json;charset=UTF-8',
        'Origin': 'https://gitee.com',
        'Referer': 'https://gitee.com/api/v5/swagger',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
        'sec-ch-ua': '"Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }

    # Base64-encode the image and decode the result to text for the form field
    with open(picFullPath,"rb") as f:
        content=base64.b64encode(f.read()).decode("ascii")

    picName=os.path.basename(picFullPath)
    data = {
        'access_token': access_token,
        'content': content,
        'message': f'upload-{picName}',
        'branch': branch,
    }

    # Uploading a file requires the data to be multipart-encoded
    data=MultipartEncoder(fields=data)
    headers['Content-Type']=data.content_type

    # Note: verify=False disables TLS certificate verification, as in the original
    res = requests.post(f'https://gitee.com/api/v5/repos/{owner}/{repo}/contents/{giteeRepoSavePath}/{picName}', headers=headers, data=data,verify=False)
    # Treat "file already exists" as success so that reruns reuse the earlier upload
    if res.status_code==201 or res.text=='{"message":"A file with this name already exists"}':
        imgUrl=f'https://gitee.com/{owner}/{repo}/raw/{branch}/{giteeRepoSavePath}/{quote(picName)}'
        return imgUrl
    return None
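The two pure steps of the upload, encoding the file content and building the raw link, can be checked without touching the network. This is a minimal sketch; the helper names `encode_file_bytes` and `gitee_raw_url` and the sample owner/repository values are hypothetical, and the URL layout simply mirrors the one used in the function above:

```python
import base64
from urllib.parse import quote

def encode_file_bytes(raw: bytes) -> str:
    """Base64-encode file bytes and return the result as ASCII text."""
    return base64.b64encode(raw).decode("ascii")

def gitee_raw_url(owner: str, repo: str, branch: str, repo_dir: str, file_name: str) -> str:
    """Build the raw-file link for an uploaded image."""
    # quote() percent-encodes spaces and non-ASCII characters in the file name
    return f"https://gitee.com/{owner}/{repo}/raw/{branch}/{repo_dir}/{quote(file_name)}"

print(encode_file_bytes(b"hello"))  # aGVsbG8=
print(gitee_raw_url("me", "imgs", "master", "pics", "a b.png"))
# https://gitee.com/me/imgs/raw/master/pics/a%20b.png
```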

Running the program

if __name__=="__main__":
    zhuanLan="c_1341718720926887936"
    htmlSavePath=r"C:\Users\teisyogun\Desktop\腳本\python_learn\test\test"   # change to the folder for your HTML files
    mdSavePath=r"C:\Users\teisyogun\Desktop\腳本\python_learn\test\test-md"  # change to the folder for your md files
    downloadZhuanLanToLocalHtml(zhuanLan,htmlSavePath)
    convertHtml2Markdow(htmlSavePath,mdSavePath)

    # Long-running uploads to Gitee hit timeout errors; if there are many md files, upload them in several batches
    mdSavePath1=r"C:\Users\teisyogun\Desktop\腳本\python_learn\test\test-md\test3"

    picSavePath=r"C:\Users\teisyogun\Desktop\腳本\python_learn\test\test-pic"  # change to the local folder for images

    # The logic below replaces the image links in the original md files with Gitee image-host links
    for root,dirs,files in os.walk(mdSavePath1):
        for filename in files:
            mdFullPath=os.path.join(root,filename)
            mdBakFullPath=os.path.join(root,filename.replace(".md","-bak.md"))
            with open(mdFullPath,"r",encoding="utf-8") as f,open(mdBakFullPath,"w",encoding="utf-8") as wf:
                print(mdFullPath)
                mdContent=f.read()
                urlList=lambdaToGetMarkdownPicturePosition(mdContent)
                # Download only when the file actually contains image links
                picMap=downloadPic(urlList,picSavePath) if urlList else {}
                print(picMap)
                for picUrl in picMap:
                    imgUrl=uploadPicToGitee(picMap[picUrl])
                    print(imgUrl)
                    if imgUrl is not None:
                        mdContent=mdContent.replace(picUrl,imgUrl)
                    else:
                        print(f"File {filename}: replacing {picMap[picUrl]} failed")
                # Always write the copy, even when no image could be downloaded
                wf.write(mdContent)
                print(f"{filename}: replacement finished")
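The replacement step itself is plain string substitution driven by a mapping from old link to new link. As a sketch that can be tried without downloading or uploading anything (the function name `replace_links` and the sample URLs are hypothetical), entries whose upload failed are represented by `None` and reported back instead of being replaced:

```python
from typing import Dict, List, Optional, Tuple

def replace_links(markdown: str, url_map: Dict[str, Optional[str]]) -> Tuple[str, List[str]]:
    """Replace each old URL with its new URL; collect URLs that have no replacement."""
    failed = []
    for old_url, new_url in url_map.items():
        if new_url is None:
            # Upload failed for this image: keep the original link and record it
            failed.append(old_url)
            continue
        markdown = markdown.replace(old_url, new_url)
    return markdown, failed

text, failed = replace_links(
    "![](http://a/1.png) ![](http://a/2.png)",
    {"http://a/1.png": "http://b/1.png", "http://a/2.png": None},
)
print(text)    # ![](http://b/1.png) ![](http://a/2.png)
print(failed)  # ['http://a/2.png']
```

Returning the failed links makes it easy to retry just those images in a later batch.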

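Final result

All md files after the replacement

Open a file that contains images and look at its source: the image links in the file have been replaced.

In preview mode the images display normally. With that, all of the functionality is complete.

Code

To get the complete code, reply 【專欄】 to the official account or leave a message in the comments.

Summary

This article only implements the requirement in a simple way and the code is rather messy; comments and corrections are welcome.

This article is published simultaneously by the WeChat official account 【產品經理不是經理】; you are welcome to follow it.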

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/543872.html
