Python Web Scraping Tutorial 08: Scraping CSDN Articles and Saving Them as PDF

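Preface

The text and images in this article come from the internet and are for learning and exchange only, with no commercial purpose. If there is any problem, please contact us promptly so we can handle it.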

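Previous articles

Python Web Scraping Tutorial 01: Scraping the Douban Top Movies

Python Web Scraping Tutorial 02: Scraping Novels

Python Web Scraping Tutorial 03: Scraping Second-Hand Housing Data

Python Web Scraping Tutorial 04: Scraping Job Postings

Python Web Scraping Tutorial 05: Scraping Bilibili Video Danmaku (Bullet Comments)

Python Web Scraping Tutorial 06: Making a Word Cloud from Scraped Data

Python Web Scraping Tutorial 07: Scraping Tencent Video Danmaku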

PS: Readers who need Python learning materials or help with questions can get them through the author's study group (free Python learning materials and group Q&A).

Development environment

Python 3.6
Pycharm
wkhtmltopdf

Modules used

pdfkit
requests
parsel

Install Python and add it to the PATH environment variable, then install the required modules with pip.
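Before running the scraper, it can help to confirm that the three modules are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of the original tutorial.

```python
from importlib.util import find_spec

# Modules listed in the "Modules used" section of this article.
REQUIRED_MODULES = ["pdfkit", "requests", "parsel"]


def missing_modules(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    # The pip package names match the module names for these three libraries.
    print("Install these first: pip install " + " ".join(missing))
else:
    print("All required modules are available")
```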

1. Goal

Scrape the content of articles on CSDN and save each one as a PDF file.

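2. Analyzing the page data

To save a web article as a PDF, you first need to download the wkhtmltopdf program; without it the conversion cannot work. You can find it with a web search (the author suggests Baidu) and download it, or get it from the study group mentioned above.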
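Because pdfkit only wraps the wkhtmltopdf executable, the script needs to know where that executable lives. As a rough sketch, the lookup could check the system PATH first and then fall back to the Windows install location used in this article's script. The function name `find_wkhtmltopdf` is hypothetical, and the fallback path is only correct if wkhtmltopdf was installed to its default folder.

```python
import os
import shutil

# Default Windows install location, taken from the script in this article.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"


def find_wkhtmltopdf(fallback=DEFAULT_WINDOWS_PATH):
    """Return a path to the wkhtmltopdf executable, or None if it cannot be found."""
    on_path = shutil.which("wkhtmltopdf")
    if on_path:
        return on_path
    if os.path.isfile(fallback):
        return fallback
    return None


path = find_wkhtmltopdf()
if path is None:
    print("wkhtmltopdf not found; install it before running the scraper")
else:
    print("using wkhtmltopdf at", path)
    # The path can then be handed to pdfkit:
    # config = pdfkit.configuration(wkhtmltopdf=path)
```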
Earlier articles in this series already covered how to scrape text, so scraping the text content itself should not be difficult.

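To get the article content, you first need to scrape the URL of each article.
The detailed analysis process was shared in an earlier article, so it is skipped here.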

Scraping CSDN Blog Articles with Python and Turning Them into PDF Files
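For readers who have not seen that analysis, the URL-collection step can be illustrated without any network access. The tutorial's own script uses the parsel selector `.article-list h4 a::attr(href)`; the sketch below swaps in the standard library's `html.parser` as a dependency-free stand-in for that selector. The sample HTML and the names `ArticleLinkParser` and `extract_article_urls` are made up for illustration and only approximate the structure of a real list page.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect href values of <a> tags in an <h4> under class="article-list"."""

    # Tags that never get a closing tag, so they are not pushed on the stack.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.open_tags = []  # (tag name, set of class names) per open element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") and self._in_article_list_heading():
            self.links.append(attrs["href"])
        if tag not in self.VOID_TAGS:
            classes = set((attrs.get("class") or "").split())
            self.open_tags.append((tag, classes))

    def handle_endtag(self, tag):
        # Drop everything opened after the most recent matching start tag.
        for index in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[index][0] == tag:
                del self.open_tags[index:]
                break

    def _in_article_list_heading(self):
        # True when an <h4> is open below an element with class "article-list".
        seen_list = False
        for tag, classes in self.open_tags:
            if "article-list" in classes:
                seen_list = True
            elif seen_list and tag == "h4":
                return True
        return False


def extract_article_urls(html):
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links


sample = """
<div class="article-list">
  <div><h4><a href="https://example.com/article/1">First</a></h4></div>
  <div><h4><a href="https://example.com/article/2">Second</a></h4></div>
</div>
<h4><a href="https://example.com/elsewhere">Not in the list</a></h4>
"""
print(extract_article_urls(sample))
```

Only the two links inside the `article-list` container are collected; the third link sits outside it and is ignored, which is the same filtering the CSS selector performs.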

Complete implementation code

import os
import re

import pdfkit
import requests
import parsel

# Minimal HTML page; {article} is replaced with the scraped article markup
html_str = """
<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
{article}
</body>
</html>
"""


def save(article, title):
    # Replace characters that are not allowed in Windows file names
    name = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    # Create the output folders on first use
    os.makedirs('html', exist_ok=True)
    os.makedirs('pdf', exist_ok=True)
    html_path = os.path.join('html', name + '.html')
    pdf_path = os.path.join('pdf', name + '.pdf')
    html = html_str.format(article=article)
    with open(html_path, mode='w', encoding='utf-8') as f:
        f.write(html)
    # Path to the wkhtmltopdf executable
    config = pdfkit.configuration(wkhtmltopdf='C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe')
    # Convert the HTML file to a PDF file with pdfkit
    pdfkit.from_file(html_path, pdf_path, configuration=config)
    print('{} downloaded'.format(name))


def main(list_url):
    # Request headers; the login cookie string goes in the Cookie header
    headers = {
        "Host": "blog.csdn.net",
        "Referer": "https://blog.csdn.net/qq_41359265/article/details/102570971",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
        "Cookie": "your own cookie",
    }
    response = requests.get(url=list_url, headers=headers)
    selector = parsel.Selector(response.text)
    # Collect the URL of every article on the list page
    urls = selector.css('.article-list h4 a::attr(href)').getall()
    for article_url in urls:
        response = requests.get(url=article_url, headers=headers)
        # response.text is the page source as a string; print it to check
        # whether the request was blocked by anti-scraping measures
        # print(response.text)
        # Extract the article part of the page with CSS selectors
        sel = parsel.Selector(response.text)
        article = sel.css('article').get()
        title = sel.css('h1::text').get()
        # Skip pages where the article or the title could not be found
        if not article or not title:
            continue
        save(article, title)


if __name__ == '__main__':
    url = 'https://blog.csdn.net/fei347795790/article/list/1'
    main(url)
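The entry URL above ends in `/article/list/1`, which suggests that the list pages are numbered. Assuming that pattern continues for later pages (the original article does not confirm this, and the number of pages for a given blog is unknown), the list URLs could be generated with a small helper. The name `list_page_urls` and the page range are hypothetical.

```python
def list_page_urls(username, first_page=1, last_page=3):
    """Build article-list URLs for one blog, assuming the /article/list/<n> pattern."""
    base = "https://blog.csdn.net/{}/article/list/{}"
    return [base.format(username, page) for page in range(first_page, last_page + 1)]


for page_url in list_page_urls("fei347795790", last_page=2):
    print(page_url)
    # main(page_url)  # would scrape every article linked from this list page
```

Pausing between requests, for example with `time.sleep`, is a sensible addition when looping over several pages.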


Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/252901.html

Tags: Python
