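Preface

The text and images in this article come from the internet and are intended solely for learning and discussion, not for any commercial use. If there is a problem, please contact us promptly so we can address it.

Free online video tutorials with worked examples on Python web scraping, data analysis, website development, and more: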

https://space.bilibili.com/523606542

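Previous articles

Python Web Scraping for Beginners (1): Scraping Douban Movie Rankings

Python Web Scraping for Beginners (2): Scraping Novels

Python Web Scraping for Beginners (3): Scraping Lianjia Second-Hand Housing Data

Python Web Scraping for Beginners (4): Scraping Job Postings from 51job

Python Web Scraping for Beginners (5): Scraping Bilibili Video Danmaku (Bullet Comments)

Python Web Scraping for Beginners (6): Making a Word Cloud

Python Web Scraping for Beginners (7): Scraping Tencent Video Danmaku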

Development environment

Python 3.6
PyCharm
wkhtmltopdf

1. Goal

Scrape the content of articles on CSDN and save each article as a PDF file.

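2. Analyzing the page data

To save web article content as PDF, you first need to install wkhtmltopdf; without it the conversion will not work. You can find the download through a web search, or obtain it from the discussion group mentioned above.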
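Before running any conversion, it can help to confirm that wkhtmltopdf is actually installed. The following is a minimal sketch using only the standard library. The function name `find_wkhtmltopdf` and the constant `DEFAULT_WINDOWS_PATH` are hypothetical, and the default path simply mirrors the one hard-coded in the complete code in this article, so adjust it to your own install location.

```python
import os
import shutil

# Assumed install location on Windows (same path as in the complete code in this article).
DEFAULT_WINDOWS_PATH = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"


def find_wkhtmltopdf(extra_candidates=(DEFAULT_WINDOWS_PATH,)):
    """Return the path to the wkhtmltopdf executable, or None if it is not found."""
    # First look on the system PATH
    on_path = shutil.which("wkhtmltopdf")
    if on_path:
        return on_path
    # Then fall back to explicitly listed locations
    for candidate in extra_candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


path = find_wkhtmltopdf()
if path is None:
    print("wkhtmltopdf not found; install it before converting HTML to PDF")
else:
    print("wkhtmltopdf found at", path)
```

If a path is found, it could be passed to `pdfkit.configuration(wkhtmltopdf=path)` instead of a hard-coded string.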

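Earlier articles in this series already covered how to scrape text, so extracting text content should pose no difficulty by now.

To get the article content, you first need to scrape the URL of each article.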
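As a rough illustration of this step, the sketch below pulls article links out of a list page. It deliberately uses the standard library's `html.parser` as a stand-in for the parsel selector `.article-list h4 a::attr(href)` used in the complete code, so that it runs without third-party packages. The class and function names, the sample HTML, and the example.com URLs are all made up for illustration, and the sketch only checks for links inside `<h4>` elements rather than also matching the `.article-list` container.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect the href of every <a> that appears inside an <h4>."""

    def __init__(self):
        super().__init__()
        self.h4_depth = 0   # how many <h4> elements we are currently inside
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.h4_depth += 1
        elif tag == "a" and self.h4_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h4" and self.h4_depth > 0:
            self.h4_depth -= 1


def extract_article_urls(list_page_html):
    """Return the article URLs found in the HTML of a list page."""
    parser = ArticleLinkParser()
    parser.feed(list_page_html)
    return parser.links


# Made-up sample markup, loosely shaped like an article list page
SAMPLE_LIST_PAGE = """
<div class="article-list">
  <div><h4><a href="https://example.com/article/details/1">First post</a></h4></div>
  <div><h4><a href="https://example.com/article/details/2">Second post</a></h4></div>
</div>
<a href="https://example.com/about">About</a>
"""

print(extract_article_urls(SAMPLE_LIST_PAGE))
```

The link outside any `<h4>` is ignored, which is the same filtering idea the CSS selector expresses more compactly.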

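The detailed analysis process was shared in earlier articles, so it is skipped here.

[Figure: Scraping CSDN blog articles with Python and saving them as PDF files]

Complete implementation code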

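import os
import re

import pdfkit
import requests
import parsel

html_str = """
<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
{article}
</body>
</html>
"""


def save(article, title):
    # Replace characters that are not allowed in Windows file names
    title = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    # Create the output folders if they do not exist yet
    os.makedirs('pdf', exist_ok=True)
    os.makedirs('html', exist_ok=True)
    pdf_path = os.path.join('pdf', title + '.pdf')
    html_path = os.path.join('html', title + '.html')
    html = html_str.format(article=article)
    with open(html_path, mode='w', encoding='utf-8') as f:
        f.write(html)
    print('{} downloaded'.format(title))
    # Path to the wkhtmltopdf executable
    config = pdfkit.configuration(wkhtmltopdf='C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe')
    # Convert the HTML file to a PDF file with pdfkit
    pdfkit.from_file(html_path, pdf_path, configuration=config)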


def main(html_url):
    # Request headers; the Cookie header carries your own login session
    headers = {
        "Host": "blog.csdn.net",
        "Referer": "https://blog.csdn.net/qq_41359265/article/details/102570971",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
        "Cookie": "your own cookie",
    }
    response = requests.get(url=html_url, headers=headers)
    selector = parsel.Selector(response.text)
    # Collect the URL of every article on the list page
    urls = selector.css('.article-list h4 a::attr(href)').getall()
    for article_url in urls:
        response = requests.get(url=article_url, headers=headers)
        # response.text is the page source as a string.
        # If the site's anti-scraping measures kick in, print it to inspect what came back:
        # print(response.text)
        sel = parsel.Selector(response.text)
        # Extract the article body and title with CSS selectors
        article = sel.css('article').get()
        title = sel.css('h1::text').get()
        if article is None or title is None:
            # Skip pages where the expected elements are missing
            continue
        save(article, title.strip())


if __name__ == '__main__':
    url = 'https://blog.csdn.net/fei347795790/article/list/1'
    main(url)
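The entry point above fetches only list page 1. To cover more of a blog, the list-page URLs could be generated in a loop. This is a sketch under the assumption that later pages follow the same `/article/list/N` pattern as the URL above. The helper name `list_page_urls` is hypothetical, and the page range is arbitrary, since the source does not say how many pages exist.

```python
def list_page_urls(username, first_page=1, last_page=3):
    """Build list-page URLs of the form https://blog.csdn.net/<username>/article/list/<n>."""
    base = "https://blog.csdn.net/{}/article/list/{}"
    return [base.format(username, page) for page in range(first_page, last_page + 1)]


for page_url in list_page_urls("fei347795790", 1, 3):
    # Each of these URLs could be passed to main() from the code above
    print(page_url)
```

Adding a short pause between requests (for example with `time.sleep`) is a common courtesy when looping over many pages.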

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/253823.html

Tags: Python


Python Web Scraping for Beginners (8): Scraping Forum Articles and Saving Them as PDF
