120 Web Crawler Examples, No. 17: Collecting Witty Sentences with Object-Oriented Python

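After collecting these 7,000+ sentences, I found that many of them are jokes with a great twist.
For example: if I bring an umbrella, the day is sunny; if I don't, it rains.

Target site analysis

The target site this time is 學句子網 (xuejuzi.cn), and the target URL is http://www.xuejuzi.cn/gaoxiao/. The first step is to collect the detail-page links from the list page (highlighted with a red box in the original screenshot).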

The list pages are paginated as follows; only the first page needs special handling.

http://www.xuejuzi.cn/gaoxiao
http://www.xuejuzi.cn/gaoxiao/2.html
http://www.xuejuzi.cn/gaoxiao/3.html
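
The pattern above can be sketched as a small helper. The name `build_page_urls` and its `total_pages` parameter are illustrative assumptions and do not appear in the original code; the sketch only shows how page 1 differs from the numbered pages.

```python
BASE_URL = "http://www.xuejuzi.cn/gaoxiao/"


def build_page_urls(total_pages):
    """Return list-page URLs for pages 1..total_pages (hypothetical helper)."""
    # Page 1 lives at the bare directory URL
    urls = [BASE_URL]
    # Pages 2 and up follow the {num}.html pattern
    for num in range(2, total_pages + 1):
        urls.append(f"{BASE_URL}{num}.html")
    return urls


print(build_page_urls(3))
```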

Because the page contains a "last page" (末頁) link, the total number of pages can be extracted from the page itself.
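
A minimal sketch of turning that link into a page count. The exact form of the href is not shown in this article, so the sample values below ("12.html" and "/gaoxiao/12.html") are assumptions, as is the helper name `parse_last_page`.

```python
import re


def parse_last_page(href):
    """Return the page number at the end of an href, or 1 if none is found."""
    # Match the digits right before ".html" at the end of the string
    match = re.search(r"(\d+)\.html$", href)
    return int(match.group(1)) if match else 1


print(parse_last_page("/gaoxiao/12.html"))
```

Falling back to 1 means a page without pagination links is still crawled once instead of raising an error.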

Extracting data from the detail pages is also straightforward: the target text sits inside p tags.
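
Text nodes pulled from p tags usually carry stray whitespace, and some may be empty. The following is a sketch of one possible cleanup step, not something the original code does; `clean_sentences` and the sample strings are hypothetical.

```python
def clean_sentences(raw_texts):
    """Strip whitespace, drop empty strings, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for text in raw_texts:
        sentence = text.strip()
        # Keep only non-empty sentences that have not appeared before
        if sentence and sentence not in seen:
            seen.add(sentence)
            cleaned.append(sentence)
    return cleaned


sample = ["  first sentence \r\n", "\n", "second sentence", "first sentence"]
print(clean_sentences(sample))
```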

Implementation

The full code for this example follows; the important parts are explained in comments.

import requests
from lxml import etree
import random


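class Spider16:
    def __init__(self):
        # List pages waiting to be crawled; starts with the first page
        self.wait_urls = ["http://www.xuejuzi.cn/gaoxiao/"]
        # URL pattern for the numbered list pages
        self.url_template = "http://www.xuejuzi.cn/gaoxiao/{num}.html"
        # Detail-page links collected from the list pages
        self.details = []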

    def get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
            "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
            "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
            "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
            "Sosospider+(+http://help.soso.com/webspider.htm)",
            "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers

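    # Build the list of list-page URLs to crawl
    def create_urls(self):
        headers = self.get_headers()
        page_url = self.wait_urls[0]
        res = requests.get(url=page_url, headers=headers, timeout=5)
        html = etree.HTML(res.text)
        # Extract the total page count from the "last page" link
        last_page = html.xpath("//div[@class='page']/a[last()]/@href")
        if len(last_page) > 0:
            # Use the file name only, so "12.html" and "/gaoxiao/12.html" both parse
            last_page = int(last_page[0].split("/")[-1].split(".")[0])
        else:
            # No pagination link found: crawl the first page only
            last_page = 1

        # Page 1 is already in wait_urls, so the numbered pages start at 2
        for i in range(2, last_page + 1):
            self.wait_urls.append(self.url_template.format(num=i))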

    # Visit every list page and collect the detail-page links
    def get_html(self):
        for url in self.wait_urls:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            if res:
                html = etree.HTML(res.text)
                # Links inside the first <dd> of each <dl>
                detail_link = html.xpath("//dl/dd[1]/a/@href")
                self.details.extend(detail_link)

    # Visit every detail page and append its sentences to a file
    def get_detail(self):
        for url in self.details:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            res.encoding = "gb2312"
            if res:
                html = etree.HTML(res.text)
                sentences = html.xpath("//div[@class='content']/p/text()")
                # Join the sentences, one per line
                long_str = "\n".join(sentences)

                with open("sentences.txt", "a+", encoding="utf-8") as f:
                    # Trailing newline keeps consecutive pages from running together
                    f.write(long_str + "\n")

    def run(self):
        self.create_urls()
        self.get_html()
        self.get_detail()

if __name__ == '__main__':
    s = Spider16()
    s.run()
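
The `requests.get` calls in the spider set `timeout=5` but are not wrapped in any exception handling, so a single timeout would stop the whole run. A minimal sketch of one way to guard against that follows. The helper `fetch_with_retry`, its parameters, and the retry policy are assumptions for illustration and are not part of the original code.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # network errors from requests are Exception subclasses
            print(f"attempt {attempt + 1} failed for {url}: {exc}")
            # Pause before the next attempt, but not after the last one
            if attempt < retries - 1:
                time.sleep(delay)
    return None
```

Inside the spider this could be called as `fetch_with_retry(lambda u: requests.get(u, headers=headers, timeout=5), url)`, skipping the URL when the result is `None`. Passing the fetch function in as an argument keeps the helper testable without network access.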

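Some of the sentences collected are genuinely funny:

1. Time really is precious: one second later and someone else would have taken the toilet.
2. I want to give my future mother-in-law a bad review: delivery is too slow.
3. Falling in love with you hurt me to death.
4. I've quit smoking; any more and I really would be riding the clouds!
5. I've realized that all these years I've been a pair of underpants: whatever fart comes along, I have to take it.
6. Happy birthday to me! May my future wife find me soon so we can register our marriage and have kids.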

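Bookmark time

Code download: https://codechina.csdn.net/hihell/python120 (a Star would be appreciated).

Data collected in this example: https://download.csdn.net/download/hihell/21048666

Since you're here, how about leaving a comment, a like, or a bookmark?

Today is day 196 of 200 of my continuous writing streak.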


Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/293736.html
