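After collecting these 7000+ sentences, I found plenty of jokes with wild plot twists among them.
For example: if I bring an umbrella, the day turns out sunny; if I don't, it rains.
Target site analysis
The target site for this scrape is 學句子網, at http://www.xuejuzi.cn/gaoxiao/. The first step is to collect the detail-page links from the area boxed in red in the figure below.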

The list pages are paginated as follows; only the first page needs special handling because it has no page number.
http://www.xuejuzi.cn/gaoxiao
http://www.xuejuzi.cn/gaoxiao/2.html
http://www.xuejuzi.cn/gaoxiao/3.html
Because each list page contains a "last page" (末頁) link, the total page count can be extracted from the page itself.
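As a minimal sketch of the pagination rule above: the first page is the bare directory URL and later pages are `N.html`. The helper names `parse_last_page` and `build_list_urls` are hypothetical, and the href shapes handled here (`N.html` with or without a leading path) are assumptions about the site's markup.

```python
import re

BASE_URL = "http://www.xuejuzi.cn/gaoxiao/"


def parse_last_page(href):
    """Return the page number at the end of a 'last page' href.

    Assumes the href ends in '<number>.html'; falls back to 1 otherwise.
    """
    match = re.search(r"(\d+)\.html$", href)
    return int(match.group(1)) if match else 1


def build_list_urls(last_page):
    """Build every list-page URL: the bare first page, then 2.html .. N.html."""
    urls = [BASE_URL]
    urls.extend(f"{BASE_URL}{n}.html" for n in range(2, last_page + 1))
    return urls


if __name__ == "__main__":
    for url in build_list_urls(parse_last_page("3.html")):
        print(url)
```

Starting the range at 2 avoids requesting a `1.html` that the pagination pattern does not list.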

Extracting data from the detail pages is also straightforward: the target text sits inside p tags.
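To illustrate pulling text out of `p` tags without any network access, here is a small sketch that runs on an inline HTML sample. It uses the standard library's `html.parser` rather than lxml so that it runs with no third-party packages. The sample markup and the `ParagraphExtractor` and `extract_sentences` names are assumptions for illustration. Unlike an XPath that selects only direct `p` children, this version collects any `p` nested inside the `content` block.

```python
from html.parser import HTMLParser

# Made-up markup for illustration; the real page layout may differ.
SAMPLE_HTML = """
<div class="page"><a href="2.html">next</a></div>
<div class="content">
  <p>1. first sentence</p>
  <p>2. second sentence</p>
</div>
<p>footer text outside the content block</p>
"""


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements nested inside <div class="content">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # 0 means we are outside the content block
        self.in_p = False
        self.sentences = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1          # nested div inside the block
            elif dict(attrs).get("class") == "content":
                self.div_depth = 1           # entering the content block
        elif tag == "p" and self.div_depth > 0:
            self.in_p = True
            self.sentences.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.sentences[-1] += data


def extract_sentences(html_text):
    """Return the non-empty paragraph texts found in the content block."""
    parser = ParagraphExtractor()
    parser.feed(html_text)
    return [s.strip() for s in parser.sentences if s.strip()]


if __name__ == "__main__":
    print(extract_sentences(SAMPLE_HTML))
```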

Implementation
The full code for this example follows; the important parts are explained in comments.
import random

import requests
from lxml import etree


class Spider16:
    def __init__(self):
        # List pages waiting to be crawled; the first page has no page number
        self.wait_urls = ["http://www.xuejuzi.cn/gaoxiao/"]
        self.url_template = "http://www.xuejuzi.cn/gaoxiao/{num}.html"
        # Detail-page links collected from the list pages
        self.details = []

    def get_headers(self):
        # Pick a random crawler User-Agent for each request
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
            "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
            "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
            "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
            "Sosospider+(+http://help.soso.com/webspider.htm)",
            "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers

    # Generate the list pages to crawl
    def create_urls(self):
        headers = self.get_headers()
        page_url = self.wait_urls[0]
        res = requests.get(url=page_url, headers=headers, timeout=5)
        html = etree.HTML(res.text)
        # Extract the total page count from the "last page" link
        last_page = html.xpath("//div[@class='page']/a[last()]/@href")
        if len(last_page) > 0:
            # Keep only the file name so "N.html" parses with or without a leading path
            last_page = int(last_page[0].split("/")[-1].split(".")[0])
            # The first page is already in wait_urls, so start from page 2
            for i in range(2, last_page + 1):
                self.wait_urls.append(self.url_template.format(num=i))

    # Collect detail-page links from every list page
    def get_html(self):
        for url in self.wait_urls:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            # A Response is truthy when its status code is below 400
            if res:
                html = etree.HTML(res.text)
                detail_link = html.xpath("//dl/dd[1]/a/@href")
                self.details.extend(detail_link)

    # Download each detail page and save its sentences
    def get_detail(self):
        for url in self.details:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            # The detail pages are decoded as gb2312
            res.encoding = "gb2312"
            if res:
                html = etree.HTML(res.text)
                sentences = html.xpath("//div[@class='content']/p/text()")
                # Append the sentences to a text file, one per line
                long_str = "\n".join(sentences)
                with open("sentences.txt", "a+", encoding="utf-8") as f:
                    # Trailing newline keeps consecutive pages from running together
                    f.write(long_str + "\n")

    def run(self):
        self.create_urls()
        self.get_html()
        self.get_detail()


if __name__ == '__main__':
    s = Spider16()
    s.run()
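Text nodes returned by an XPath `text()` query can carry leading or trailing whitespace, and empty nodes are possible. A cleanup step before writing to `sentences.txt` might look like the sketch below; `clean_sentences` and `to_block` are hypothetical helpers and are not part of the original spider.

```python
def clean_sentences(raw_nodes):
    """Strip whitespace from each text node and drop the empty ones."""
    cleaned = []
    for node in raw_nodes:
        # Replace full-width spaces with normal ones, then trim both ends
        text = node.replace("\u3000", " ").strip()
        if text:
            cleaned.append(text)
    return cleaned


def to_block(raw_nodes):
    """Join cleaned sentences one per line, ending with a newline if non-empty."""
    lines = clean_sentences(raw_nodes)
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    print(to_block(["  first \r\n", "", "\u3000second"]), end="")
```

Returning an empty string for an empty page means the output file gets no stray blank lines when a detail page yields nothing.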
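Some of the sentences that were finally scraped are genuinely funny:
1. Time really is precious: one second later and someone else would have grabbed the toilet.
2. I want to leave my future mother-in-law a bad review: delivery is too slow.
3. Falling in love with you hurt me to death.
4. I quit smoking; any more and I really would be riding the clouds!
5. I've realized that all these years I've been a pair of underpants: whatever fart comes along, I have to take it.
6. Happy birthday to me! May my future wife find me soon, so we can hurry up, register our marriage, and have kids.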
Time to bookmark
Code download: https://codechina.csdn.net/hihell/python120 (a Star would be appreciated).
Download for the material collected in this example: https://download.csdn.net/download/hihell/21048666
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/293736.html
