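Python Basics: Scraping a Novel

Web novels have boomed in recent years, but to boost revenue many novel sites fill their pages with ads and pop-ups, which makes reading tiresome. How can you read in peace without the ads? One answer is a web crawler. This article uses a small example to show how to scrape a novel with a crawler. It is shared for learning purposes only, and corrections are welcome.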

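Target Page

This article scrapes the novel 【妙手小醫仙】 from 【縱橫中文網】 (Zongheng). The novel is complete, with 187 chapters. Details:

URL: http://book.zongheng.com/showchapter/1102448.html

The crawl targets two things: the chapter list, and the body text of each chapter.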

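Target Analysis

1. Chapter directory analysis

Using the browser's built-in developer tools (shortcut F12 or Ctrl+Shift+I), we can see that all chapters sit inside a ul (unordered list) tag. Each chapter link is an a (hyperlink) tag inside an li (list item) tag. The a tag's href attribute is the URL of that chapter, and the a tag's text is the chapter title.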
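To make the ul / li / a structure concrete, here is a minimal sketch that extracts title and href pairs from a small HTML fragment. It deliberately uses the standard library's html.parser instead of the BeautifulSoup approach this article uses, so that it runs without third-party packages. The class name ChapterLinkParser, the sample markup, and the example.com URLs are all invented for illustration.

```python
from html.parser import HTMLParser


class ChapterLinkParser(HTMLParser):
    """Collect {'title', 'href'} dicts from <a> tags nested in <ul> -> <li>."""

    def __init__(self):
        super().__init__()
        self.in_ul = False
        self.in_li = False
        self.current_href = None  # href of the <a> currently open, if any
        self.current_text = []    # text fragments seen inside that <a>
        self.chapters = []

    def handle_starttag(self, tag, attrs):
        if tag == 'ul':
            self.in_ul = True
        elif tag == 'li' and self.in_ul:
            self.in_li = True
        elif tag == 'a' and self.in_li:
            self.current_href = dict(attrs).get('href')
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_href is not None:
            title = ''.join(self.current_text).strip()
            self.chapters.append({'title': title, 'href': self.current_href})
            self.current_href = None
        elif tag == 'li':
            self.in_li = False
        elif tag == 'ul':
            self.in_ul = False


# Invented sample markup that mimics the structure described above.
SAMPLE_HTML = """
<ul class="chapter-list clearfix">
  <li class="col-4"><a href="http://example.com/chapter/1.html">Chapter 1</a></li>
  <li class="col-4"><a href="http://example.com/chapter/2.html">Chapter 2</a></li>
</ul>
"""

parser = ChapterLinkParser()
parser.feed(SAMPLE_HTML)
sample_chapters = parser.chapters
print(sample_chapters)
```

The output is a list of dictionaries with the same 'title' and 'href' keys that the article's chapter list uses.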

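2. Chapter body analysis

Inspection shows that the whole chapter lives in div (class=reader_box), which contains the title div (class=title_txtbox), the chapter info div (class=bookinfo), and the body div (class=content). All body text is inside p (paragraph) tags.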
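The nesting described above can be sketched with the standard library alone. The example below collects the text of every p tag that sits inside the div with class content and ignores paragraphs elsewhere. It is an illustration, not the article's BeautifulSoup code: the class name ContentParagraphParser and the sample page are assumptions made up for this sketch.

```python
from html.parser import HTMLParser


class ContentParagraphParser(HTMLParser):
    """Collect the text of every <p> inside <div class="content">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # nesting depth inside the content div; 0 means outside
        self.in_p = False
        self.buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            # Start counting at the content div, then track nested divs.
            if self.div_depth or 'content' in classes:
                self.div_depth += 1
        elif tag == 'p' and self.div_depth:
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1
        elif tag == 'p' and self.in_p:
            self.paragraphs.append(''.join(self.buffer).strip())
            self.in_p = False


# Invented page that follows the reader_box layout described above.
SAMPLE_PAGE = """
<div class="reader_box">
  <div class="title_txtbox">Chapter 1</div>
  <div class="bookinfo">Author: someone</div>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <p>Footer outside the content div.</p>
</div>
"""

body_parser = ContentParagraphParser()
body_parser.feed(SAMPLE_PAGE)
paragraphs = body_parser.paragraphs
print(paragraphs)
```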

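Crawler Design

Fetch the chapter directory page and parse it to obtain the chapter list.
Loop over the chapter list:
1. Fetch each chapter page and parse it to obtain the body text.
2. Save the text to a file, one file per chapter.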
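The steps above can be sketched as a small driver function. This is only an outline of the control flow under stated assumptions: the function name crawl, its parameters, and the fake pages are all hypothetical, and every collaborator is passed in as a plain function so the sketch runs offline without any network access.

```python
def crawl(fetch, parse_chapter_list, parse_chapter, save, index_url):
    """Run the two-stage crawl; every collaborator is passed in as a function."""
    index_html = fetch(index_url)
    saved = 0
    for chapter in parse_chapter_list(index_html):
        chapter_html = fetch(chapter['href'])
        save(chapter['title'], parse_chapter(chapter_html))
        saved += 1
    return saved


# Stand-ins so the sketch runs offline; none of this touches the network.
FAKE_PAGES = {
    'index': 'c1,c2',
    'c1': 'text of chapter one',
    'c2': 'text of chapter two',
}
saved_files = {}

count = crawl(
    fetch=lambda url: FAKE_PAGES[url],
    parse_chapter_list=lambda html: [{'title': n, 'href': n} for n in html.split(',')],
    parse_chapter=lambda html: html.upper(),
    save=lambda title, content: saved_files.__setitem__(title, content),
    index_url='index',
)
print(count, saved_files)
```

Passing the fetch, parse, and save steps in as arguments is one possible design choice; it keeps the loop testable without hitting the real site.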

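Sample Source Code

The first step is fetching page content. Because this example fetches pages many times, the logic is wrapped in its own function:

import requests  # third-party HTTP client


def get_data(url: str = None):
    """
    Fetch a page.
    :param url: URL to request
    :return: the page content as text, or None on failure
    """
    # Request header that imitates a browser; otherwise the request returns 418
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363'}
    resp = requests.get(url=url, headers=header)  # send the request

    if resp.status_code == 200:
        if resp.encoding != resp.apparent_encoding:
            # The declared encoding differs from the detected one, so resp.text
            # would be garbled; decode the raw bytes with the detected encoding
            return resp.content.decode(encoding=resp.apparent_encoding)
        else:
            # Encodings agree, so the decoded text can be returned directly
            return resp.text
    else:
        # Otherwise print the status code and return None
        print('Status code:', resp.status_code)
        return None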

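Note: on some sites the encoding declared by the response differs from the encoding the page actually uses, which can garble Chinese text. That is why this example handles the encoding explicitly.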
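A short offline sketch shows why a wrong encoding guess produces garbled text. It assumes, purely for illustration, that a server sends GBK bytes while the client decodes them as ISO-8859-1; the sample string is made up and no HTTP request is involved.

```python
original = '第一章'

# Pretend the server sent GBK bytes but the client assumed ISO-8859-1.
raw_bytes = original.encode('gbk')
garbled = raw_bytes.decode('iso-8859-1')  # never raises, but yields mojibake
repaired = raw_bytes.decode('gbk')        # decoding with the real encoding

print(garbled != original)
print(repaired == original)
```

Decoding the raw bytes with the correct encoding, as get_data does with resp.content, recovers the original text.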

1. Parsing the chapter list

To get the whole novel, first fetch the chapter list and keep it in an in-memory list, so that each chapter's body can be fetched later:

from bs4 import BeautifulSoup  # third-party HTML parsing helper


def parse_chapters(html: str = None):
    """
    Parse the chapter list.
    :param html: HTML of the chapter directory page
    :return: list of {'title': ..., 'href': ...} dicts
    """
    if html is None:
        # Return an empty list so that callers can still iterate safely
        return []
    chapters = []
    bs = BeautifulSoup(html, features='html.parser')
    ul_chapters = bs.find('ul', class_='chapter-list clearfix')
    # Note: the class value contains a space in the page source,
    # but the space is gone after BeautifulSoup parses it
    li_chapters = ul_chapters.find_all('li', class_='col-4')
    for li_chapter in li_chapters:
        a_tag = li_chapter.find('a')
        a_href = a_tag.get('href')  # a_tag['href'] also works
        a_text = a_tag.get_text()  # chapter title
        chapters.append({'title': a_text, 'href': a_href})

    return chapters
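The code above passes each href straight to the fetch function, which works when the site emits absolute URLs. If a site were to emit relative links instead (an assumption, not something this article observed), they could be resolved against the directory page URL with the standard library's urljoin. The helper name absolutize and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

CATALOG_URL = 'http://book.zongheng.com/showchapter/1102448.html'


def absolutize(chapters, base_url=CATALOG_URL):
    """Return a copy of the chapter list with every href made absolute."""
    return [
        {'title': c['title'], 'href': urljoin(base_url, c['href'])}
        for c in chapters
    ]


# Invented hrefs that cover the relative, root-relative and absolute cases.
sample = [
    {'title': 'A', 'href': 'chapter/1.html'},
    {'title': 'B', 'href': '/chapter/2.html'},
    {'title': 'C', 'href': 'http://example.com/chapter/3.html'},
]
resolved = [c['href'] for c in absolutize(sample)]
print(resolved)
```

Absolute URLs pass through urljoin unchanged, so applying this step is harmless when the links are already complete.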

2. Parsing a single chapter

Once a chapter's link is known, its page can be fetched and parsed:

from bs4 import BeautifulSoup


def parse_single_chapter(html: str = None):
    """
    Parse the content of a single chapter.
    :param html: HTML of the chapter page
    :return: title, chapter info and body text joined into one string
    """
    bs = BeautifulSoup(html, features='html.parser')
    div_reader_box = bs.find('div', class_='reader_box')
    div_title = div_reader_box.find('div', class_='title_txtbox')
    title = div_title.get_text()  # chapter title
    div_book_info = div_reader_box.find('div', class_='bookinfo')
    book_info = div_book_info.get_text()
    div_content = div_reader_box.find('div', class_='content')
    p_tags = div_content.find_all('p')
    # One line per paragraph. Use '\n' rather than '\r\n': a file opened in
    # text mode already translates '\n' to the platform line ending.
    content = ''.join(p_tag.get_text() + '\n' for p_tag in p_tags)
    return title + '\n' + book_info + '\n' + content

3. Looping, parsing, and saving

Loop over the chapters, fetch each chapter page, parse it, and save the result (save_data is defined in the complete listing at the end):

import time


def get_and_parser_single_chapter(chapters: list = None):
    """
    Fetch, parse and save every chapter.
    :param chapters: chapter list
    :return:
    """
    # Avoid a mutable default argument; treat None as an empty list
    for (index, chapter) in enumerate(chapters or [], 1):
        title = chapter.get('title')
        href = chapter.get('href')
        while True:
            print('Fetching chapter %d' % index)
            html = get_data(href)
            if html is not None:
                content = parse_single_chapter(html)
                save_data(title, content)  # save to file
                print('Chapter %d fetched successfully' % index)
                break
            else:
                # Wait two seconds, then retry the same chapter
                print('Chapter %d failed' % index)
                time.sleep(2)
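The loop above retries a failed chapter indefinitely, which can hang the crawl if a page never becomes available. One possible variation, sketched here under assumptions of my own, caps the number of attempts and doubles the wait each time. The function name fetch_with_retry and its parameters are hypothetical, and the fetcher is a fake so the example runs offline without actually sleeping.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns non-None or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 2s, then 4s, then 8s, ...
    return None


# Offline demonstration: a fake fetcher that fails twice, then succeeds.
calls = []
delays = []


def flaky_fetch(url):
    calls.append(url)
    return '<html>ok</html>' if len(calls) >= 3 else None


result = fetch_with_retry(flaky_fetch, 'http://example.com/c1', sleep=delays.append)
print(result, delays)
```

In real use the default sleep=time.sleep would apply, and a function such as get_data could be passed as fetch.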

4. Overall calling logic

With the individual functions written, calling them in order gives the complete crawler:

url = 'http://book.zongheng.com/showchapter/1102448.html'
print('Start time >>>>>', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
html_chapters = get_data(url)
chapters = parse_chapters(html_chapters)
get_and_parser_single_chapter(chapters)
print('End time >>>>>', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
print('done')

Sample Screenshots

[Screenshot: the list of scraped chapter files]

[Screenshot: the content of a single chapter file]

The complete example code:

import os
import time

import requests  # HTTP client used to send the requests
from bs4 import BeautifulSoup  # HTML parsing helper

"""
Purpose: scrape a novel
Steps: 1. Fetch all chapters and the URL of each chapter
2. Parse the content of each chapter
3. Save
"""

OUTPUT_DIR = '妙手小醫仙'  # directory that receives one text file per chapter


def get_data(url: str = None):
    """
    Fetch a page.
    :param url: URL to request
    :return: the page content as text, or None on failure
    """
    # Request header that imitates a browser; otherwise the request returns 418
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363'}
    resp = requests.get(url=url, headers=header)  # send the request

    if resp.status_code == 200:
        if resp.encoding != resp.apparent_encoding:
            # The declared encoding differs from the detected one, so resp.text
            # would be garbled; decode the raw bytes with the detected encoding
            return resp.content.decode(encoding=resp.apparent_encoding)
        else:
            # Encodings agree, so the decoded text can be returned directly
            return resp.text
    else:
        # Otherwise print the status code and return None
        print('Status code:', resp.status_code)
        return None


def parse_chapters(html: str = None):
    """
    Parse the chapter list.
    :param html: HTML of the chapter directory page
    :return: list of {'title': ..., 'href': ...} dicts
    """
    if html is None:
        # Return an empty list so that callers can still iterate safely
        return []
    chapters = []
    bs = BeautifulSoup(html, features='html.parser')
    ul_chapters = bs.find('ul', class_='chapter-list clearfix')
    # Note: the class value contains a space in the page source,
    # but the space is gone after BeautifulSoup parses it
    li_chapters = ul_chapters.find_all('li', class_='col-4')
    for li_chapter in li_chapters:
        a_tag = li_chapter.find('a')
        a_href = a_tag.get('href')  # a_tag['href'] also works
        a_text = a_tag.get_text()  # chapter title
        chapters.append({'title': a_text, 'href': a_href})

    return chapters


def get_and_parser_single_chapter(chapters: list = None):
    """
    Fetch, parse and save every chapter.
    :param chapters: chapter list
    :return:
    """
    # Avoid a mutable default argument; treat None as an empty list
    for (index, chapter) in enumerate(chapters or [], 1):
        title = chapter.get('title')
        href = chapter.get('href')
        while True:
            print('Fetching chapter %d' % index)
            html = get_data(href)
            if html is not None:
                content = parse_single_chapter(html)
                save_data(title, content)  # save to file
                print('Chapter %d fetched successfully' % index)
                break
            else:
                # Wait two seconds, then retry the same chapter
                print('Chapter %d failed' % index)
                time.sleep(2)


def parse_single_chapter(html: str = None):
    """
    Parse the content of a single chapter.
    :param html: HTML of the chapter page
    :return: title, chapter info and body text joined into one string
    """
    bs = BeautifulSoup(html, features='html.parser')
    div_reader_box = bs.find('div', class_='reader_box')
    div_title = div_reader_box.find('div', class_='title_txtbox')
    title = div_title.get_text()  # chapter title
    div_book_info = div_reader_box.find('div', class_='bookinfo')
    book_info = div_book_info.get_text()
    div_content = div_reader_box.find('div', class_='content')
    p_tags = div_content.find_all('p')
    # One line per paragraph. Use '\n' rather than '\r\n': a file opened in
    # text mode already translates '\n' to the platform line ending.
    content = ''.join(p_tag.get_text() + '\n' for p_tag in p_tags)
    return title + '\n' + book_info + '\n' + content


def save_data(name, content):
    """
    Save one chapter.
    :param name: file name (without extension)
    :param content: file content
    :return:
    """
    # Create the output directory if it is missing; os.path.join keeps the
    # path portable instead of hard-coding a Windows backslash
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with open(os.path.join(OUTPUT_DIR, name + '.txt'), 'w', encoding='utf-8') as f:
        f.write(content)


url = 'http://book.zongheng.com/showchapter/1102448.html'
print('Start time >>>>>', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
html_chapters = get_data(url)
chapters = parse_chapters(html_chapters)
get_and_parser_single_chapter(chapters)
print('End time >>>>>', time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
print('done')
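The save step uses the chapter title directly as the file name. A title that contains characters such as ? or : would fail on Windows, so a small sanitizing step may help. The sketch below is one possible approach, not part of the original code: the helper name safe_filename, the replacement character, and the sample titles are assumptions.

```python
import re

# Characters that Windows rejects in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, replacement='_'):
    """Return a version of title that is safe to use as a file name."""
    cleaned = _ILLEGAL.sub(replacement, title).strip().rstrip('.')
    return cleaned or 'untitled'


print(safe_filename('Chapter 1: Who are you?'))
print(safe_filename(' 第一章 初遇 '))
```

A call such as save_data(safe_filename(title), content) would then apply it before writing.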


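Remarks

I never regard half an hour as a trivially small stretch of time. The truly strong are not those without tears, but those who keep running with tears in their eyes. Just keep walking the road ahead, and ask not whether it leads west or east.

Chang Xiang Si: A Journey over Mountains (長相思·山一程)

Nalan Xingde (納蘭性德), Qing dynasty

A journey over mountains, a journey over waters, heading toward the far side of Yuguan Pass. Deep in the night, lamps glow in a thousand tents.

Wind through one watch, snow through another, their clamor shatters the homesick dream. My native garden has no such sound.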
