They say Python can do anything; this time we use Python to browse the Liyang Photography Circle, and it works well

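This post continues our study of BeautifulSoup. The target site is the Liyang Photography Circle (溧陽攝影圈), a local forum.

Target site analysis

The list pages of the target site follow this pagination rule: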

http://www.jsly001.com/thread-htm-fid-45-page-{page number}.html
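As a minimal sketch of this rule, the list-page URLs can be generated by substituting the page number into the pattern. The helper name list_page_urls and its arguments are assumptions made for illustration and do not appear in the original code.

```python
# URL pattern taken from the pagination rule above
PAGE_URL = "http://www.jsly001.com/thread-htm-fid-45-page-{page}.html"


def list_page_urls(first, last):
    """Return the list-page URLs for pages first..last (inclusive)."""
    return [PAGE_URL.format(page=number) for number in range(first, last + 1)]


if __name__ == "__main__":
    for url in list_page_urls(1, 3):
        print(url)
```

Building the whole list up front makes it easy to hand the URLs to worker threads later.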

The code is written with the threading module for multithreading, together with the requests module and the BeautifulSoup module.

The crawl goes from list page → detail page.
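The full code further down only fetches the first page in a single thread. One possible way to spread several list pages across threads is sketched here, assuming a shared queue.Queue as the global resource each thread pulls from. The names crawl, handle_page and worker_count are hypothetical, and handle_page stands in for the real fetch-and-parse step so the sketch runs without network access.

```python
import queue
import threading


def crawl(urls, handle_page, worker_count=3):
    """Call handle_page(url) for every URL, using a small pool of threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = []
    lock = threading.Lock()  # protects the shared results list

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so the thread ends
            outcome = handle_page(url)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every worker has finished
    return results


if __name__ == "__main__":
    pages = [f"http://www.jsly001.com/thread-htm-fid-45-page-{n}.html" for n in range(1, 4)]
    # The lambda is a placeholder for a function that downloads and parses the page.
    print(sorted(crawl(pages, lambda url: url)))
```

The results come back in completion order, not page order, so sort them if the order matters. This sketch passes a plain function as the thread target instead of subclassing threading.Thread as the full code does; either approach works.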

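Liyang Photography Circle image scraping code

This is a hands-on case. The relevant bs4 knowledge was covered in the previous post, so the complete code is shown first, followed by an explanation based on the comments and the key functions.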

import random
import threading
import logging

from bs4 import BeautifulSoup
import requests
import lxml  # not referenced directly; importing it fails early if the parser is not installed

logging.basicConfig(level=logging.NOTSET)  # set the log output level

# Declare the LiYangThread class, which inherits from threading.Thread
class LiYangThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)  # initialize the parent Thread object
        self._headers = self._get_headers()  # pick a random user agent
        self._timeout = 5  # request timeout in seconds

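    # Each thread fetches from the shared global resource
    def run(self):
        # while True:  # this is where the multithreaded loop would start
        res = None  # stays None if the request fails
        try:
            res = requests.get(url="http://www.jsly001.com/thread-htm-fid-45-page-1.html", headers=self._headers,
                               timeout=self._timeout)  # test by fetching the first page
        except Exception as e:
            logging.error(e)

        if res is not None:
            html_text = res.text
            self._format_html(html_text)  # call the HTML parsing function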

    def _format_html(self, html):
        # Parse with lxml
        soup = BeautifulSoup(html, 'lxml')

        # Find the divider row of the board's thread list, mainly to skip pinned threads
        part_tr = soup.find(attrs={'class': 'bbs_tr4'})

        if part_tr is not None:
            items = part_tr.find_all_next(attrs={"name": "readlink"})  # links to the detail pages
        else:
            items = soup.find_all(attrs={"name": "readlink"})

        # Extract the titles and detail-page URLs
        data = [(item.text, f'http://www.jsly001.com/{item["href"]}') for item in items]
        # Visit each thread's detail page
        for name, url in data:
            self._get_imgs(name, url)

    def _get_imgs(self, name, url):
        """Parse the image URLs from a detail page."""
        res = None  # stays None if the request fails
        try:
            res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)
        # Image extraction logic
        if res is not None:
            soup = BeautifulSoup(res.text, 'lxml')
            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
            origin_div2 = soup.find(attrs={'class': 'imgList'})
            content = origin_div2 if origin_div2 else origin_div1

            if content is not None:
                imgs = content.find_all('img')

                # print([img.get("src") for img in imgs])
                self._save_img(name, imgs)  # save the images

    def _save_img(self, name, imgs):
        """Save the images."""
        name = name.replace("/", "_")  # "/" would otherwise be read as a path separator
        for img in imgs:
            url = img.get("src")
            if not url or url.find('http') < 0:
                continue
            # Look up the id attribute on the parent tag
            id_ = img.find_parent('span').get("id")

            res = None  # stays None if the download fails
            try:
                res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
            except Exception as e:
                logging.error(e)

            if res is not None:
                # Note: create the imgs folder in the Python working directory beforehand
                with open(f'./imgs/{name}_{id_}.jpg', "wb+") as f:
                    f.write(res.content)

    def _get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua
        }
        return headers


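if __name__ == '__main__':
    my_thread = LiYangThread()
    my_thread.start()  # start() runs run() in a new thread; calling run() directly stays in the main thread
    my_thread.join()  # wait for the crawl to finish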
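The comment in _save_img notes that the imgs folder has to exist before the script runs, and only "/" is replaced in the thread title. A small sketch of one way to loosen both constraints follows. The helper name image_path and the set of replaced characters are assumptions chosen for illustration, not part of the original code.

```python
import re
from pathlib import Path


def image_path(title, image_id, folder="imgs"):
    """Build a safe file path for an image and make sure the folder exists."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    # Replace characters that are not allowed in file names on common file systems
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return target_dir / f"{safe_title}_{image_id}.jpg"


if __name__ == "__main__":
    print(image_path("sunset / lake: part 1", "att_1"))
```

The returned Path can be passed straight to open(path, "wb") in place of the hand-built f-string.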

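In this case, BeautifulSoup uses the lxml parser to parse the HTML data, and later posts will mostly use this parser as well. The lxml module must be installed before use; BeautifulSoup loads it by name, so the import lxml line in the script mainly serves as an early check.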
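If lxml might be missing on the machine running the script, one option is to fall back to Python's built-in parser. This is a sketch under that assumption: the helper name make_soup is hypothetical, while FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when available, else with the built-in html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python but is slower
        return BeautifulSoup(html, "html.parser")


if __name__ == "__main__":
    soup = make_soup("<p class='demo'>hello</p>")
    print(soup.find("p").text)
```

The two parsers can build slightly different trees from malformed HTML, so results on a messy forum page may not be identical.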

Data extraction relies on the soup.find() and soup.find_all() functions. The code also uses find_parent() to read the id attribute from the parent tag:

# Find the id attribute on the parent tag
id_ = img.find_parent('span').get("id")
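The lookups described above can be tried offline on a small fragment. The HTML below is made up for illustration and only mimics the class and attribute names used in the code (bbs_tr4, readlink, tpc_content); the real forum markup may differ. The function names and the "noid" default are assumptions, and html.parser is swapped in so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical list page: one pinned thread above the divider row, two below it
LIST_HTML = """
<table>
  <tr><td><a name="readlink" href="read-htm-tid-1.html">Pinned notice</a></td></tr>
  <tr class="bbs_tr4"><td>Forum threads</td></tr>
  <tr><td><a name="readlink" href="read-htm-tid-2.html">Thread A</a></td></tr>
  <tr><td><a name="readlink" href="read-htm-tid-3.html">Thread B</a></td></tr>
</table>
"""

# Hypothetical detail page: one image wrapped in a span, one without
DETAIL_HTML = """
<div class="tpc_content">
  <span id="att_1"><img src="http://example.com/a.jpg"></span>
  <img src="http://example.com/b.jpg">
</div>
"""


def thread_links(html):
    """Return (title, href) pairs, skipping threads above the divider row."""
    soup = BeautifulSoup(html, "html.parser")
    divider = soup.find(attrs={"class": "bbs_tr4"})
    if divider is not None:
        links = divider.find_all_next(attrs={"name": "readlink"})
    else:
        links = soup.find_all(attrs={"name": "readlink"})
    return [(link.text, link["href"]) for link in links]


def image_ids(html, default="noid"):
    """Return the parent span id of each image, or default when there is none."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    for img in soup.find_all("img"):
        span = img.find_parent("span")  # None when no enclosing span exists
        ids.append(span.get("id", default) if span is not None else default)
    return ids


if __name__ == "__main__":
    print(thread_links(LIST_HTML))
    print(image_ids(DETAIL_HTML))
```

Note that find_parent() returns None when no matching ancestor exists, so chaining .get("id") directly onto it raises AttributeError for an image that sits outside any span. The guard in image_ids avoids that.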

DEBUG messages appear while the code runs because the log level is set to logging.NOTSET; raise the logging output level to suppress them.
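One way to quiet those lines is sketched here, on the assumption that the DEBUG output comes from urllib3, the HTTP library that requests is built on. The function name configure_logging is hypothetical.

```python
import logging


def configure_logging():
    """Show INFO and above from the script, and only warnings from urllib3."""
    # basicConfig has no effect if the root logger is already configured
    logging.basicConfig(level=logging.INFO)
    # Raise the level of the HTTP library's logger so its DEBUG lines are dropped
    logging.getLogger("urllib3").setLevel(logging.WARNING)


if __name__ == "__main__":
    configure_logging()
    logging.debug("hidden at INFO level")
    logging.info("shown")
```

Setting the level on the named logger leaves the script's own logging.error() calls untouched.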

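Code repository: https://codechina.csdn.net/hihell/python120. Feel free to follow it or give it a Star.

Closing notes

This post is the applied part of the bs4 series; revisit and extend it as needed.

Today is day 239 / 365 of continuous writing.
Looking forward to your follows, likes, comments, and bookmarks.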


Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/316648.html
