Valentine's Day: you've only just confessed, and I'm already picking where to shoot the wedding photos~

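Dream Eraser (夢想橡皮擦), a playful, seasoned internet bug.

February 14 has come around again. Over the last few days you have surely seen programmers showing off their love on blogs in all sorts of fancy ways. Eraser is different: he is busy helping someone pick a place to shoot wedding photos.

When the object you just new'd up (your significant other, that is) asks you, "Where in Beijing can we get wedding photos that are cheap and good?" and you whip out the data in a flash, you are sure to win an adoring look from your sweetheart.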

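Before we start

The goal of the crawl is settled, so now it is time to write the crawler. You are young, so go and enjoy showing off your love, and leave the grunt work to those of us who have been there.

The target site this time is https://www.jiehun.com.cn/. Come to think of it, Eraser booked his own wedding photos at this wedding expo back in the day~

The organization holds large wedding expos every spring, summer, autumn and winter, simultaneously in Beijing, Shanghai, Guangzhou, Tianjin, Wuhan, Hangzhou, Chengdu and other cities.

The target page looks like this:

[Screenshot: shop listing page]

What we want to scrape is the information for each shop listed on it: shop name, address, star rating, number of reviews, and price.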

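Scraping the data

Before the real crawl, it helps to write a demo that does a simple fetch and parse. The implementation below can serve as a reference.

It parses the page with XPath. The site returns no data unless a User-Agent header is sent, which is about the simplest anti-scraping measure there is.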

import json

import requests as r
from lxml import etree


def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/beijing/ch2065/store-p13/?ce=beijing"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    li_list = html.xpath("//div[@id='stlist']/ul/li")

    # Open the output file once; each shop is appended as one JSON line
    with open("hun.json", "ab+") as filename:
        for li in li_list:
            # Check whether the shop has a rating at all
            star = li.xpath("./div[@class='comment']/p[1]/b/text()")
            comment = "--"
            if star:
                # Star rating
                star = star[0]
                # Number of reviews; stays "--" if the node is missing
                comment = li.xpath("./div[@class='comment']/p[2]/a/text()")
                comment = comment[0] if comment else "--"
            else:
                star = "--"

            # Shop name
            name = li.xpath(".//a[@class='namelimit']/text()")[0]
            # Address
            store = li.xpath(
                ".//div[@class='storename']/following-sibling::p[1]/text()")[0]
            # Price; "--" when the shop does not list one
            price = li.xpath(
                ".//div[@class='storename']/following-sibling::p[2]/span[1]/text()")
            price = price[0] if price else "--"

            item = {
                "name": name,
                "store": store,
                "price": price,
                "star": star,
                "comment": comment
            }
            print(item)

            filename.write(json.dumps(
                item, ensure_ascii=False).encode("utf-8") + b"\n")


if __name__ == "__main__":
    demo()

The scraped data is stored as JSON, one object per line, so it can be read back one line at a time.

{"name": "9Xi·婚紗攝影", "store": "商家地址：北京市朝陽區朝外大街丙10號9Xi結婚匯購物中心", "price": "￥4999", "star": "--", "comment": "--"}
{"name": "非目環球旅拍", "store": "商家地址：杭州市濱江區非目影像(總店)", "price": "￥19800", "star": "--", "comment": "--"}
{"name": "小白作業室(私人會所)", "store": "商家地址：朝陽北路天鵝灣北區7號樓二單元502(朝陽大悅城對面)", "price": "--", "star": "--", "comment": "--"}
{"name": "朵美婚拍", "store": "商家地址：北京市朝陽區廣渠門外大街8號優士閣A座大堂底商", "price": "￥2999", "star": "--", "comment": "--"}
{"name": "柏悅時尚藝術館", "store": "商家地址：立湯路186號龍德廣場四層F420A", "price": "--", "star": "--", "comment": "--"}
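
Since every line is a standalone JSON object, reading the file back is a simple loop. A minimal sketch, assuming the file is the hun.json written by the code above; the function name `load_items` is made up for this illustration.

```python
import json


def load_items(path="hun.json"):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    items = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

Each element of the returned list is a plain dict with the keys name, store, price, star and comment.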

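Once the test scrape works, we can batch-crawl every shop in Beijing (for other regions, just change the corresponding URL). The data volume this time is small, but Eraser has still thoughtfully prepared a multithreaded crawler for you (sigh~ crawler skills don't come that easily).

Below is the complete code. So that you can get the fullest data, all of it is provided in one go. You may now rest your finger on the like button and give Eraser a thumbs-up.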

import threading
import requests as r
from queue import Queue, Empty
import time
from lxml import etree
import json

# Flags that main() flips to tell the worker threads to stop
CRAWL_EXIT = False
PARSE_EXIT = False


class ThreadCrawl(threading.Thread):
    def __init__(self, thread_name, page_queue, data_queue):
        super(ThreadCrawl, self).__init__()

        self.thread_name = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}

    def run(self):
        print("start", self.thread_name)
        while not CRAWL_EXIT:
            try:
                # Non-blocking get: raises Empty once every page has been taken
                page = self.page_queue.get(False)
            except Empty:
                time.sleep(0.1)
                continue

            try:
                url = f"https://www.jiehun.com.cn/beijing/ch2065/store-p{page}/?ce=beijing"

                content = r.get(url, headers=self.headers).text
                # Pause between requests to go easy on the site
                time.sleep(1)

                self.data_queue.put(content)

            except Exception as e:
                print(e)

        print("end", self.thread_name)


class ThreadParse(threading.Thread):
    def __init__(self, thread_name, data_queue, filename, lock):
        super(ThreadParse, self).__init__()
        self.thread_name = thread_name
        self.data_queue = data_queue
        self.filename = filename
        self.lock = lock

    def run(self):
        print("start", self.thread_name)
        while not PARSE_EXIT:
            try:
                html = self.data_queue.get(False)
            except Empty:
                # Nothing to parse yet; wait for the crawl threads
                time.sleep(0.1)
                continue

            try:
                self.parse(html)
            except Exception as e:
                print(e)
        print("end", self.thread_name)

    def parse(self, html):
        html = etree.HTML(html)

        li_list = html.xpath("//div[@id='stlist']/ul/li")
        for li in li_list:
            # Check whether the shop has a rating at all
            star = li.xpath("./div[@class='comment']/p[1]/b/text()")
            comment = "--"
            if star:
                # Star rating
                star = star[0]
                # Number of reviews; stays "--" if the node is missing
                comment = li.xpath("./div[@class='comment']/p[2]/a/text()")
                comment = comment[0] if comment else "--"
            else:
                star = "--"

            # Shop name
            name = li.xpath(".//a[@class='namelimit']/text()")[0]
            # Address
            store = li.xpath(
                ".//div[@class='storename']/following-sibling::p[1]/text()")[0]
            # Price; "--" when the shop does not list one
            price = li.xpath(
                ".//div[@class='storename']/following-sibling::p[2]/span[1]/text()")
            price = price[0] if price else "--"

            item = {
                "name": name,
                "store": store,
                "price": price,
                "star": star,
                "comment": comment
            }

            # Only one thread may write to the shared file at a time
            with self.lock:
                self.filename.write(json.dumps(
                    item, ensure_ascii=False).encode("utf-8") + b"\n")


def main():
    global CRAWL_EXIT, PARSE_EXIT

    # Page numbers to crawl
    page_queue = Queue(14)
    for i in range(1, 15):
        page_queue.put(i)

    # Raw HTML fetched by the crawl threads
    data_queue = Queue()
    filename = open("hun.json", "ab+")

    # Lock protecting writes to the output file
    lock = threading.Lock()

    # Three crawl threads
    crawl_list = ["crawl-thread-1", "crawl-thread-2", "crawl-thread-3"]

    threadcrawl = []
    for thread_name in crawl_list:
        thread = ThreadCrawl(thread_name, page_queue, data_queue)
        thread.start()
        threadcrawl.append(thread)

    # Three parse threads
    parse_list = ["parse-thread-1", "parse-thread-2", "parse-thread-3"]
    threadparse = []
    for thread_name in parse_list:
        thread = ThreadParse(thread_name, data_queue, filename, lock)
        thread.start()
        threadparse.append(thread)

    # Wait until page_queue is empty (sleep instead of spinning the CPU)
    while not page_queue.empty():
        time.sleep(0.1)

    CRAWL_EXIT = True

    print("page_queue is empty")
    for thread in threadcrawl:
        thread.join()
    print("crawl threads finished")

    # Wait until every fetched page has been handed to a parse thread
    while not data_queue.empty():
        time.sleep(0.1)
    PARSE_EXIT = True

    for thread in threadparse:
        thread.join()
    print("parse threads finished")

    with lock:
        filename.close()


if __name__ == "__main__":
    main()
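
The script above tells its threads to stop through global flags. Another common way to shut workers down is to push a sentinel value into the queue, one per worker, so that `get()` can block instead of polling. The sketch below is only an illustration of that alternative, not the approach used in this post: the "work" is a stand-in (squaring a number) so it runs without the network, and the names `STOP`, `worker` and `run_workers` are made up here.

```python
import threading
from queue import Queue

STOP = object()  # sentinel: tells a worker there is no more work


def worker(tasks, results):
    """Take items until the sentinel arrives; the 'work' here is just squaring."""
    while True:
        item = tasks.get()  # blocks until something is available
        if item is STOP:
            break
        results.put(item * item)


def run_workers(pages, n_workers=3):
    tasks, results = Queue(), Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for page in pages:
        tasks.put(page)
    for _ in threads:  # one sentinel per worker
        tasks.put(STOP)
    for t in threads:
        t.join()

    # All workers have exited, so draining the queue here is safe
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

In a real crawler the squaring would be replaced by the request and the parsing, and the results queue by the file writes.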

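To make sure you can't find the cheapest shop too easily, I deliberately stored this batch of data as JSON and left the sorting to you. After all, you can't have a girlfriend and not lift a finger.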
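
For readers who want a head start on that sorting, here is a minimal sketch. It assumes prices look like the sample output above (a currency sign followed by whole digits, or "--" when missing); the helper names `price_value` and `cheapest_first` are made up for this illustration.

```python
def price_value(item):
    """Turn a price string such as '￥4999' into an int; None when there is no price."""
    digits = "".join(ch for ch in item.get("price", "") if ch in "0123456789")
    return int(digits) if digits else None


def cheapest_first(items):
    """Keep only the items with a known price, sorted from cheapest up."""
    priced = [it for it in items if price_value(it) is not None]
    return sorted(priced, key=price_value)


sample = [
    {"name": "A", "price": "￥4999"},
    {"name": "B", "price": "--"},
    {"name": "C", "price": "￥2999"},
]
print([it["name"] for it in cheapest_first(sample)])  # ['C', 'A']
```

Shops without a listed price are dropped rather than guessed at; whether that is the right call depends on what you want to show your sweetheart.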

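Choosing by pictures

Think the job is done? Of course not. Beyond price, the object we new'd up also wants to choose by looking at pictures, at the very least to see whose photography is better and more to her taste.

Next, let's crawl one more target: the wedding photo album pages on the same site.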

The page looks like this:

[Screenshot: album listing page]

What we want to grab is these several thousand wedding photos~

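I split the task into two steps:

Step 1: batch-collect the URLs of the album detail pages;
Step 2: fetch the images from each detail page URL.

The core code changes are as follows.

To collect the links, only the XPath expression needs to change.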

import requests as r
from lxml import etree


def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/beijing/ch2065/album-p461/?attr_110=&cate_id=2065&ce=beijing"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    # href of every album detail link on the listing page
    a_list = html.xpath("//div[@class='rectangle_list']/ul/li/a/@href")

    # Open the file once and append one URL per line.
    # Each href already starts with "/", so the saved URLs contain a double slash.
    with open("album_link.json", "a+") as filename:
        for a in a_list:
            filename.write(f"https://www.jiehun.com.cn/{a}\n")

After a short run, the collected links look like this:

https://www.jiehun.com.cn//album/730704/
https://www.jiehun.com.cn//album/730703/
https://www.jiehun.com.cn//album/730702/
https://www.jiehun.com.cn//album/730701/
https://www.jiehun.com.cn//album/730700/
https://www.jiehun.com.cn//album/730699/
https://www.jiehun.com.cn//album/730698/
https://www.jiehun.com.cn//album/730697/
https://www.jiehun.com.cn//album/730696/
https://www.jiehun.com.cn//album/730695/
https://www.jiehun.com.cn//album/730694/
https://www.jiehun.com.cn//album/730693/
https://www.jiehun.com.cn//album/730679/
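
The double slash in these URLs comes from joining the site root to an href that already begins with "/". If you would rather store normalized URLs, the standard library's `urllib.parse.urljoin` can resolve the href against the root. A minimal sketch; the names `BASE` and `absolute_url` are made up for this illustration.

```python
from urllib.parse import urljoin

BASE = "https://www.jiehun.com.cn/"


def absolute_url(href):
    """Resolve a site-relative href against the site root."""
    return urljoin(BASE, href)


print(absolute_url("/album/730704/"))  # https://www.jiehun.com.cn/album/730704/
```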

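To download the images, take the links stored in album_link.json and parse the img tags on each of those pages.

First, read the data in album_link.json to build the queue of links to crawl.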

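from queue import Queue


def read_file():
    page_queue = Queue()
    with open("album_link.json", "r") as f:
        for line in f:
            link = line.strip()  # drop the trailing newline
            if link:
                print(link)
                page_queue.put(link)
    print(page_queue.qsize())
    return page_queue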

Then extract the image URLs from each page and save the images.

import os

import requests as r
from lxml import etree


def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/album/730698/"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    # Album title, used as the filename prefix
    title = html.xpath("//div[@class='detailintro_l']/h2/text()")[0]
    img_list = html.xpath("//div[@class='img']/img/@src")

    # Make sure the output directory exists before writing into it
    os.makedirs("./imgs", exist_ok=True)
    for index, img_url in enumerate(img_list):
        content = r.get(img_url, headers=headers).content
        with open(f"./imgs/{title}-{index}.jpg", "wb") as filename:
            filename.write(content)
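
One thing to watch: the album title goes straight into the file path, so a title containing a slash, colon or similar character could make the write fail. A minimal sketch of a cleanup helper, assuming replacing such characters with an underscore is acceptable; `safe_filename` is a made-up name, not part of the code above.

```python
import re


def safe_filename(title, index, ext="jpg"):
    """Build a filename from an album title, replacing characters that are unsafe in paths."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return f"{cleaned or 'untitled'}-{index}.{ext}"


print(safe_filename("a/b: c", 0))  # a_b_c-0.jpg
```

The result could then be used in place of the `{title}-{index}.jpg` part of the path.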