Scraping the Xiamen Talent Network (Python crawler)

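1. Data source

The site to scrape is the Xiamen Talent Network (廈門人才網):

http://www.xmrc.com.cn/net/info/resultg.aspx?
For each job posting, the fields to scrape are the job title, detail link, hiring company, reference salary, work location, education requirement, and publication date.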

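2. Importing the libraries

urllib.request is used to send the HTTP requests.

lxml's etree is used to parse the DOM tree with XPath, which lets us pull the wanted content out of the HTML source.

So import these two libraries first: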

import urllib.parse  # urljoin, for resolving the detail links
import urllib.request
from lxml import etree
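
To make the XPath step concrete, here is a minimal sketch that parses a small hand-written fragment. The HTML below is hypothetical and only mimics the `tr class="bg"` rows that the expressions in this article assume; the real page may be structured differently.

```python
from lxml import etree

# Hypothetical fragment, NOT copied from the real site.
html = """
<table>
  <tr class="bg"><td>1</td><td><a href="job1.html"> Python Developer </a></td></tr>
  <tr class="bg"><td>2</td><td><a href="job2.html">Tester</a></td></tr>
</table>
"""

root = etree.HTML(html)  # parse the string and get the root element

# Element nodes: read the text of each <a> in the second cell of every row
names = [a.text.strip() for a in root.xpath("//tr[@class='bg']/td[2]/a")]

# Attribute values: "/@href" returns the href strings directly
links = root.xpath("//tr[@class='bg']/td[2]/a/@href")

print(names)
print(links)
```

An expression that ends in an element name returns element objects, while one that ends in `@attribute` returns plain strings.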

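3. Defining the class

Define a Spider class, i.e. the crawler class. Besides the constructor, it has three methods: load_page fetches the HTML source of each page, parse_page extracts the fields from it, and save_file writes the result to a file.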

class Spider(object):  # define a Spider class
    def __init__(self):  # constructor
        # First page to fetch
        self.begin_page = int(input("Enter the first page: "))
        # Last page to fetch
        self.end_page = int(input("Enter the last page: "))
        # Base URL
        self.base_url = "http://www.xmrc.com.cn/net/info/resultg.aspx?"

    def load_page(self):  # fetch the HTML source of every requested page
        """
            @brief  Request every page from begin_page to end_page
            @return list of HTML source strings, one per page
        """
        # Add a User-Agent header so the request looks like it comes from a browser
        user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
        headers = {"User-Agent": user_agent}
        # The page URLs follow a pattern, so loop over the page numbers
        # Page 1 is http://www.xmrc.com.cn/net/info/resultg.aspx?=position.php?&start=20&PageIndex=1
        # Page 2 is http://www.xmrc.com.cn/net/info/resultg.aspx?=position.php?&start=20&PageIndex=2
        html_list = []  # holds the HTML source of each page
        for page in range(self.begin_page, self.end_page + 1):
            url = self.base_url + "=position.php?&start=20&PageIndex=" + str(page)
            request = urllib.request.Request(url, headers=headers)
            # Send the request for this page
            response = urllib.request.urlopen(request)
            # Decode the response bytes as UTF-8
            html = response.read().decode("utf-8")
            # print(html)  # uncomment to check that the page was fetched
            html_list.append(html)  # collect each page so it can be parsed later
        return html_list  # each element is the HTML source of one page

    # Parse the page data with lxml
    def parse_page(self, html_list):  # html_list is the list of HTML strings
        """
            @brief           Parse the fetched pages
            @param html_list HTML pages returned by the server
        """
        items = []  # collects one dictionary per job posting
        for every_html in html_list:  # take the pages one by one
            # Parse the HTML string and get the root node
            root = etree.HTML(every_html)
            # All job titles
            names = root.xpath("//tr[@class='bg']/td[2]/a")
            # All detail links, taken from the job-title cell so that
            # links[i] lines up with names[i]
            links = root.xpath("//tr[@class='bg']/td[2]/a/@href")
            # All hiring companies
            company = root.xpath("//tr[@class='bg']/td[3]/a")
            # All reference salaries
            salary = root.xpath("//tr[@class='bg']/td[5]/a")
            # All work locations
            locations = root.xpath("//tr[@class='bg']/td[4]/a")
            # All education requirements
            education = root.xpath("//tr[@class='bg']/td[6]/a")
            # All publication dates
            publish_times = root.xpath("//tr[@class='bg']/td[7]/a")
            for i in range(0, len(names)):  # one iteration per job posting
                item = {}  # create an empty dictionary
                # Fill in the key/value pairs; .strip() removes surrounding whitespace
                item["Job title"] = names[i].text.strip()
                # urljoin resolves both relative and absolute hrefs against the page URL
                item["Detail link"] = urllib.parse.urljoin(self.base_url, links[i])
                item["Company"] = company[i].text.strip()
                item["Reference salary"] = salary[i].text.strip()
                item["Work location"] = locations[i].text.strip()
                item["Education"] = education[i].text.strip()
                item["Published"] = publish_times[i].text.strip()
                items.append(item)
        # print(len(items))   # uncomment to check the record count: 30 per page
        return items  # a list in which every element is one job dictionary

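    def save_file(self, items):  # save the list of dictionaries to a txt file
        """
            @brief       Write the data to a file, overwriting any previous content

            @param items list of job dictionaries to save
        """
        # Text mode with an explicit encoding; "with" closes the file automatically
        with open('tencent.txt', "w", encoding="utf-8") as file:
            file.write(str(items))  # write the data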

The comments above explain each step; adapt the XPath expressions and field names as needed to scrape the data you want.
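
One possible adaptation is to parse row by row instead of building seven parallel lists, so that a row with an empty cell cannot shift the other fields out of line. The sketch below assumes the same column layout as the XPath expressions in this article; `SAMPLE_HTML`, `cell_text`, `parse_rows`, and the English field names are hypothetical and only serve the illustration.

```python
from lxml import etree

# Hypothetical sample mimicking the assumed table layout; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr class="bg">
    <td>1</td>
    <td><a href="detail1.html"> Python Developer </a></td>
    <td><a>Example Co.</a></td>
    <td><a>Xiamen</a></td>
    <td><a>8000</a></td>
    <td><a>Bachelor</a></td>
    <td><a>2021-08-01</a></td>
  </tr>
  <tr class="bg">
    <td>2</td>
    <td><a href="detail2.html">Tester</a></td>
    <td><a>Sample Ltd.</a></td>
    <td><a>Xiamen</a></td>
    <td></td>
    <td><a>College</a></td>
    <td><a>2021-08-02</a></td>
  </tr>
</table>
"""


def cell_text(row, index):
    """Return the stripped text of the index-th <td> of a row ("" if empty or missing)."""
    # string() concatenates all descendant text, with or without an <a> inside
    return row.xpath("string(./td[%d])" % index).strip()


def parse_rows(html):
    """Build one dictionary per <tr class="bg"> row."""
    root = etree.HTML(html)
    items = []
    for row in root.xpath("//tr[@class='bg']"):
        hrefs = row.xpath("./td[2]/a/@href")  # relative to the current row
        items.append({
            "name": cell_text(row, 2),
            "link": hrefs[0] if hrefs else "",
            "company": cell_text(row, 3),
            "location": cell_text(row, 4),
            "salary": cell_text(row, 5),
            "education": cell_text(row, 6),
            "publish_time": cell_text(row, 7),
        })
    return items


for item in parse_rows(SAMPLE_HTML):
    print(item)
```

Because every field is read relative to its own row, the second sample row simply gets an empty salary instead of raising an IndexError or borrowing a value from another posting.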

The complete code follows:

# coding=utf-8
import urllib.parse  # urljoin, for resolving the detail links
import urllib.request  # used to send the requests
from lxml import etree  # used to parse the DOM tree with XPath and extract content from the HTML source


class Spider(object):  # define a Spider class
    def __init__(self):  # constructor
        # First page to fetch
        self.begin_page = int(input("Enter the first page: "))
        # Last page to fetch
        self.end_page = int(input("Enter the last page: "))
        # Base URL
        self.base_url = "http://www.xmrc.com.cn/net/info/resultg.aspx?"

    def load_page(self):  # fetch the HTML source of every requested page
        """
            @brief  Request every page from begin_page to end_page
            @return list of HTML source strings, one per page
        """
        # Add a User-Agent header so the request looks like it comes from a browser
        user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
        headers = {"User-Agent": user_agent}
        # The page URLs follow a pattern, so loop over the page numbers
        # Page 1 is http://www.xmrc.com.cn/net/info/resultg.aspx?=position.php?&start=20&PageIndex=1
        # Page 2 is http://www.xmrc.com.cn/net/info/resultg.aspx?=position.php?&start=20&PageIndex=2
        html_list = []  # holds the HTML source of each page
        for page in range(self.begin_page, self.end_page + 1):
            url = self.base_url + "=position.php?&start=20&PageIndex=" + str(page)
            request = urllib.request.Request(url, headers=headers)
            # Send the request for this page
            response = urllib.request.urlopen(request)
            # Decode the response bytes as UTF-8
            html = response.read().decode("utf-8")
            # print(html)  # uncomment to check that the page was fetched
            html_list.append(html)  # collect each page so it can be parsed later
        return html_list  # each element is the HTML source of one page

    # Parse the page data with lxml
    def parse_page(self, html_list):  # html_list is the list of HTML strings
        """
            @brief           Parse the fetched pages
            @param html_list HTML pages returned by the server
        """
        items = []  # collects one dictionary per job posting
        for every_html in html_list:  # take the pages one by one
            # Parse the HTML string and get the root node
            root = etree.HTML(every_html)
            # All job titles
            names = root.xpath("//tr[@class='bg']/td[2]/a")
            # All detail links, taken from the job-title cell so that
            # links[i] lines up with names[i]
            links = root.xpath("//tr[@class='bg']/td[2]/a/@href")
            # All hiring companies
            company = root.xpath("//tr[@class='bg']/td[3]/a")
            # All reference salaries
            salary = root.xpath("//tr[@class='bg']/td[5]/a")
            # All work locations
            locations = root.xpath("//tr[@class='bg']/td[4]/a")
            # All education requirements
            education = root.xpath("//tr[@class='bg']/td[6]/a")
            # All publication dates
            publish_times = root.xpath("//tr[@class='bg']/td[7]/a")
            for i in range(0, len(names)):  # one iteration per job posting
                item = {}  # create an empty dictionary
                # Fill in the key/value pairs; .strip() removes surrounding whitespace
                item["Job title"] = names[i].text.strip()
                # urljoin resolves both relative and absolute hrefs against the page URL
                item["Detail link"] = urllib.parse.urljoin(self.base_url, links[i])
                item["Company"] = company[i].text.strip()
                item["Reference salary"] = salary[i].text.strip()
                item["Work location"] = locations[i].text.strip()
                item["Education"] = education[i].text.strip()
                item["Published"] = publish_times[i].text.strip()
                items.append(item)
        # print(len(items))   # uncomment to check the record count: 30 per page
        return items  # a list in which every element is one job dictionary

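    def save_file(self, items):  # save the list of dictionaries to a txt file
        """
            @brief       Write the data to a file, overwriting any previous content

            @param items list of job dictionaries to save
        """
        # Text mode with an explicit encoding; "with" closes the file automatically
        with open('tencent.txt', "w", encoding="utf-8") as file:
            file.write(str(items))  # write the data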


if __name__ == '__main__':  # entry point of the program
    spider = Spider()  # create the Spider object (prompts for the page range)
    html_list = spider.load_page()  # fetch the pages and keep their HTML source
    return_items = spider.parse_page(html_list)  # parse the pages into a list of dictionaries
    # Save the data to the file
    spider.save_file(return_items)
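
save_file writes the Python repr of the list, which is awkward to load back into a program. As a hedged alternative, the sketch below stores the same kind of list as JSON. `save_json`, `load_json`, the sample record, and the temporary file path are hypothetical and not part of the original script.

```python
import json
import os
import tempfile


def save_json(items, path):
    """Write a list of job dictionaries to path as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file
        json.dump(items, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read the list of job dictionaries back from path."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up record for the demonstration, not real scraped data
sample_items = [{"Job title": "Python Developer", "Work location": "廈門"}]
json_path = os.path.join(tempfile.mkdtemp(), "jobs.json")

save_json(sample_items, json_path)
restored = load_json(json_path)
print(restored == sample_items)
```

In the script itself, such a function could be called with return_items in place of spider.save_file(return_items).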

Result: (screenshot not included)

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/292499.html
