CSDN Hot List and Huawei Cloud Blogs: Practice Targets for Python Scrapy Crawlers

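This post fills in some background on Scrapy selectors.

Scrapy selectors

Scrapy ships with its own data extraction mechanism, called selectors. A selector uses an XPath or CSS expression to pick out specific parts of an HTML document.

Scrapy selectors are built on the parsel library, which is itself a parsing library that uses lxml under the hood, so both its usage and its performance are close to lxml's. Later installments of the "120 Crawler Examples" column will cover parsel in more detail.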

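Basic use of selectors

The examples in this section use the CSDN column ranking page for testing.

The selector object can be accessed directly through the response object: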

import scrapy


class CSpider(scrapy.Spider):
    name = 'c'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list/column']

    def parse(self, response):
        # The selector object is available directly on the response
        print(response.selector)

Because XPath and CSS selectors are used so often, the response object exposes both methods directly, for example:

def parse(self, response):
    # The selector object is available directly on the response
    # print(response.selector)
    response.xpath("XPath expression")
    response.css("CSS expression")

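If you look at the source of these two methods, you will see that they simply delegate to the corresponding methods of the selector object.
Source excerpt: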

def xpath(self, query, **kwargs):
    return self.selector.xpath(query, **kwargs)

def css(self, query):
    return self.selector.css(query)

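In day-to-day spider code the response methods cover most needs, but Selector is handy in some special cases, such as parsing a piece of HTML that did not come from a response (for example, HTML read from a local file):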

import scrapy
from scrapy.selector import Selector

class CSpider(scrapy.Spider):
    name = 'c'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list/column']

    def parse(self, response):
        body = """
        <html>
            <head>
                <title>This is a title</title>
            </head>
            <body>
                This is the content
            </body>
        </html>
        """
        # Instantiate a Selector from the text and call its xpath() method
        ret = Selector(text=body).xpath("//title").get()
        print(ret)

Learning selectors in the Scrapy shell

Run the command below to open the interactive shell. The example uses the Huawei Cloud blog at https://bbs.huaweicloud.com/blogs.

> scrapy shell https://bbs.huaweicloud.com/blogs

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000000004E64320>
[s]   item       {}
[s]   request    <GET bbs.huaweicloud.com/blogs>
[s]   response   <200 bbs.huaweicloud.com/blogs>
[s]   settings   <scrapy.settings.Settings object at 0x0000000004E640F0>
[s]   spider     <CSpider 'c' at 0x5161080>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

Typing response at this point returns the corresponding object:

>>> response
<200 bbs.huaweicloud.com/blogs>
>>>

Try fetching the page title:

>>> response.xpath("//title")
[<Selector xpath='//title' data='<title>華為云博客_大資料博客_AI博客_云計算博客_開發者中心-華...'>]
>>> response.xpath("//title/text()")
[<Selector xpath='//title/text()' data='華為云博客_大資料博客_AI博客_云計算博客_開發者中心-華為云'>]
>>>

Because the result is a list of selectors, extract the data with the methods below:

>>> response.xpath("//title/text()").get()
'華為云博客_大資料博客_AI博客_云計算博客_開發者中心-華為云'
>>> response.xpath("//title/text()").getall()
['華為云博客_大資料博客_AI博客_云計算博客_開發者中心-華為云']
>>>

The page's title is not very interesting, so next extract the titles of the blog posts on the page:

>>> response.xpath("//a[@class='blogs-title two-line']/@title").get()
'AppCube實踐之標準頁面開發丨【玩轉應用魔方】'
>>> response.xpath("//a[@class='blogs-title two-line']/@title").getall()
['AppCube實踐之標準頁面開發丨【玩轉應用魔方】', '鴻蒙輕內核M核原始碼分析系列十七（3） 例外資訊ExcInfo', '1024征集令——【有獎征文】玩轉應用魔方，玩轉低代碼構建平臺', '前端需要寫自動化測驗嗎', ...]

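As you have probably noticed by now, you need get() or getall() to pull content out of the selectors: get() returns the first match as a single string (or None when nothing matched), and getall() returns a list containing every match.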

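Apart from the selector syntax itself, css() works just like xpath(): both return a SelectorList object.
Like Selector, SelectorList has its own instance methods, such as xpath(), css(), getall(), get(), re() and re_first(), plus the attrib property.

Also note that get() has an alias, extract_first(), which developers still use frequently.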

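When get() finds no matching tag it returns None, so you can either test the result against None (the test is True when nothing matched) or supply a default value:

# Check whether the result is None
>>> response.xpath("//a[@class='blogs-title']/@title").get() is None
True
# Supply a default value
>>> response.xpath("//a[@class='blogs-title']/@title").get(default='no data')
'no data'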

The title attribute can also be read without @title, through the attrib property. The code below returns all attributes of the first matched element:

>>> response.xpath("//a[@class='blogs-title two-line']").attrib

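Notes on CSS selectors

Standard CSS selectors cannot select text nodes or attribute values, so Scrapy adds the following non-standard extensions:

::text selects the text of an element;
::attr(attr_name) selects the value of an attribute.

Test code: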

>>> response.css("a.two-line::text")
>>> response.css("a.two-line::attr(title)")

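Did you notice the re() method mentioned above?

re() applies a regular expression to the extracted results. For example, the pattern below keeps only the titles in which 鴻蒙 (HarmonyOS) appears, returning each match from that word onward: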

>>> response.xpath("//a[@class='blogs-title two-line']/@title").re(r'鴻蒙.*')
['鴻蒙輕內核M核原始碼分析系列十七（3） 例外資訊ExcInfo']

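One practical difference between XPath and CSS
If an element carries many classes in its class attribute, XPath becomes awkward and a CSS selector suits the job better.
Matching a single class robustly in XPath takes an expression like this: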

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

A CSS selector is far more convenient here and needs only this short expression:

*.someclass

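Other notes

If the document uses XML namespaces, which would otherwise force you to qualify every element name in your XPath expressions, you can strip them with the Selector object's remove_namespaces() method. That method does not trim whitespace; to clean up spaces in extracted text, use normalize-space() in XPath or str.strip() in Python.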

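Using selectors well depends largely on fluency with XPath expressions. For the basics, see an earlier post in this series.

Here are some more advanced functions:

starts-with(): tests how a string begins;
contains(): tests whether a string contains a substring;
re:test(): tests a string against a regular expression (EXSLT extension);
has-class(): tests whether an element has the given class or classes (Scrapy extension);
normalize-space(): strips leading and trailing whitespace and collapses internal runs of whitespace.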


Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/344257.html
