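The XPath Language

XPath (XML Path Language) is a language for locating parts of an XML document.

Learning goal

After converting HTML into an XML document, use XPath to find HTML nodes or elements.

For example, "/" separates the levels of the hierarchy, and the leading "/" denotes the document's root node (note that this is the document itself, not the outermost tag in the document).

For an HTML document, the outermost element is therefore addressed as "/html".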

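XPath development tools

Open-source XPath expression editor: XMLQuire (works with XML files)
Chrome extension: XPath Helper
Alternatively, type $x("xpath selector") directly in the browser console

Firefox extension: XPath Checker

XPath syntax reference:

http://www.w3school.com.cn/xpath/index.asp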

XPath syntax

XPath is a language for finding information in an XML document.

XPath can be used to traverse the elements and attributes of an XML document.

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>

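Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, one step at a time.

The most useful path expressions are listed below:

Expression	Description
/	Selects from the root node.
nodename	Selects the child elements of the current node that are named nodename.
//	Selects matching nodes anywhere below the current node, regardless of their position in the document.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.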

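Examples

The table below lists some path expressions and their results:

Path expression	Result
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) is always an absolute path to an element.
bookstore	Selects the bookstore elements that are children of the current node; when the current node is the document root, this is the root element bookstore.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
//book/./title	Selects the title children of every book element; "." is the current node, so this is equivalent to //book/title.
//price/..	Selects the parent of every price element (here, the book elements).
bookstore//book	Selects all book elements that are descendants of bookstore, however deep they sit below it.
//@lang	Selects all attributes named lang.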
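
As a rough illustration, the path expressions in the table can be tried with Python's standard-library xml.etree.ElementTree. This is only a sketch under a few assumptions: ElementTree implements a subset of XPath, it evaluates paths relative to an element (so absolute paths such as /bookstore/book become relative ones), and it cannot return attribute nodes, so //@lang is approximated by reading the attribute off the matching elements. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The bookstore sample, without the XML declaration
XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# bookstore/book: book elements that are children of bookstore
books = root.findall("book")

# //book/title: title children of every book, at any depth
titles = [t.text for t in root.findall(".//book/title")]

# //price/..: the parent of every price element
parents = [p.tag for p in root.findall(".//price/..")]

# //@lang: ElementTree cannot select attributes directly,
# so select the elements that carry one and read its value
langs = [e.get("lang") for e in root.findall(".//*[@lang]")]

print(len(books), titles, parents, langs)
```

A library with full XPath 1.0 support would accept the expressions from the table unchanged; the relative forms here are a limitation of ElementTree, not of XPath.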

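Predicates

A predicate finds a specific node, or the nodes that contain a specific value.
In other words, a predicate is an extra condition attached to a path expression.
Predicates are written in square brackets "[]" and filter the selected nodes further.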

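Examples

The table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
//book[price]	Selects all book elements that have a price child element.
/bookstore/book[price>35.00]	Selects all book elements under bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements under bookstore whose price element has a value greater than 35.00.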
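
Several of these predicates can be sketched with the standard-library xml.etree.ElementTree, again as an illustration rather than a full XPath engine. ElementTree understands positional predicates, [@attr], [@attr='value'] and [tag], but not numeric comparisons such as price>35.00, so that condition is written as an ordinary Python filter below. The helper title_of is hypothetical and exists only for this example.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element


def title_of(book):
    """Return the text of a book's title child (helper for this sketch)."""
    return book.findtext("title")


# /bookstore/book[1], book[last()], book[last()-1]
first = title_of(root.find("book[1]"))
last = title_of(root.find("book[last()]"))
second_to_last = title_of(root.find("book[last()-1]"))

# //title[@lang='eng']
eng_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# //book[price]: books that have a price child
priced = root.findall(".//book[price]")

# /bookstore/book[price>35.00]/title, with the comparison done in Python
expensive = [
    title_of(b)
    for b in root.findall("book")
    if float(b.findtext("price")) > 35.00
]

print(first, last, second_to_last, eng_titles, len(priced), expensive)
```

Note that XPath positions start at 1, so book[1] is the first book, not the second.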

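Selecting unknown nodes

XPath wildcards select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.

Examples

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.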

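Selecting several paths

Using the "|" operator in a path expression, you can select several paths at once.

Examples

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of all book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements under bookstore, plus all price elements in the document.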

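Advanced XPath usage

Fuzzy matching with contains

Many web frameworks generate element ids dynamically, so the id changes every time the same page is loaded, which gets in the way of automated testing.

<div  title="Please enter your username"> <input type="text"  name="ID9sLJQnkQyLGLhYShhlJ6gPzHLgvhpKpLzp2Tyh4hyb1b4pnvzxFR!-166749344!1357374592067" id="nt1357374592068" /> </div>

The fix is to match on the stable part of the id with XPath's contains function: //input[contains(@id,'nt')]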

XML used for testing

<Root>
    <Person ID="1001" >
        <Name lang="zh-cn" >張城斌</Name>
        <Email xmlns="www.quicklearn.cn" > [email protected] </Email>
        <Blog>http://cbcye.cnblogs.com</Blog>
    </Person>
    <Person ID="1002" >
        <Name lang="en" >Gary Zhang</Name>
        <Email xmlns="www.quicklearn.cn" > [email protected]</Email>
        <Blog>http://www.quicklearn.cn</Blog>
    </Person>
</Root>

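Find all Person nodes whose Blog value contains the string cn. XPath expression:

/Root//Person[contains(Blog,'cn')]

Find all Person nodes whose Blog value contains the string cn and whose ID attribute contains 01.

XPath expression:

/Root//Person[contains(Blog,'cn') and contains(@ID,'01')]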
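
To make the expected matches concrete, here is a small sketch using Python's standard library. It is an emulation, not an XPath evaluation: xml.etree.ElementTree has no contains() function, so the two predicates are rewritten with Python's in operator. The Email elements are left out of the sample for brevity, and the helper blog_contains is a name invented for this example.

```python
import xml.etree.ElementTree as ET

XML = """<Root>
    <Person ID="1001">
        <Name lang="zh-cn">張城斌</Name>
        <Blog>http://cbcye.cnblogs.com</Blog>
    </Person>
    <Person ID="1002">
        <Name lang="en">Gary Zhang</Name>
        <Blog>http://www.quicklearn.cn</Blog>
    </Person>
</Root>"""

root = ET.fromstring(XML)


def blog_contains(person, needle):
    """Emulate contains(Blog, needle) on the first Blog child."""
    return needle in (person.findtext("Blog") or "")


# /Root//Person[contains(Blog,'cn')]
cn_people = [p.get("ID") for p in root.iter("Person") if blog_contains(p, "cn")]

# /Root//Person[contains(Blog,'cn') and contains(@ID,'01')]
cn_and_01 = [
    p.get("ID")
    for p in root.iter("Person")
    if blog_contains(p, "cn") and "01" in p.get("ID", "")
]

print(cn_people, cn_and_01)
```

Both blog URLs contain "cn" (one in "cnblogs", one in ".cn"), so the first query matches both persons; only the ID 1001 contains "01", so the second query matches one.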

Study notes

1. Locating by the node's own attributes and text

//td[text()='Data Import']
//div[contains(@class,'cux-rightArrowIcon-on')]
//a[text()='Sign up now']
//input[@type='radio' and @value='1']    multiple conditions
//span[@name='bruce'][text()='bruce1'][1]    multiple conditions
//span[@id='bruce1' or text()='bruce2']    can match several nodes
//span[text()='bruce1' and text()='bruce2']

2. Locating via the parent node

//div[@class='x-grid-col-name x-grid-cell-inner']/div
//div[@id='dynamicGridTestInstanceformclearuxformdiv']/div
//div[@id='test']/input

3. Locating via child nodes

//div[div[@id='navigation']]
//div[div[@name='listType']]
//div[p[@name='testname']]

4. Mixed

//div[div[@name='listType']]//img
//td[a/font[contains(text(),'selenium2 from scratch video')]]//input[@type='checkbox']

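5. More advanced

//input[@id='123']/following-sibling::input
The input elements that are later siblings (append [1] for just the next one)

//input[@id='123']/preceding-sibling::span
The span elements that are earlier siblings

//input[starts-with(@id,'123')]
The id starts with 123

//span[not(contains(text(),'xpath'))]
span elements whose text does not contain "xpath"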

6. Indexing

//div/input[2]
//div[@id='position']/span[3]
//div[@id='position']/span[position()=3]
//div[@id='position']/span[position()>3]
//div[@id='position']/span[position()<3]
//div[@id='position']/span[last()]
//div[@id='position']/span[last()-1]

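7. Matching with substring

<div data-for="result" id="swfEveryCookieWrap"></div>
//*[substring(@id,4,5)='Every']/@id    5 characters of the attribute, starting at position 4 (positions are 1-based)
//*[substring(@id,4)='EveryCookieWrap']    the attribute from position 4 to the end
//*[substring-before(@id,'C')='swfEvery']/@id    the part of the attribute before the first 'C'
//*[substring-after(@id,'C')='ookieWrap']/@id    the part of the attribute after the first 'C'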
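
Because XPath counts string positions from 1 while most programming languages count from 0, the arithmetic is easy to get wrong. The sketch below mirrors the four expressions with plain Python string operations on the sample id. The three helper functions are hypothetical stand-ins written for this example; they handle only positive integer arguments and are not part of any XPath library.

```python
element_id = "swfEveryCookieWrap"


def xpath_substring(s, start, length=None):
    """Mimic XPath substring() for positive integers; start is 1-based."""
    begin = start - 1  # convert to Python's 0-based index
    return s[begin:] if length is None else s[begin:begin + length]


def substring_before(s, sep):
    """Mimic substring-before(): text before the first sep, or '' if absent."""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    """Mimic substring-after(): text after the first sep, or '' if absent."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


print(xpath_substring(element_id, 4, 5))   # substring(@id,4,5)
print(xpath_substring(element_id, 4))      # substring(@id,4)
print(substring_before(element_id, "C"))   # substring-before(@id,'C')
print(substring_after(element_id, "C"))    # substring-after(@id,'C')
```

Position 4 is the "E" of "Every": the first three characters "swf" are skipped.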

8. Wildcard *

//span[@*='bruce']
//*[@name='bruce']

9. Axes

//div[span[text()='+++current node']]/parent::div
Finds the parent node

//div[span[text()='+++current node']]/ancestor::div
Finds the ancestor nodes

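10. Descendant nodes

//div[span[text()='current note']]/descendant::div/span[text()='123']
//div[span[text()='current note']]//div/span[text()='123']
The two expressions mean the same thing

11. following and preceding

// span[@]/../following::a
All a elements that come later in the document

// span[@]/../preceding::a[1]
The nearest a element that comes earlier in the document (drop [1] to get all of them)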

Extracting text spread across several tags with XPath

When writing a crawler, you often use XPath to extract data. Take the following code:

<div id="test1">Hello, everyone!</div>

Extracting this with XPath is easy. Assuming the page source is held in selector:

data = selector.xpath('//div[@id="test1"]/text()').extract()[0]

This puts "Hello, everyone!" into the data variable.

But what about the following code?

<div id="test2">Hey there, <font color=red>what is your WeChat ID?</font></div>

If you use:

data = selector.xpath('//div[@id="test2"]/text()').extract()[0]

you only get "Hey there, ".

If you use:

data = selector.xpath('//div[@id="test2"]/font/text()').extract()[0]

you only get "what is your WeChat ID?"

But what I actually want is the whole sentence, "Hey there, what is your WeChat ID?".

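<div id="test3">
    Azure Dragon on my left,
    <span id="tiger">White Tiger on my right,
    <ul>Vermilion Bird above,
        <li>Black Tortoise below,</li>
    </ul>the old ox in the middle,</span>
        dragon head on my chest,
</div>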

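What's more, the inner tags are not fixed. If I have a hundred snippets like this, how can I extract the text with an XPath expression in the quickest and most convenient way?

Using XPath's string(.)

Taking the third snippet as an example:

data = selector.xpath('//div[@id="test3"]')
info = data.xpath('string(.)').extract()[0]

This extracts the text of the div and of every tag nested inside it as one string and assigns it to the info variable.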
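
Outside of a Scrapy-style selector, the same idea can be sketched with Python's standard library. Joining the pieces yielded by Element.itertext() gives the text of an element together with the text of all its descendants, which is what string(.) evaluates to. The sketch assumes a well-formed variant of the third snippet with illustrative sample text; collapsing the whitespace at the end is an extra step added here, not something string(.) does by itself.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<div id="test3">
    Azure Dragon on my left,
    <span id="tiger">White Tiger on my right,
    <ul>Vermilion Bird above,
        <li>Black Tortoise below,</li>
    </ul>the old ox in the middle,</span>
    dragon head on my chest,
</div>"""

div = ET.fromstring(FRAGMENT)

# Comparable to taking the first text() node: only the text before the first child tag
direct_text = div.text.strip()

# Comparable to string(.): the element's text plus all descendant text, in document order
full_text = "".join(div.itertext())

# Collapse newlines and indentation into single spaces for readability
info = " ".join(full_text.split())

print(direct_text)
print(info)
```

Because itertext() walks whatever is nested inside the element, the same two lines work no matter which inner tags a given snippet happens to contain.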
