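Introduction to Data Parsing
Data parsing means extracting data: pulling the specified region of data out of a whole page fetched by a general-purpose crawler.
- Purpose: implementing a focused crawler
- Ways to implement it:
  - regular expressions (comparatively tedious)
  - bs4 (Python only)
  - xpath (usable from Java, PHP, Python, and others)
  - pyquery (Python only)
- What is the general principle behind data parsing?
  - The data being parsed is always the HTML source of a page
  - Parse the text content stored inside tags
  - Parse the values of tag attributes
- Principle:
  - 1. Locate the tag
  - 2. Extract its text or an attribute
- The crawler workflow:
  - Specify the URL
  - Send the request
  - Get the response data
  - Parse the data
  - Persist the parsed data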
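The two-step principle (locate the tag, then take its text or an attribute) can be sketched with nothing but the standard library. This is a minimal illustration, not one of the parsers covered below: the LinkCollector class and the page_text string are hypothetical names made up for this example, and page_text stands in for a real response body.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Step 1: locate the tag we care about and read its attribute.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Step 2: take the text while we are inside a located <a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Pair the collected text with the attribute value.
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page source standing in for a real response body.
page_text = (
    '<ul><li><a href="/a.html">First</a></li>'
    '<li><a href="/b.html">Second</a></li></ul>'
)

collector = LinkCollector()
collector.feed(page_text)
print(collector.links)
```

Every parser in the sections that follow performs these same two steps; they differ only in how the tag is located.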
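Parsing with Regular Expressions
Regex review
Single characters:
. : any character except a newline
[] : [aoe] [a-w] matches any one character in the set
\d : a digit, [0-9]
\D : a non-digit
\w : digits, letters, underscore, and Chinese characters
\W : anything \w does not match
\s : any whitespace character, including space, tab, form feed, and so on; for ASCII text, equivalent to [ \f\n\r\t\v]
\S : any non-whitespace character
Quantifiers:
* : any number of times, >=0
+ : at least once, >=1
? : optional, 0 or 1 time
{m} : exactly m times
{m,} : at least m times, e.g. hello{3,}
{m,n} : between m and n times
Anchors:
$ : ends with
^ : starts with
Grouping:
(ab)
Greedy mode: .*
Non-greedy (lazy) mode: .*?
re.I : ignore case
re.M : multi-line matching (^ and $ also match at the start and end of each line)
re.S : single-line matching (. also matches newlines)
re.sub(pattern, replacement, string)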
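A short sketch may make the flags and the greedy/lazy distinction more concrete. The HTML fragment and all variable names below are made up for illustration only.

```python
import re

# Hypothetical multi-line HTML fragment, used only for illustration.
html = """<div class="post">
<h1><a href="/1.html">First title</a></h1>
</div>
<div class="post">
<h1><a href="/2.html">Second title</a></h1>
</div>"""

pattern = r'<div class="post">.*?<a href=".*?">(.*?)</a>'

# Without re.S, . stops at each newline, so the pattern cannot span lines.
no_flag = re.findall(pattern, html)
# With re.S, . also matches newlines and the lazy .*? stops at the first fit.
lazy = re.findall(pattern, html, re.S)
# Greedy .* runs as far as it can, swallowing everything up to the last </a>.
greedy = re.findall(r'<a href=".*">(.*)</a>', html, re.S)
# re.M lets ^ match at the start of every line, not only the whole string.
line_starts = re.findall(r'^<h1>', html, re.M)
# {3,} applies to the preceding character only: "hell" plus three or more "o".
repeated = re.findall(r'hello{3,}', 'hello hellooo helloooo')
# re.sub(pattern, replacement, string): collapse each whitespace run to a space.
collapsed = re.sub(r'\s+', ' ', 'a  b\n c')

print(no_flag, lazy, greedy, line_starts, repeated, collapsed, sep="\n")
```

The greedy pattern returns only the second title because its first .* consumes the whole first link; this is why scraping patterns almost always use .*? together with re.S.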
Case study: extracting joke titles
- Task: use a regex to extract the joke titles from http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/
import requests
import re
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
url = 'http://duanziwang.com/category/搞笑圖/'
# The response data is captured as a string
page_text = requests.get(url=url, headers=headers).text
# Data parsing (raw string, so \d and \. reach the regex engine unchanged)
ex = r'<div >.*?<a href="http://duanziwang.com/\d+\.html">(.*?)</a></h1>'
# re.S is required when parsing a page with regex, so that . also matches the newlines in the HTML
ret = re.findall(ex, page_text, re.S)
# Persist the results
with open("./title.txt", "a", encoding="utf-8") as f:
    for i in ret:
        f.write(f"{i}\n")
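Parsing with bs4
- Installation
pip install bs4
pip install lxml
- How parsing works & the workflow
  - 1. Instantiate a BeautifulSoup object and load the page source to be parsed into it
  - 2. Call the BeautifulSoup object's attributes and methods to locate tags and extract text data
- How to instantiate BeautifulSoup
  - Option 1: BeautifulSoup(source to be parsed, "lxml") parses the specified data in an HTML file stored locally
  - Option 2: BeautifulSoup(response data, "lxml") parses data crawled from the Internet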
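Common methods and attributes
- Tag location: locate by tag name; only the first matching tag is returned
  - soup.tagName returns the first tag with that name in the source
- Attribute location: locate tags by a given attribute
  - soup.find("tagName", attrName="value") returns the first match; only the class attribute needs a trailing underscore (class_). soup.find("tagName") locates by tag name alone
  - soup.find_all("tagName", attrName="value") returns every match
- Selector location
  - soup.select(".className"), soup.select("#id")
  - Hierarchy selectors
    - Greater-than sign (>): one level
    - Space: any number of levels
- Getting text
  - tag.string: the tag's own direct text (None when the tag has more than one child)
  - tag.text: all the text under the tag
- Getting attributes
  - tag["attrName"]: index the located tag with the attribute name as a string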
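The calls listed above can be tried on a small inline document. This is a sketch under stated assumptions: the HTML is invented for illustration, and it uses Python's built-in "html.parser" backend rather than "lxml" so that it runs with bs4 alone; passing "lxml" as in the rest of this article should behave the same way for this snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, invented for this example.
html = """
<div id="list">
  <p class="intro">Intro <b>bold</b> text</p>
  <ul>
    <li><a href="/c1.html">Chapter 1</a></li>
    <li><a href="/c2.html">Chapter 2</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a                             # tag location: first <a> only
intro = soup.find("p", class_="intro")       # attribute location, note class_
all_links = soup.find_all("a")               # every <a>, as a list
direct = soup.select("#list > ul > li > a")  # ">" walks one level at a time
nested = soup.select("#list a")              # a space spans any number of levels

print(first_a.string)    # direct text of a tag with a single child
print(intro.string)      # None, because <p> has more than one child
print(intro.text)        # all text under <p>, including the <b> part
print(first_a["href"])   # attribute value
```

Note how intro.string is None while intro.text still returns the full text; this is the usual reason to prefer .text when a tag may contain nested markup.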
Case study: scraping a novel
- URL: https://www.52bqg.com/book_10508/
# For a site like this it is best to use a proxy pool
# Case 1
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
url = "https://www.52bqg.com/book_10508/"
fp = open("./九仙圖.txt", "w", encoding="utf-8")
page_text = requests.get(url=url, headers=headers)
page_text.encoding = "GBK"
x = page_text.text
soup = BeautifulSoup(x, 'lxml')
a_list = soup.select('#list a')  # every chapter link in the table of contents
for i in a_list:
    title = i.string
    a_href = 'https://www.52bqg.com/book_10508/' + i['href']
    page_text_a = requests.get(url=a_href, headers=headers)
    page_text_a.encoding = "GBK"
    f = page_text_a.text
    a_soup = BeautifulSoup(f, 'lxml')
    div_tag = a_soup.find('div', id='content')  # the div holding the chapter body
    content = div_tag.text
    fp.write("\n" + title + "\n" + content + "\n")
    print(title, "download complete")
fp.close()
- URL: http://www.balingtxt.com/txtml_84980.html
# Case 2
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
url = "http://www.balingtxt.com/txtml_84980.html"
page_text = requests.get(url=url, headers=headers)
page_text.encoding = "utf-8"
page_text = page_text.text
menu_soup = BeautifulSoup(page_text, "lxml")
a_lst = menu_soup.select("#yulan > li > a")  # chapter links in the table of contents
fp = open("./天命相師.txt", "w", encoding="utf-8")
for i in a_lst:
    title = i.string
    a_url = i["href"]
    new_text = requests.get(url=a_url, headers=headers)
    new_text.encoding = "utf-8"
    new_text = new_text.text
    content_soup = BeautifulSoup(new_text, "lxml")
    content = content_soup.find("div", class_="book_content").text
    fp.write(f"{title}\n{content}\n")
    print(f"{title} download complete!")
fp.close()
Scraping image data
With requests
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
url = "http://a0.att.hudong.com/78/52/01200000123847134434529793168.jpg"
img_data = requests.get(url=url, headers=headers).content  # returns binary (bytes) data
with open("./tiger1.jpg", "wb") as f:
    f.write(img_data)
content returns binary data, which is what you need when scraping images, audio, or video
With urllib
from urllib import request
url = "http://a0.att.hudong.com/78/52/01200000123847134434529793168.jpg"
ret = request.urlretrieve(url=url, filename="./tiger2.jpg")
print(ret)
The difference between the two approaches above is UA spoofing: requests.get accepts headers, while urlretrieve takes no headers argument, so on its own it cannot send a spoofed User-Agent
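If urllib has to send a custom User-Agent, one option is to build a urllib.request.Request with headers and open it with urlopen instead of calling urlretrieve. The sketch below assumes that approach; the URL, the UA string, and the download helper are hypothetical and only illustrate the idea.

```python
from urllib import request

# Hypothetical values for illustration; any URL and UA string would do.
url = "http://example.com/picture.jpg"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

# Building the Request object does not touch the network.
req = request.Request(url, headers=headers)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))


def download(req, filename):
    """Hypothetical helper: fetch req and write the body to filename.

    Calling it requires network access.
    """
    with request.urlopen(req) as resp, open(filename, "wb") as f:
        f.write(resp.read())
```

Calling download(req, "./tiger3.jpg") would then save the image with the custom header attached, at the cost of a few more lines than urlretrieve.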
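Parsing with XPath
- Installation
pip install lxml
- How parsing works & the workflow
  - Instantiate an etree object and load the page source to be parsed into it
  - Use the etree object's xpath method with different forms of XPath expressions to locate tags and extract data
- Instantiating the object
  - etree.parse("file path"): loads the data of a locally stored HTML file into the etree object
  - etree.HTML(page_text): loads data crawled from the network
- XPath expressions
  - Tag location
    - Leading /: the tag must be located starting from the root node (rarely used)
    - Non-leading /: one level
    - Leading //: locate the given tag from any position
    - Non-leading //: any number of levels (most common)
  - Attribute location: "//tagName[@attrName='value']"
  - Index location: "//tagName[index]"  # indexes start at 1
  - //div[contains(@class,'ng')]  # every div whose class value contains ng
  - //div[starts-with(@class,'ta')]  # every div whose class value starts with 'ta'
- Getting text
  - tree.xpath returns a list, so index into it to get a value
  - /text(): the tag's direct text
  - //text(): all the text under the tag
- Getting attributes
  - /@attrName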
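The expression forms above can be checked against a small inline page. This is a minimal sketch: the HTML, the example.com links, and the variable names are invented for illustration, and it assumes lxml is installed as described in the installation step.

```python
from lxml import etree

# Hypothetical page source, invented for this example.
page_text = """
<html><body>
  <div class="song">
    <p>Li Bai</p>
    <p>Du Fu</p>
  </div>
  <div class="tang">
    <ul>
      <li><a href="http://example.com/1">First <b>poem</b></a></li>
      <li><a href="http://example.com/2">Second</a></li>
    </ul>
  </div>
</body></html>
"""
tree = etree.HTML(page_text)

# Leading / starts at the root; each further / is exactly one level.
top_divs = tree.xpath('/html/body/div')
# Attribute location plus index location (indexes start at 1).
first_p = tree.xpath('//div[@class="song"]/p[1]/text()')
# Non-leading // spans any number of levels; /@href takes the attribute.
hrefs = tree.xpath('//div[@class="tang"]//a/@href')
# /text() returns only direct text; //text() returns all nested text.
direct_text = tree.xpath('//div[@class="tang"]//li[1]/a/text()')
all_text = tree.xpath('//div[@class="tang"]//li[1]/a//text()')
# Function-style predicates.
starts = tree.xpath('//div[starts-with(@class, "ta")]')
contains = tree.xpath('//div[contains(@class, "on")]')

print(first_p[0], hrefs, direct_text, all_text, sep="\n")
```

Every call returns a list, even when only one node matches, which is why first_p is indexed before printing.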
Case study: downloading images with XPath
- Task: use XPath to parse image URLs and names, then download the images and save them locally
- http://pic.netbian.com/4kfengjing/
- Scrape the full-size HD images
import requests
import os
from lxml import etree
from urllib import request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
dirName = "imglibs"  # name of the folder that stores the images
if not os.path.exists(dirName):
    os.mkdir(dirName)  # create the folder if it does not exist
page_url = "http://pic.netbian.com/4kfengjing/index_%d.html"
for page_num in range(1, 10):  # whole-site crawl; choose the page range to suit your needs
    if page_num == 1:
        url = "http://pic.netbian.com/4kfengjing/"
    else:
        url = page_url % page_num
    page_text = requests.get(url=url, headers=headers).text
    # Data parsing
    tree = etree.HTML(page_text)  # instantiate an etree object
    # Returns a list of a tags; the attribute test inside [@] is missing here and must be filled in for the target page
    img_lst = tree.xpath('//div[@]/ul/li/a')
    for i in img_lst:
        img_href = "http://pic.netbian.com" + i.xpath("./@href")[0]  # join the path to get the detail page URL
        img_text = requests.get(url=img_href, headers=headers).text
        new_tree = etree.HTML(img_text)  # instantiate a new etree object
        img_list = new_tree.xpath('//a[@id="img"]/img')[0]  # the img tag of the picture
        img_src = "http://pic.netbian.com" + img_list.xpath('./@src')[0]  # build the URL of the HD image
        img_alt = img_list.xpath('./@alt')[0].encode('iso-8859-1').decode('GBK')  # the image name, re-decoded as GBK
        filepath = "./" + dirName + "/" + img_alt + ".jpg"  # add the .jpg suffix
        request.urlretrieve(img_src, filename=filepath)  # persist to disk
        print(img_alt, "downloaded successfully!")
print("Over!")
Parsing with pyquery
- Installation
pip install pyquery
- Import
from pyquery import PyQuery as pq
- Overview
  - pyquery is a jQuery-style HTML parsing library for Python, similar to bs4
Usage
- Initialization
from pyquery import PyQuery as pq
doc = pq(html)  # parse an HTML string
doc = pq(url="http://news.baidu.com/")  # parse a web page
doc = pq(filename="./a.html")  # parse an HTML file
- Basic CSS selectors
from pyquery import PyQuery as pq
html = '''
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
print(doc("#wrap .s_from link"))
Output:
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
- Finding child elements
from pyquery import PyQuery as pq
html = '''
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
# find child elements
doc = pq(html)
items = doc("#wrap")
print(items)
print("Type: %s" % type(items))
link = items.find('.s_from')
print(link)
link = items.children()
print(link)
Output:
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
Type: <class 'pyquery.pyquery.PyQuery'>
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
- Finding the parent element
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
items = doc(".s_from")
print(items)
# find the parent element
parent_href = items.parent()
print(parent_href)
Output:
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
parent returns the enclosing outer tag together with its content. The similar parents method returns all ancestor nodes.
- Finding sibling elements
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
items = doc("link.active1.a123")
print(items)
# find sibling elements
siblings_href = items.siblings()
print(siblings_href)
Output:
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
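As the output shows, siblings returns the other tags at the same level.
Conclusion: child, parent, and sibling lookups all return PyQuery objects, so you can run further selections on their results.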
- Iterating over results
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it)
Output:
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
- Getting attribute values
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.attr('href'))
    print(it.attr.href)
Output:
http://asda.com
http://asda.com
http://asda1.com
http://asda1.com
http://asda2.com
http://asda2.com
- Getting text
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.text())
Output:
asdadasdad12312
asdadasdad12312
asdadasdad12312
- Getting inner HTML
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.html())
Output:
<a>asdadasdad12312</a>
asdadasdad12312
asdadasdad12312
Common DOM operations
- Adding and removing classes: addClass, removeClass
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("Added: %s" % it.addClass('active1'))
    print("Removed: %s" % it.removeClass('active1'))
Output:
Added: <link href="http://asda.com"><a>asdadasdad12312</a></link>
Removed: <link href="http://asda.com"><a>asdadasdad12312</a></link>
Added: <link href="http://asda1.com">asdadasdad12312</link>
Removed: <link href="http://asda1.com">asdadasdad12312</link>
Added: <link href="http://asda2.com">asdadasdad12312</link>
Removed: <link href="http://asda2.com">asdadasdad12312</link>
Note that a class that is already present is not added again
- attr gets or sets an attribute; css adds a style property
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("Modified: %s" % it.attr('class', 'active'))
    print("Added: %s" % it.css('font-size', '14px'))
Output:
Modified: <link href="http://asda.com"><a>asdadasdad12312</a></link>
Added: <link href="http://asda.com" style="font-size: 14px"><a>asdadasdad12312</a></link>
Modified: <link href="http://asda1.com">asdadasdad12312</link>
Added: <link href="http://asda1.com" style="font-size: 14px">asdadasdad12312</link>
Modified: <link href="http://asda2.com">asdadasdad12312</link>
Added: <link href="http://asda2.com" style="font-size: 14px">asdadasdad12312</link>
attr and css modify the object in place
- remove: removes tags
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("div")
print('Text before removal:\n%s' % its.text())
it = its.remove('ul')
print('Text after removal:\n%s' % it.text())
Output:
Text before removal:
hello nihao
asdasd
asdadasdad12312
asdadasdad12312
asdadasdad12312
Text after removal:
hello nihao
For other DOM methods, see:
http://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
hello nihao
<ul >
asdasd
<link class='active1 a123' href="http://asda.com"><a>helloasdadasdad12312</a></link>
<link class='active2' href="http://asda1.com">asdadasdad12312</link>
<link class='movie1' href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
'''
doc = pq(html)
its = doc("link:first-child")
print('First tag: %s' % its)
its = doc("link:last-child")
print('Last tag: %s' % its)
its = doc("link:nth-child(2)")
print('Second tag: %s' % its)
its = doc("link:gt(0)")  # zero-based index
print("Tags after index 0: %s" % its)
its = doc("link:nth-child(2n-1)")
print("Odd-numbered tags: %s" % its)
its = doc("link:contains('hello')")
print("Tags whose text contains hello: %s" % its)
Output:
First tag: <link href="http://asda.com"><a>helloasdadasdad12312</a></link>
Last tag: <link href="http://asda2.com">asdadasdad12312</link>
Second tag: <link href="http://asda1.com">asdadasdad12312</link>
Tags after index 0: <link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
Odd-numbered tags: <link href="http://asda.com"><a>helloasdadasdad12312</a></link>
<link href="http://asda2.com">asdadasdad12312</link>
Tags whose text contains hello: <link href="http://asda.com"><a>helloasdadasdad12312</a></link>
