Simple examples of writing a crawler in different programming styles
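Note: every version below has the same goal: let the user pick a Baidu Tieba topic and a range of pages, then save the HTML source of each page. Each style has its own strengths; the point is to think them through for yourself.
Before crawling, look for a pattern in the URLs.
Note: the approach below works only because the Tieba URLs turn out to follow such a pattern.
1. Collect the URLs of several pages in one place, compare and edit them, then request the edited URL to check that it is still valid, starting from the URLs below:
(Edit: remove different parts of the URL.)
(Test: open the edited URL in a browser and check that the target page still loads.)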
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0 page 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50 page 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100 page 3
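The comparison of the three URLs can also be done in code rather than by eye. The following is a minimal sketch that uses only the standard library; the helper name `query_params` is made up for this illustration. It splits each URL into its query parameters and reports which ones change from page to page.

```python
from urllib.parse import urlsplit, parse_qs

# The three page URLs listed above.
urls = [
    'https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=0',
    'https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50',
    'https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100',
]


def query_params(url):
    """Return the query string of a URL as a {name: value} dict (hypothetical helper)."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each key appears once here, so keep the first value.
    return {key: values[0] for key, values in parsed.items()}


params = [query_params(url) for url in urls]

# A parameter "varies" if it does not have the same value in every URL.
varying = sorted(key for key in params[0] if len({p[key] for p in params}) > 1)

print('parameters that change between pages:', varying)
print('their values:', [p['pn'] for p in params])
```

Only `pn` changes, and it grows by 50 per page, while `kw` (the percent-encoded topic) and `ie` stay the same.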
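This gives the following format: https://tieba.baidu.com/f?kw=(topic)[&ie=utf-8]&pn=(page number * 50 - 50)
Notes: 1. The part in square brackets is optional (found by the edit-and-test steps above).
2. The parts in parentheses are the values we control and pass in as parameters.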
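The format above can be sketched as a small function. This is only an illustration under the pattern observed so far: the names `build_page_url` and `PAGE_STEP` are made up here, and `urlencode` from the standard library is used in place of manual string concatenation.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tieba.baidu.com/f'
PAGE_STEP = 50  # the step between the pn values observed in the URLs above


def build_page_url(topic, page):
    """Return the listing URL for a 1-based page number of a topic."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    query = urlencode({
        'kw': topic,                    # percent-encoded as UTF-8 by urlencode
        'ie': 'utf-8',
        'pn': (page - 1) * PAGE_STEP,   # page 1 -> 0, page 2 -> 50, page 3 -> 100
    })
    return BASE_URL + '?' + query


print(build_page_url('python', 3))
# prints https://tieba.baidu.com/f?kw=python&ie=utf-8&pn=100
```

Passing the topic from the example URLs reproduces them exactly, which is a quick way to check the formula before sending any request.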
1. Plain script style
from urllib import parse
from urllib import request

name = input('Topic to view: ')
start = int(input('First page: '))
end = int(input('Last page: '))
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko)"
                  " Chrome/90.0.4430.85 Safari/537.36 Edg/90.0.818.46"
}
base_url = 'https://tieba.baidu.com/f?kw='
for i in range(start, end + 1):
    # Page i corresponds to pn = (i - 1) * 50
    num = (i - 1) * 50
    url = base_url + parse.quote(name) + '&ie=utf-8&pn=' + str(num)
    req = request.Request(url, headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    file_name = 'page_' + str(i) + '.html'
    with open(file_name, 'w', encoding='utf-8') as file_obj:
        print('Crawling page %d' % i)
        file_obj.write(html)
2. Function-based style
from urllib import parse
from urllib import request


# Fetch the data
def read_url(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 "
                      "(KHTML, like Gecko)"
                      " Chrome/90.0.4430.85 Safari/537.36 Edg/90.0.818.46"
    }
    req = request.Request(url, headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    return html


# Write the data
def write_page(file_name, html):
    with open(file_name, 'w', encoding='utf-8') as file_obj:
        file_obj.write(html)
        print("Written successfully")


# Main function; everything else is called from here
def main():
    name = input('Topic to view: ')
    start = int(input('First page: '))
    end = int(input('Last page: '))
    base_url = 'https://tieba.baidu.com/f?kw='
    for i in range(start, end + 1):
        num = (i - 1) * 50
        url = base_url + parse.quote(name) + '&ie=utf-8&pn=' + str(num)
        file_name = 'page_' + str(i) + '.html'
        html = read_url(url)
        write_page(file_name, html)


if __name__ == '__main__':
    main()
3. Object-oriented style
from urllib import parse
from urllib import request


class BaiduSpider:
    def __init__(self):
        # Settings shared by every request live on the instance
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko)"
                          " Chrome/90.0.4430.85 Safari/537.36 Edg/90.0.818.46"
        }
        self.base_url = 'https://tieba.baidu.com/f?kw='

    def read_page(self, url):
        req = request.Request(url, headers=self.headers)
        response = request.urlopen(req)
        html = response.read().decode('utf-8')
        return html

    def write_page(self, file_name, html):
        with open(file_name, 'w', encoding='utf-8') as file_obj:
            file_obj.write(html)
            print("Written successfully")

    def main(self):
        name = input('Topic to view: ')
        start = int(input('First page: '))
        end = int(input('Last page: '))
        for i in range(start, end + 1):
            num = (i - 1) * 50
            url = self.base_url + parse.quote(name) + '&ie=utf-8&pn=' + str(num)
            file_name = 'page_' + str(i) + '.html'
            html = self.read_page(url)
            self.write_page(file_name, html)


if __name__ == '__main__':
    yes = BaiduSpider()
    yes.main()
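One thing the class form makes easy to try is replacing a single piece without touching the rest. The sketch below is a hypothetical variant, not the version above: the class name `TiebaSpider`, the `fetch` hook, the `out_dir` argument, and the `crawl` method are all assumptions made for illustration. It takes the topic and page range as arguments instead of calling `input()`, and lets the download step be swapped for a stand-in so the loop can be run without network access.

```python
import os
import tempfile
from urllib import parse, request


class TiebaSpider:
    """Hypothetical variant of the spider that takes arguments instead of calling input()."""

    base_url = 'https://tieba.baidu.com/f?kw='

    def __init__(self, fetch=None, out_dir='.'):
        # fetch is an assumed hook: any callable that maps a URL to HTML text.
        self.fetch = fetch or self._download
        self.out_dir = out_dir

    def _download(self, url):
        # Real download, used only when no stand-in fetch function is given.
        req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with request.urlopen(req) as response:
            return response.read().decode('utf-8')

    def page_url(self, name, page):
        # Same formula as in the three versions above: pn = (page - 1) * 50.
        return self.base_url + parse.quote(name) + '&ie=utf-8&pn=' + str((page - 1) * 50)

    def crawl(self, name, start, end):
        """Save pages start..end (inclusive) and return the paths written."""
        saved = []
        for page in range(start, end + 1):
            html = self.fetch(self.page_url(name, page))
            path = os.path.join(self.out_dir, 'page_%d.html' % page)
            with open(path, 'w', encoding='utf-8') as file_obj:
                file_obj.write(html)
            saved.append(path)
        return saved


# Usage without network access: a stand-in fetch function and a temporary folder.
out_dir = tempfile.mkdtemp()
spider = TiebaSpider(fetch=lambda url: '<html>' + url + '</html>', out_dir=out_dir)
saved = spider.crawl('python', 1, 2)
print(saved)
```

Calling `TiebaSpider().crawl(name, start, end)` with no `fetch` argument would fall back to the real download, as in the versions above.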
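That is everything I wanted to show. I have not described the characteristics of each style, because I am still a beginner and need more hands-on experience before I can understand them clearly. This is an article by a beginner for beginners, so I will have to leave it to fellow beginners to draw their own conclusions.
If you have questions or thoughts, leave a comment and let's discuss!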
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/279810.html
