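Contents
- Preface
- I. Analyzing the Taobao URL structure
- II. Inspecting the page source and extracting information with the re library
- 1. Inspecting the source
- 2. Extracting information with re
- III. Writing the functions
- IV. Writing the main routine
- V. Complete code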
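Preface
This article uses Python's requests library and re regular expressions to scrape product information from Taobao (product name, price, seller location, and sales volume), then writes the results to an Excel spreadsheet with the xlsxwriter library. The final result looks like this:
[Figure: screenshot of the resulting Excel sheet]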
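I. Analyzing the Taobao URL structure
1. The first requirement is to take a product name as input and return the matching information.
So we pick an arbitrary product and look at its URL. Here we search for 书包 (schoolbag); opening the results page gives this URL: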
https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306
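The URL on its own may not reveal much, but comparing it with the search page gives some clues:
[Figure: screenshot of the search page]
The value after q is the name of the product we searched for. Here %E4%B9%A6%E5%8C%85 is simply 书包 percent-encoded as UTF-8.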
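2. The second requirement is to crawl as many result pages as the number the user enters.
So let's look at how the URLs of the following pages are built:
[Figure: screenshot of the URLs of later result pages]
From this we can conclude that pagination is controlled by the trailing s parameter: s = 44 × (page number - 1). Page 1 therefore has s = 0, page 2 has s = 44, page 3 has s = 88, and so on.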
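The two observations above (the keyword goes into q, the page offset into s) can be sketched as a small helper. This is only an illustration: the name build_search_url is hypothetical, and the sketch keeps just the q and s parameters, assuming the other tracking parameters in the URL are optional, which this article does not verify.

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # offset step observed in the article's URLs


def build_search_url(keyword, page):
    """Return a search URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    params = {"q": keyword, "s": ITEMS_PER_PAGE * (page - 1)}
    # urlencode percent-encodes the keyword as UTF-8
    return "https://s.taobao.com/search?" + urlencode(params)


print(build_search_url("书包", 1))  # ...search?q=%E4%B9%A6%E5%8C%85&s=0
print(build_search_url("书包", 3))  # ...search?q=%E4%B9%A6%E5%8C%85&s=88
```

Letting urlencode do the encoding avoids hand-concatenating the keyword into the query string.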
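II. Inspecting the page source and extracting information with the re library
1. Inspecting the source
[Figure: screenshot of the page source]
The fields we need all appear in the source: raw_title (product name), view_price (price), item_loc (location), and view_sales (number of buyers).
2. Extracting information with re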
a = re.findall(r'"raw_title":"(.*?)"', html)
b = re.findall(r'"view_price":"(.*?)"', html)
c = re.findall(r'"item_loc":"(.*?)"', html)
d = re.findall(r'"view_sales":"(.*?)"', html)
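To see what these patterns return, here is a minimal sketch run against a made-up string. The sample text is invented for illustration and only imitates the shape of the fields; it is not real Taobao data.

```python
import re

# Invented fragment shaped like the fields above; not real Taobao data.
sample = (
    '{"raw_title":"Canvas backpack","view_price":"59.00",'
    '"item_loc":"Zhejiang Hangzhou","view_sales":"1200 paid"},'
    '{"raw_title":"Leather satchel","view_price":"128.00",'
    '"item_loc":"Guangdong Guangzhou","view_sales":"85 paid"}'
)

# (.*?) is a lazy group: it stops at the first closing quote it meets
titles = re.findall(r'"raw_title":"(.*?)"', sample)
prices = re.findall(r'"view_price":"(.*?)"', sample)

print(titles)  # ['Canvas backpack', 'Leather satchel']
print(prices)  # ['59.00', '128.00']
```

With a greedy (.*) the match would run on to the last quote in the text, so the lazy form is what keeps each field separate.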
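III. Writing the functions
I wrote three functions. The first one fetches the HTML page:
def GetHtml(url):
    # headers is the module-level dict defined in the main routine
    r = requests.get(url, headers=headers)
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    r.encoding = r.apparent_encoding  # use the encoding detected from the body
    return r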
The second one builds the list of page URLs:
def Geturls(q, x):
    # Page 1: the keyword q goes into the q parameter
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
          "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
    urls = [url]
    # Pages 2..x: append the offset s = 44 * (page number - 1)
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
              "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
        urls.append(url)
    return urls
The third one extracts the product information we need and writes it to the Excel sheet:
def GetxxintoExcel(html):
    global count  # module-level row counter, shared across pages
    a = re.findall(r'"raw_title":"(.*?)"', html)  # (.*?) lazily captures up to the next quote
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i]))  # collect each item's fields as one tuple
        except IndexError:
            break  # one of the lists is shorter than the title list
    for i in range(len(x)):
        # worksheet.write(row, col, data): rows and columns are 0-based, row 0 holds the header
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
    count = count + len(x)  # the next page continues right after the rows just written
    print("Done")
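The index loop with try/except above can also be written with zip, which stops at the shortest list on its own. The sketch below is an alternative formulation, not the code used in this article; pair_fields is a hypothetical name and the lists are made-up values.

```python
def pair_fields(titles, prices, locations, sales):
    """Combine four parallel lists into row tuples (hypothetical helper)."""
    # zip stops at the shortest input, so rows without all four fields are dropped
    return list(zip(titles, prices, locations, sales))


rows = pair_fields(
    ["A", "B", "C"],
    ["1.00", "2.00", "3.00"],
    ["X", "Y", "Z"],
    ["10", "20"],  # one value short on purpose
)
print(rows)  # [('A', '1.00', 'X', '10'), ('B', '2.00', 'Y', '20')]

count = 0  # rows already written; row 0 is the header
for row_index, row in enumerate(rows, start=count + 1):
    print(row_index, row)  # the sheet row each tuple would be written to
```

Both versions share one limitation: if a field is missing for an item in the middle of the page, the four lists shift against each other and later rows mix values from different items.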
IV. Writing the main routine
if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        # The cookie is unique to each user. Because of anti-scraping measures,
        # crawling too fast may force you to refresh your own cookie later on.
        "cookie": "",
    }
    q = input("Product to search for: ")
    x = int(input("Number of pages to crawl: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 20)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.write('A1', 'Name')
    worksheet.write('B1', 'Price')
    worksheet.write('C1', 'Location')
    worksheet.write('D1', 'Buyers')
    for url in urls:
        html = GetHtml(url)
        GetxxintoExcel(html.text)
        time.sleep(5)  # pause between requests
    workbook.close()  # do not open the Excel file before the program finishes; it is saved in the current directory
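Since the cookie may stop working partway through a crawl, it can help to check whether a response still looks like a result page before writing anything. The sketch below is an assumption on my part, not something this article tested: it supposes that a blocked or login page simply lacks the raw_title field, and looks_like_result_page is a hypothetical name.

```python
import re


def looks_like_result_page(html):
    """Return True if the text contains at least one "raw_title" field."""
    # Assumption: a page served after the cookie expires has no such field
    return re.search(r'"raw_title":"', html) is not None


print(looks_like_result_page('{"raw_title":"Canvas backpack"}'))    # True
print(looks_like_result_page('<html><title>login</title></html>'))  # False
```

In the loop above, a False result would be the cue to stop and refresh the cookie rather than keep requesting pages.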
V. Complete code
import re
import time

import requests
import xlsxwriter


def GetxxintoExcel(html):
    global count  # module-level row counter, shared across pages
    a = re.findall(r'"raw_title":"(.*?)"', html)
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i]))
        except IndexError:
            break
    for i in range(len(x)):
        # Row 0 holds the header, so data starts at row count + 1
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
    count = count + len(x)
    print("Done")


def Geturls(q, x):
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
          "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
    urls = [url]
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
              "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
        urls.append(url)
    return urls


def GetHtml(url):
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r


if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        "cookie": "",  # paste your own cookie here
    }
    q = input("Product to search for: ")
    x = int(input("Number of pages to crawl: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 20)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.write('A1', 'Name')
    worksheet.write('B1', 'Price')
    worksheet.write('C1', 'Location')
    worksheet.write('D1', 'Buyers')
    for url in urls:
        html = GetHtml(url)
        GetxxintoExcel(html.text)
        time.sleep(5)  # pause between requests
    workbook.close()
