Preface

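This article uses Python's requests library and regular expressions from the re module to scrape product information from Taobao search results (product title, price, seller location, and number of buyers), then writes the data to an Excel spreadsheet with the xlsxwriter library.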

I. Analyzing the Taobao search URL

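1. The first requirement is to take a product name as input and return the matching information.
To see how a search is encoded, pick any product and look at its URL. Here we search for 书包 (schoolbag); the resulting URL is:
https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306
The URL looks opaque at first, but comparing it with the search term reveals the pattern:

the value of the q parameter is the name of the product being searched for, percent-encoded as UTF-8 (%E4%B9%A6%E5%8C%85 is 书包).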
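
As a quick check of that observation, the sketch below percent-encodes a search term the way it appears in the q parameter. It uses the standard-library urllib.parse.quote. The helper name build_search_url and the reduced parameter set (only q and ie) are assumptions made for illustration; whether Taobao accepts such a shortened URL is not verified here.

```python
from urllib.parse import quote

SEARCH_BASE = "https://s.taobao.com/search"  # host and path taken from the observed URL


def build_search_url(keyword):
    """Return a search URL whose q parameter is the UTF-8 percent-encoded keyword.

    Hypothetical helper: only q and ie are kept from the observed URL.
    """
    return SEARCH_BASE + "?q=" + quote(keyword) + "&ie=utf8"


print(build_search_url("书包"))
```

The encoded term in the printed URL matches the q value in the URL copied from the browser.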
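2. The second requirement is to scrape a user-specified number of result pages.
So let's look at how the URLs of the following pages are built.

From these we can conclude that pagination is controlled by the trailing s parameter: s = 44 * (page number - 1).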
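
The pagination rule can be written down directly. This is a minimal sketch, assuming 44 results per page as inferred above; the names ITEMS_PER_PAGE and page_offset are made up for illustration.

```python
ITEMS_PER_PAGE = 44  # inferred from the observed URLs; may change if the site changes


def page_offset(page):
    """Return the value of the s parameter for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return ITEMS_PER_PAGE * (page - 1)


offsets = [page_offset(p) for p in range(1, 4)]
print(offsets)  # [0, 44, 88]
```

Page 1 maps to an offset of 0, which is consistent with the first-page URL carrying no s parameter at all.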

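II. Inspecting the page source and extracting data with re

1. Inspecting the page source

The page source contains the fields we need: raw_title, view_price, item_loc, and view_sales.

2. Extracting the data with re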

a = re.findall(r'"raw_title":"(.*?)"', html)
b = re.findall(r'"view_price":"(.*?)"', html)
c = re.findall(r'"item_loc":"(.*?)"', html)
d = re.findall(r'"view_sales":"(.*?)"', html)
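
To see what these patterns return, here is a small self-contained sketch that runs them against a made-up fragment. The sample string only imitates the shape of the embedded data (the real page contains many more fields), and the names sample, FIELDS, and extract_fields are illustrative assumptions.

```python
import re

# Made-up fragment imitating the embedded data; not copied from a real page.
sample = (
    '{"raw_title":"Canvas backpack","view_price":"59.00",'
    '"item_loc":"Hangzhou","view_sales":"1200 paid"},'
    '{"raw_title":"Leather satchel","view_price":"128.50",'
    '"item_loc":"Guangzhou","view_sales":"87 paid"}'
)

FIELDS = ("raw_title", "view_price", "item_loc", "view_sales")


def extract_fields(text):
    """Return a dict mapping each field name to the list of values captured for it."""
    # (.*?) is non-greedy: it stops at the first closing quote after the field name.
    return {name: re.findall(r'"%s":"(.*?)"' % name, text) for name in FIELDS}


found = extract_fields(sample)
for name in FIELDS:
    print(name, found[name])
```

Each list holds one value per product, in page order, so the item at index i in every list belongs to the same product as long as no field is missing.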

III. Writing the functions

I wrote three functions. The first fetches the HTML of a page:

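def GetHtml(url):
    # headers is the module-level dict defined in the main block
    r = requests.get(url, headers=headers)
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    r.encoding = r.apparent_encoding  # use the encoding detected from the response body
    return r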

The second builds the list of page URLs:

def Geturls(q, x):
    # First page: no s offset
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
                                                 "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
    urls = [url]
    # Pages 2..x: append s = 44 * (page - 1)
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
              "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
        urls.append(url)
    return urls
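
As an alternative to string concatenation, the query string can be assembled with the standard-library urllib.parse.urlencode, which also percent-encodes the keyword. This is only a sketch: build_urls is a hypothetical name, and it keeps just the q, ie, and s parameters, which Taobao may or may not accept without the others.

```python
from urllib.parse import urlencode

BASE = "https://s.taobao.com/search"  # host and path taken from the observed URL


def build_urls(keyword, pages):
    """Return one search URL per page; page n carries s = 44 * (n - 1)."""
    urls = []
    for page in range(1, pages + 1):
        params = {"q": keyword, "ie": "utf8"}
        if page > 1:
            # The first page is requested without an s offset, as in the original function.
            params["s"] = 44 * (page - 1)
        urls.append(BASE + "?" + urlencode(params))
    return urls


urls_demo = build_urls("书包", 3)
for u in urls_demo:
    print(u)
```

Building the parameters as a dict keeps the pagination rule in one place and avoids stray characters creeping into a long hand-written URL string.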

The third extracts the product fields we need and writes them to the Excel worksheet:

def GetxxintoExcel(html):
    global count  # module-level counter of data rows already written to the worksheet
    a = re.findall(r'"raw_title":"(.*?)"', html)  # (.*?) captures the shortest run of characters up to the next quote
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i]))  # collect the four fields of each item into one tuple
        except IndexError:
            break
    for i in range(len(x)):
        # worksheet.write(row, col, data): row and column are zero-based; row 0 holds the header
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
    count = count + len(x)  # the next page continues after the rows written so far
    print("Done")
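
The pairing and row bookkeeping in this function can also be expressed without index arithmetic or a global counter. The sketch below is an illustration under that assumption, not the author's code: pair_fields and cells_for are hypothetical helpers, and the input lists are made-up values.

```python
def pair_fields(titles, prices, locations, sales):
    """Combine the four parallel lists into per-item tuples.

    zip stops at the shortest list, which gives the same result as
    breaking out of the loop on the first IndexError.
    """
    return list(zip(titles, prices, locations, sales))


def cells_for(records, rows_written):
    """Yield (row, col, value) triples; row 0 is reserved for the header."""
    for i, record in enumerate(records):
        for col, value in enumerate(record):
            yield rows_written + i + 1, col, value


# Made-up input: the price list is one entry short on purpose.
records = pair_fields(["A", "B", "C"], ["1.00", "2.00"], ["X", "Y", "Z"], ["5", "6", "7"])
cells = list(cells_for(records, rows_written=44))
print(records)
print(cells[0], cells[-1])
```

Each triple can be passed straight to worksheet.write(row, col, value), and the caller can add len(records) to its own running total instead of relying on a global variable.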

IV. Writing the main block

if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        # The cookie is specific to each user: paste your own here. Because of anti-scraping
        # measures, you may need to refresh it if you crawl too fast.
        "cookie": "",
    }
    q = input("Product to search for: ")
    x = int(input("Number of pages to scrape: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 20)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.write('A1', 'Title')
    worksheet.write('B1', 'Price')
    worksheet.write('C1', 'Location')
    worksheet.write('D1', 'Buyers')
    for url in urls:
        html = GetHtml(url)
        GetxxintoExcel(html.text)
        time.sleep(5)  # pause between requests
    workbook.close()  # do not open the file in Excel before the program finishes; it is saved in the current directory

V. Complete code

import re
import requests
import xlsxwriter
import time


def GetxxintoExcel(html):
    global count  # module-level counter of data rows already written
    a = re.findall(r'"raw_title":"(.*?)"', html)
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i]))
        except IndexError:
            break
    for i in range(len(x)):
        # Row 0 holds the header, so data rows start at count + 1
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
    count = count + len(x)
    print("Done")


def Geturls(q, x):
    # First page: no s offset
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
                                                 "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
    urls = [url]
    # Pages 2..x: append s = 44 * (page - 1)
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
              "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
        urls.append(url)
    return urls


def GetHtml(url):
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r


if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        "cookie": "",  # paste your own cookie here
    }
    q = input("Product to search for: ")
    x = int(input("Number of pages to scrape: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 20)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.write('A1', 'Title')
    worksheet.write('B1', 'Price')
    worksheet.write('C1', 'Location')
    worksheet.write('D1', 'Buyers')
    for url in urls:
        html = GetHtml(url)
        GetxxintoExcel(html.text)
        time.sleep(5)
    workbook.close()

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qianduan/203206.html

Python web scraping in practice: scraping Taobao product information and exporting it to an Excel spreadsheet (very detailed)
