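I. Technologies used

Python basics
xlsxwriter: writes Excel files
urllib: Python's built-in library for opening URLs, used here to fetch the pages
BeautifulSoup: parses the HTML and extracts the data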

II. Target page

https://tieba.baidu.com/f?kw=%E6%97%85%E6%B8%B8&ie=utf-8&pn=0
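The kw value in this URL is the forum name 旅游 (travel) in percent-encoded UTF-8. A minimal sketch of the round trip using only the standard library; the variable names are illustrative:

```python
import urllib.parse

# Percent-encode the forum name the way it appears in the target URL
encoded = urllib.parse.quote("旅游")
print(encoded)  # %E6%97%85%E6%B8%B8

# Decode it back to check that nothing was lost
decoded = urllib.parse.unquote(encoded)
print(decoded)  # 旅游
```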

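III. Result

The script writes the scraped threads to travel.xlsx, one row per thread, with three columns: title, author, and creation time.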

IV. Install the required libraries

Press Win+R to open the Run dialog
Type cmd to open the command prompt
Install lxml, beautifulsoup4, and xlsxwriter one at a time

pip install lxml
pip install beautifulsoup4
pip install xlsxwriter
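To confirm that the three installs worked, one option is to ask Python whether each module can be found. This is only a sketch: the helper name missing_packages and the REQUIRED mapping are made up for illustration. Note that beautifulsoup4 is imported under the module name bs4.

```python
import importlib.util

# pip package name -> module name used in import statements
REQUIRED = {
    "lxml": "lxml",
    "beautifulsoup4": "bs4",
    "xlsxwriter": "xlsxwriter",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Still missing, install with pip:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```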

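V. Analyzing the page

1. Page pattern

Click the pagination buttons and watch the last URL parameter, pn. It grows by 50 per page, so page n uses pn = (n - 1) × 50.
Page 1: https://tieba.baidu.com/f?kw=旅游&ie=utf-8&pn=0
Page 2: https://tieba.baidu.com/f?kw=旅游&ie=utf-8&pn=50
Page 3: https://tieba.baidu.com/f?kw=旅游&ie=utf-8&pn=100
Page 4: https://tieba.baidu.com/f?kw=旅游&ie=utf-8&pn=150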
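The pattern above can be sketched as a small helper that builds the URL for any page number. The names page_url, BASE_URL, and PAGE_SIZE are illustrative and not part of the original script, which concatenates the strings by hand.

```python
import urllib.parse

BASE_URL = "https://tieba.baidu.com/f"
PAGE_SIZE = 50  # pn grows by 50 per page


def page_url(keyword, page):
    """Build the list URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    query = urllib.parse.urlencode(
        {"kw": keyword, "ie": "utf-8", "pn": (page - 1) * PAGE_SIZE}
    )
    return f"{BASE_URL}?{query}"


for page in range(1, 5):
    print(page, page_url("旅游", page))
```

urlencode percent-encodes the keyword, so the browser-style address with 旅游 in it and the encoded form point to the same page.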

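2. Page information

Thread list
Open https://tieba.baidu.com/f?kw=旅游&ie=utf-8&pn=50
Press F12, or right-click and choose "Inspect" (I am using Google Chrome)

Every thread entry in the list shares the class name j_thread_list
[Screenshot: list analysis]

Author and creation time
The author element has the class frs-author-name, and the creation time has the class is_show_create_time
[Screenshot: author and username analysis]

Title
The title element has the class j_th_tit
[Screenshot: title analysis]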
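The class names found above map directly to CSS selectors. The sketch below runs them against a tiny hand-written HTML sample. The sample markup, its values, and the helper name extract_threads are invented for illustration, and the real page is far larger. It uses Python's built-in html.parser backend so that only beautifulsoup4 is needed. The sample is wrapped in an HTML comment because the full script in this article strips the comment markers for the same reason.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the structure seen in the browser inspector
SAMPLE = """
<ul>
<!--
<li class="j_thread_list">
  <a class="j_th_tit">Weekend trip to Hangzhou</a>
  <span class="frs-author-name">alice</span>
  <span class="is_show_create_time">9-20</span>
</li>
-->
</ul>
"""


def extract_threads(html_text):
    """Return [title, author, created] rows for every complete thread entry."""
    # Drop the comment markers so the parser sees the list as real elements
    cleaned = html_text.replace("<!--", "").replace("-->", "")
    soup = BeautifulSoup(cleaned, "html.parser")
    rows = []
    for thread in soup.select(".j_thread_list"):
        title = thread.select_one(".j_th_tit")
        author = thread.select_one(".frs-author-name")
        created = thread.select_one(".is_show_create_time")
        # Skip entries that lack any of the three fields
        if title is None or author is None or created is None:
            continue
        rows.append([
            title.get_text(strip=True),
            author.get_text(strip=True),
            created.get_text(strip=True),
        ])
    return rows


print(extract_threads(SAMPLE))
```

select_one returns None when nothing matches, which makes it easy to skip incomplete entries instead of failing with an IndexError.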

VI. Full code

import xlsxwriter
# Writes Excel files
import urllib.parse
# URL encoding and decoding
import urllib.request
# Sends HTTP requests
from bs4 import BeautifulSoup
# Parses HTML and extracts data with CSS selectors

url='https://tieba.baidu.com/f?kw='+urllib.parse.quote('旅游')+'&ie=utf-8&pn='
# Baidu Tieba travel forum (旅游); the page offset is appended later
# urllib.parse.quote("旅游") returns %E6%97%85%E6%B8%B8

herders={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36', 'Referer':'https://tieba.baidu.com/','Connection':'keep-alive'}
# Request headers

data = []
# Every scraped row is stored in this list

"""
getList: fetch one list page and collect its thread rows
url: address of the page
"""
def getList(url):

    req = urllib.request.Request(url,headers=herders)
    # Attach the request headers
    response=urllib.request.urlopen(req)
    # Send the request and get the response

    htmlText = response.read().decode("utf-8").replace("<!--","").replace("-->","")
    # Read the response body, decode it as UTF-8, and remove the HTML comment markers
    # The content we need sits inside HTML comments, so <!-- and --> must be removed first

    html = BeautifulSoup(htmlText,"lxml")
    # Create the BeautifulSoup object

    thread_list=html.select(".j_thread_list")
    # Select every thread entry on the page


    # Loop over the thread entries
    for thread in thread_list:
        title = thread.select(".j_th_tit")[0].get_text()
        author = thread.select(".frs-author-name")[0].get_text()
        time= thread.select(".is_show_create_time")[0].get_text()
        # Extract the title, author, and creation time
        print(title) # Print the title
        data.append([title,author,time])
        # Append the row to the overall data

"""
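Fetch the list pages one by one, p pages at most
url: base page address, without the pn value
p=5: maximum number of pages, 5 by default
"""
def getPage(url,p=5):
    for i in range(p):
        link = url+str(i*50)
        # Append the offset: page 1 is 0, page 2 is 50, page 3 is 100, page 4 is 150
        getList(link)
        # Run getList on this page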

"""
Write the data to an Excel file
data: the rows to write
"""
def writeExecl(data):
    lens = len(data)
    # Number of data rows
    workbook = xlsxwriter.Workbook('travel.xlsx')
    # Create an Excel file
    sheet = workbook.add_worksheet()
    # Add a worksheet
    sheet.write_row("A1",["Title","Author","Time"])
    # Write the header row
    for i in range(2, lens + 2):
        sheet.write_row("A"+str(i),data[i - 2])
        # Data starts at row 2, so row i holds data[i - 2]
    workbook.close()
    # Close the workbook, which saves the file
    print("Data written to the xlsx file successfully!")

"""
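Define the main function
"""
def main():
    getPage(url,5) # Fetch the pages
    writeExecl(data) # Write the data to Excel

# Run main only when this file is executed as a script (its module name is then __main__)
if __name__ == '__main__':
    main()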
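The script above calls urlopen without any error handling, so one failed page stops the whole run. The following is a possible hardening, not part of the original code: fetch_html and fetch_all are hypothetical helpers, and the timeout and delay values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_html(url, headers=None, timeout=10):
    """Return the decoded page text, or None if urlopen raises URLError."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            # errors="replace" keeps going if a few bytes are not valid UTF-8
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too
        print(f"Skipping {url}: {exc}")
        return None


def fetch_all(links, headers=None, delay=1.0):
    """Fetch each link in turn, pausing between requests; failed pages are skipped."""
    pages = []
    for index, link in enumerate(links):
        if index > 0:
            time.sleep(delay)  # assumed pause between requests, tune as needed
        text = fetch_html(link, headers=headers)
        if text is not None:
            pages.append(text)
    return pages
```

In the script, the body of getList could take its text from such a helper and return early when it gets None.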

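VII. Glossary

main        the main (primary) function
def         (define) defines a function
getPage     get page
writeExecl  (write Excel) write to Excel
workbook    an Excel workbook
sheet       a worksheet
write_row   write a row
add         add
close       close
len         (length) length
data        data
range       range
str         (string) string
append      add to the end
author      author
select      select
Beautiful   beautiful
Soup        soup
herders     (headers) header information
response    response
read        read
decode      decode
Request     request
parse       parse
quote       quote; here, percent-encode a string for use in a URL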

