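Preface

The text and images in this article come from the internet and are for learning and discussion only, with no commercial use. If there is any problem, please contact us promptly so we can deal with it.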

Free online video tutorials on Python web scraping, data analysis, web development, and more:

https://space.bilibili.com/523606542

Previous articles

Python Web Scraping for Beginners (1): Scraping Douban Movie Rankings

Python Web Scraping for Beginners (2): Scraping Novels

Development environment

Python 3.6
PyCharm

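1. Define the requirements

Scrape the fields boxed in the screenshot: for each listing, the title, community, area, basic house information, total price, unit price, and posting information.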

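2. Request the page

Open the developer tools (press F12, or right-click and choose Inspect) and select the Network tab to see what data is returned.

The developer tools show that the site serves static pages, so requesting the URL directly returns the data.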

import requests

url = 'https://cs.lianjia.com/ershoufang/'
headers = {
    # Send a browser-like User-Agent with the request
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response.text)

If you are not sure whether the returned data contains what you want, copy some text from the web page and search for it in PyCharm's output.
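
The manual search can also be done in code. This is only a minimal sketch: the helper name contains_all and the sample strings are made up for illustration, and sample_html stands in for response.text.

```python
from typing import Dict, List


def contains_all(html: str, needles: List[str]) -> Dict[str, bool]:
    """Report, for each expected snippet, whether it appears in the HTML."""
    return {needle: needle in html for needle in needles}


# Stand-in for response.text; a real page would be much longer.
sample_html = '<div class="title"><a>Sample listing title</a></div>'
print(contains_all(sample_html, ['Sample listing title', 'totalPrice']))
```

If a snippet you can see in the browser is reported as missing, the data is probably loaded by a separate request rather than included in the static page.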

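3. Parse the data

Since the site serves static pages, you can look in the Elements panel of the developer tools to find where the data sits.

As the screenshot shows, the relevant data is all contained in li tags, so it can be extracted with the parsel parsing library.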

import parsel

selector = parsel.Selector(response.text)
lis = selector.css('.sellListContent li')
for li in lis:
    # Title
    title = li.css('.title a::text').get()
    # Location links: [community, area]
    positionInfo = li.css('.positionInfo a::text').getall()
    # Community
    community = positionInfo[0]
    # Area
    address = positionInfo[1]
    # Basic house information
    houseInfo = li.css('.houseInfo::text').get()
    # Total price (the unit 萬 means ten thousand)
    Price = li.css('.totalPrice span::text').get() + '萬'
    # Unit price; the page text is Simplified Chinese, so strip the label '单价'
    unitPrice = li.css('.unitPrice span::text').get().replace('单价', '')
    # Posting information
    followInfo = li.css('.followInfo::text').get()
    # Keys: title, community, area, house info, total price, unit price, posting info
    dit = {
        '標題': title,
        '小區': community,
        '地名': address,
        '房子基本資訊': houseInfo,
        '房價': Price,
        '單價': unitPrice,
        '發布資訊': followInfo,
    }
    print(dit)

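When I ran this, it raised an error:

IndexError: list index out of range

Don't panic. If index 0 is already out of range, the list is empty, which means no data was extracted for that item. So we need to look at what comes right after the <精裝好房...> listing.

Searching the page source shows that an ad is inserted in the middle of the list, and it is also wrapped in an li tag. A simple check handles it: the ad has no title, so test whether a title was found. If there is one, extract the fields; if not, skip the item.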

for li in lis:
    # Title
    title = li.css('.title a::text').get()
    # Ads have no title, so only process items that do
    if title:
        # Location links: [community, area]
        positionInfo = li.css('.positionInfo a::text').getall()
        # Community
        community = positionInfo[0]
        # Area
        address = positionInfo[1]
        # Basic house information
        houseInfo = li.css('.houseInfo::text').get()
        # Total price (the unit 萬 means ten thousand)
        Price = li.css('.totalPrice span::text').get() + '萬'
        # Unit price; the page text is Simplified Chinese, so strip the label '单价'
        unitPrice = li.css('.unitPrice span::text').get().replace('单价', '')
        # Posting information
        followInfo = li.css('.followInfo::text').get()
        dit = {
            '標題': title,
            '小區': community,
            '地名': address,
            '房子基本資訊': houseInfo,
            '房價': Price,
            '單價': unitPrice,
            '發布資訊': followInfo,
        }
        print(dit)

With this check the error no longer occurs.
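
The title check works because the ad has no title. A slightly more defensive variant is to check the list itself before indexing into it. The sketch below is only an illustration: split_position is a hypothetical helper, and the sample lists stand in for the result of li.css('.positionInfo a::text').getall().

```python
from typing import List, Optional, Tuple


def split_position(position_info: List[str]) -> Optional[Tuple[str, str]]:
    """Return (community, address), or None when either part is missing."""
    if len(position_info) < 2:
        # Too few links were found, as happens for the ad entry.
        return None
    community = position_info[0].strip()
    address = position_info[1].strip()
    return community, address


# Made-up inputs: a normal listing, then an entry with no location links.
print(split_position(['Community A ', 'District B']))
print(split_position([]))
```

In the loop, an item for which the helper returns None can be skipped with continue, so a malformed entry never reaches the indexing step.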

4. Save the data (data persistence)

As with the Douban movie data, use the csv module to save the data to a CSV file that can be opened in Excel.

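import csv

# Create the file (append mode)
f = open('二手房資料.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['標題', '小區', '地名', '房子基本資訊',
                                           '房價', '單價', '發布資訊'])
# Write the header row
csv_writer.writeheader()
# ... (request and parsing code that builds dit) ...
# Write one row per listing
csv_writer.writerow(dit)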
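
Because the file is opened in append mode, running the script twice writes the header row twice. One way around that is sketched below, under a few assumptions: append_rows and FIELDNAMES are hypothetical names, the field names are copied from the code above, and the utf-8-sig encoding is a swapped-in choice (it adds a byte order mark that helps Excel detect UTF-8).

```python
import csv
import os

# Column names copied from the DictWriter above.
FIELDNAMES = ['標題', '小區', '地名', '房子基本資訊', '房價', '單價', '發布資訊']


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only once."""
    # The header is needed only when the file is missing or still empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # The with block closes the file even if writing fails.
    with open(path, mode='a', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Inside the parsing loop, a call such as append_rows('二手房資料.csv', [dit]) would take the place of csv_writer.writerow(dit).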

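5. Scrape multiple pages

# URL of page 2
url_2 = 'https://cs.lianjia.com/ershoufang/pg2/'
# URL of page 3
url_3 = 'https://cs.lianjia.com/ershoufang/pg3/'
# URL of page 4
url_4 = 'https://cs.lianjia.com/ershoufang/pg4/'

As the URLs above show, only the pg parameter changes, so a for loop over the page number is all that is needed to scrape multiple pages.

for page in range(1, 101):
    url = f'https://cs.lianjia.com/ershoufang/pg{page}/'

That is all it takes to scrape multiple pages.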
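
The loop can be wrapped into small helpers with a pause between requests. This is a sketch rather than the article's code: page_urls, crawl, and the one-second default delay are assumptions made for illustration, and the demo passes a stand-in fetch function so that nothing is actually requested.

```python
import time


def page_urls(first, last, base='https://cs.lianjia.com/ershoufang/'):
    """Yield listing URLs for pages first..last, inclusive."""
    for page in range(first, last + 1):
        yield f'{base}pg{page}/'


def crawl(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping after every request."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)
    return results


# Demo with a stand-in fetch that just returns the URL it was given.
print(crawl(page_urls(1, 3), fetch=lambda url: url, delay=0))
```

In real use, fetch would be a function that calls requests.get with the headers shown earlier and returns response.text.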

Result

Please credit the source when reposting. Original article: "Python Web Scraping for Beginners (3): Scraping Lianjia Second-Hand Housing Data", https://www.uj5u.com/houduan/252467.html