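I created a pipeline that puts all the scraped data into a SQLite database, but my spider does not finish the pagination. This is what I get when the spider closes. I should be getting around 45k results, but I only get 420. Why might this be?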

2021-12-06 14:47:55 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-06 14:47:55 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:60891/session/d441b41f-b62b-4c64-a5ef-68329c18dd4e {}
2021-12-06 14:47:56 [urllib3.connectionpool] DEBUG: http://127.0.0.1:60891 "DELETE /session/d441b41f-b62b-4c64-a5ef-68329c18dd4e HTTP/1.1" 200 14
2021-12-06 14:47:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-12-06 14:47:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 7510132,
 'downloader/response_count': 15,
 'downloader/response_status_count/200': 15,
 'elapsed_time_seconds': 89.469538,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 6, 20, 47, 55, 551566),
 'item_scraped_count': 420,
 'log_count/DEBUG': 577,
 'log_count/INFO': 11,
 'request_depth_max': 14,
 'response_received_count': 15,
 'scheduler/dequeued': 15,
 'scheduler/dequeued/memory': 15,
 'scheduler/enqueued': 15,
 'scheduler/enqueued/memory': 15,
 'start_time': datetime.datetime(2021, 12, 6, 20, 46, 26, 82028)}
2021-12-06 14:47:56 [scrapy.core.engine] INFO: Spider closed (finished)

This is my spider:

import scrapy
from scrapy_selenium import SeleniumRequest


class HomesSpider(scrapy.Spider):
    name = 'homes'

    def remove_characters(self, value):
        return value.strip(' m2')

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/v1c1097l1021p1',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        homes = response.xpath("//div[@id='tileRedesign']/div")
        for home in homes:
            yield {
                'price': home.xpath("normalize-space(.//span[@class='ad-price']/text())").get(),
                'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                'bathrooms': home.xpath("//div[@class='chiplets-inline-block re-bathroom']/text()").get(),
                'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                'm2': self.remove_characters(home.xpath("normalize-space(.//div[@class='chiplets-inline-block surface-area']/text())").get()),
                'link': home.xpath("//div[@class='tile-desc one-liner']/a/@href").get()
            }

        next_page = response.xpath("//a[@class='icon-pagination-right']/@href").get()
        if next_page:
            absolute_url = f"https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/v1c1097l1021p1{next_page}"
            yield SeleniumRequest(
                url=absolute_url,
                wait_time=3,
                callback=self.parse,
                dont_filter=True
            )

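Could this be related to my user_agent, which I have already set explicitly in settings.py, or am I being banned from this page? The HTML of the page has not changed at all, either.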

Thanks.

Answer from a uj5u.com user:

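Your code works as you intended; the problem is in the pagination part. I moved the pagination into the start URLs by generating every page URL up front. This type of pagination is always accurate, and it is more than twice as fast as following the next-page link. There are 50 pages, and the total number of items scraped is 1400.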
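
As a quick sanity check on those figures, here is a small arithmetic sketch that uses only the numbers from the two Scrapy stats dumps in this thread (15 responses and 420 items in the question, 50 responses and 1400 items in this answer). The variable names are illustrative. Under the assumption that every listing page yields the same number of items, 50 pages top out at 1400 items, so roughly 45k items would need far more pages than the 50 requested here.

```python
import math

# Figures copied from the two Scrapy stats dumps in this thread.
question_items, question_pages = 420, 15   # item_scraped_count, response_count
answer_items, answer_pages = 1400, 50

items_per_page_question = question_items / question_pages
items_per_page_answer = answer_items / answer_pages
print(items_per_page_question, items_per_page_answer)

# Pages that would be needed for the ~45k items the asker expected,
# assuming the same number of items on every page.
pages_for_45k = math.ceil(45_000 / items_per_page_answer)
print(pages_for_45k)
```

Both runs come out at the same items-per-page figure, which is consistent with the original spider having simply stopped after 15 pages rather than having dropped items within a page.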

Script

import scrapy
from scrapy_selenium import SeleniumRequest


class HomesSpider(scrapy.Spider):
    name = 'homes'
    def remove_characters(self, value):
        return value.strip(' m2')

    def start_requests(self):
        # Build the 50 listing-page URLs up front instead of following the "next" link.
        urls = [
            f'https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-{i}/v1c1097l1021p50'
            for i in range(1, 51)
        ]
        for url in urls:
            yield SeleniumRequest(
                url=url,
                wait_time=5,
                callback=self.parse
            )

    def parse(self, response):
        # Each direct child of the tile container is one listing.
        homes = response.xpath("//div[@id='tileRedesign']/div")
        for home in homes:
            # Every query starts with './/' so it stays inside the current tile.
            # The run logged below used a bare '//' for 'bathrooms' and 'link',
            # which returns the first match on the whole page for every item.
            yield {
                'price': home.xpath("normalize-space(.//span[@class='ad-price']/text())").get(),
                'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                'bathrooms': home.xpath(".//div[@class='chiplets-inline-block re-bathroom']/text()").get(),
                'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                'm2': self.remove_characters(home.xpath("normalize-space(.//div[@class='chiplets-inline-block surface-area']/text())").get()),
                'link': home.xpath(".//div[@class='tile-desc one-liner']/a/@href").get()
            }
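
One detail in remove_characters may be worth checking. `str.strip(' m2')` does not remove the suffix " m2"; it removes any run of the characters space, "m" and "2" from both ends of the string, so leading or trailing 2s in the number itself can be lost. The sketch below shows that behaviour next to one possible alternative. `extract_area` is a hypothetical helper, and the sample strings are made up rather than taken from the site, so whether this actually affects the scraped values depends on the text the page returns.

```python
import re


def remove_characters(value):
    # Same logic as in the spider: strips the characters ' ', 'm', '2'
    # from both ends, not the literal suffix ' m2'.
    return value.strip(' m2')


def extract_area(value):
    # Hypothetical alternative: keep the first number found in the text,
    # allowing '.' or ',' between digit groups.
    match = re.search(r'\d+(?:[.,]\d+)*', value or '')
    return match.group(0) if match else ''


samples = ['151 m2', '224 m2', '122 m2', '']
for text in samples:
    print(repr(text), repr(remove_characters(text)), repr(extract_area(text)))
```

The regular expression is only one way to do this; removing a fixed suffix with `str.removesuffix` (Python 3.9+) would also work if the unit text is always the same.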

Output

{'price': '$3,520,664', 'location': 'Santiago de Querétaro', 'description': 'Paso de los Toros Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '151', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '$4,690,000', 'location': 'El Refugio', 'description': 'Riaño Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '224', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-rincones-marques/5d6951eee4b05e9aaae12de6'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '$4,690,000', 'location': 'El Refugio', 'description': 'Riaño Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '224', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-07 06:06:33 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:65206/session/1487a9ea1c9752794aad497613552337 {}
2021-12-07 06:06:33 [urllib3.connectionpool] DEBUG: http://127.0.0.1:65206 "DELETE /session/1487a9ea1c9752794aad497613552337 HTTP/1.1" 200 14
2021-12-07 06:06:33 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-12-07 06:06:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 23589849,
 'downloader/response_count': 50,
 'downloader/response_status_count/200': 50,
 'elapsed_time_seconds': 150.933428,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 7, 0, 6, 33, 111357),
 'item_scraped_count': 1400,

... and so on
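
In the output above, several different listings share the same 'link', and 'bathrooms' is '2' even on rows that are otherwise empty. That pattern is what an XPath beginning with `//` produces when it is run on a sub-selector: in Scrapy's selectors `//` searches from the document root, whereas `.//` stays inside the current node. The following is a minimal sketch on made-up markup using the standard library's `xml.etree.ElementTree`. ElementTree does not accept a leading `//` on an element, so the document-wide search is written here as a search from the root; the hrefs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics two listing tiles; the hrefs are hypothetical.
HTML = """
<div id="tileRedesign">
  <div><div class="tile-desc one-liner"><a href="/d-listing-a/1">A</a></div></div>
  <div><div class="tile-desc one-liner"><a href="/d-listing-b/2">B</a></div></div>
</div>
""".strip()

root = ET.fromstring(HTML)
tiles = root.findall("./div")
LINK = ".//div[@class='tile-desc one-liner']/a"

# Document-wide lookup: always the first match on the page,
# no matter which tile is being processed.
document_wide = [root.find(LINK).get("href") for _ in tiles]

# Tile-relative lookup: each tile yields its own link.
relative = [tile.find(LINK).get("href") for tile in tiles]

print(document_wide)
print(relative)
```

In a Scrapy callback the equivalent of the tile-relative lookup is a query that starts with `.//` on the per-item selector.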

