Web scraping with pagination does not return all results

I am trying to scrape Indeed.com, but I have run into a problem with pagination. Here is my code:

import scrapy
class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analytics&l=New York, NY&vjk=7b2f6385304ffc78']

    def parse(self, response):
        jobs = response.xpath("//td[@id='resultsCol']")
        for job in jobs:
            yield {
                'Job_title': job.xpath(".//td[@class='resultContent']/div/h2/span/text()").get(),
                'Company_name': job.xpath(".//span[@class='companyName']/a/text()").get(),
                'Company_rating': job.xpath(".//span[@class='ratingNumber']/span/text()").get(),
                'Company_location': job.xpath(".//div[@class='companyLocation']/text()").get(),
                'Estimated_salary': job.xpath(".//span[@class='estimated-salary']/span/text()").get()
            }

        next_page = response.urljoin(response.xpath("//a[@aria-label='Next']/@href").get())

        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

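The problem is that, according to Indeed, 28,789 jobs match my query. However, when I save the scraped output to a CSV file, it contains only 76 rows. I also tried:

next_page = response.urljoin(response.xpath("//ul[@class='pagination-list']/li[position() = last()]/a/@href").get())

but the result was similar. So my question is: what am I doing wrong with the pagination?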

Answer from a uj5u.com user:

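The problem is not the pagination. The problem is that you only get one job from each page. The XPath //td[@id='resultsCol'] matches the single container that wraps the whole results column, so the for loop runs once per page, and each .get() returns only the first match inside that container. The CSV therefore ends up with roughly one row per page visited. Select the individual job cards instead.
It is also better to call urljoin after the if statement. When there is no "Next" link, .get() returns None, and joining first turns that None into the current page's URL, so the if check can never be false.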
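To make the "one job per page" point concrete, here is a minimal sketch. It assumes a heavily simplified, hypothetical page (the company names and markup are made up) and uses the standard library's ElementTree as a stand-in for Scrapy's selectors, so it runs without Scrapy installed. The helper first_company is also hypothetical and only imitates what .get() does.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one results page (not Indeed's real markup).
PAGE = """
<html><body><table><tr>
  <td id="resultsCol">
    <div id="mosaic-provider-jobcards">
      <a><span class="companyName">Acme</span></a>
      <a><span class="companyName">Globex</span></a>
      <a><span class="companyName">Initech</span></a>
    </div>
  </td>
</tr></table></body></html>
"""

root = ET.fromstring(PAGE)

# Container selector: a single match per page, so a loop over it runs once.
containers = root.findall(".//td[@id='resultsCol']")

# Per-card selector: one match per job posting.
cards = root.findall(".//div[@id='mosaic-provider-jobcards']/a")


def first_company(node):
    """Imitate .get(): return the first matching text under node, or None."""
    found = node.find(".//span[@class='companyName']")
    return found.text if found is not None else None


rows_from_container = [first_company(node) for node in containers]
rows_from_cards = [first_company(node) for node in cards]

print(rows_from_container)  # ['Acme']
print(rows_from_cards)      # ['Acme', 'Globex', 'Initech']
```

Looping over the container produces one row no matter how many postings the page holds, while looping over the cards produces one row per posting.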

import scrapy


class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analytics&l=New York, NY&vjk=7b2f6385304ffc78']

    def parse(self, response):
        # Select each job card rather than the single results container,
        # so the loop runs once per posting.
        jobs = response.xpath('//div[@id="mosaic-provider-jobcards"]/a')
        for job in jobs:
            yield {
                'Job_title': job.xpath(".//td[@class='resultContent']/div/h2/span/text()").get(),
                'Company_name': job.xpath(".//span[@class='companyName']/a/text()").get(),
                'Company_rating': job.xpath(".//span[@class='ratingNumber']/span/text()").get(),
                'Company_location': job.xpath(".//div[@class='companyLocation']/text()").get(),
                'Estimated_salary': job.xpath(".//span[@class='estimated-salary']/span/text()").get()
            }

        # get() returns None when the page has no "Next" link.
        next_page = response.xpath("//a[@aria-label='Next']/@href").get()

        if next_page:
            # Build the absolute URL only once a link is known to exist.
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse)
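The reason for moving urljoin inside the if can be shown with the standard library alone. Scrapy documents response.urljoin as a wrapper over urllib.parse.urljoin, so the sketch below calls that function directly. The page URL and both helper names are hypothetical and exist only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the last results page.
PAGE_URL = "https://www.indeed.com/jobs?q=analytics&start=750"


def next_url_join_first(base, href):
    """Join before checking, as in the question's spider."""
    joined = urljoin(base, href)
    # urljoin returns base unchanged when href is None, so this is never falsy.
    return joined if joined else None


def next_url_check_first(base, href):
    """Check before joining, as in the answer's spider."""
    if not href:
        return None
    return urljoin(base, href)


# No "Next" link on the page: href is None.
print(next_url_join_first(PAGE_URL, None) == PAGE_URL)  # True
print(next_url_check_first(PAGE_URL, None))             # None

# A relative "Next" link is resolved against the page URL.
print(next_url_check_first(PAGE_URL, "/jobs?q=analytics&start=760"))
```

With the join-first version the spider requests the current page again on the last page. Scrapy's default duplicate filter would normally drop that request, so the crawl still ends, but the check itself does nothing.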

