Scrapy: Parse data from multiple pages (pagination) and combine the yielded output into a single array

What I want to do is scrape multiple pages and yield the results in a single array.

So far I have tried:

import scrapy


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = ["http://realtor.com/"]

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Connection": "keep-alive",
        "If-None-Match": '"d9b9d-uhdwucnqmaT5gbxbobPzbm uEgs"',
        "Cache-Control": "max-age=0",
        "TE": "trailers",
    }

    def start_requests(self):
        url = "https://www.realtor.com/realestateandhomes-search/Seattle_WA/show-newest-listings"

        for page in range(1, 4):
            next_page = url + "/pg-" + str(page)
            yield scrapy.Request(
                url=next_page, headers=self.headers, callback=self.parse, priority=1
            )

    def parse(self, response):
        # extract data
        for card in response.css("ul.property-list"):
            item = {"price": card.css("span[data-label=pc-price]::text").getall()}
            yield item

This gives me three separate lists of prices, one per page.

['$740,000', '$998,000', '$620,000', ......, '$719,000', '$2,975,000', '$1,099,000']
['$500,000', '$474,000', '$725,000', ......, '$895,000', '$619,500', '$1,199,000']
['$1,095,000', '$475,000', '$700,000', ........, '$950,000', '$995,000', '$639,950']

What I am looking for is a single list like this:

$740,000 - 1
$998,000 - 2
$620,000 - 3
$719,000 - 4
     .
     .
     .
$995,000 - 143
$639,950 - 144

Answer:

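I'm not sure exactly what produced the example lists, but I assume you called one of the RealtorSpider methods and that this is what gave you three lists. Because these methods use yield to return values, they produce a generator, so you may need to call list on their output to get a list instead.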
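The difference between a generator of per-page lists and one flat list can be illustrated without Scrapy. This is only a minimal sketch: `page_prices` is a hypothetical stand-in for a callback that yields one list of prices per page, and the sample values are taken from the lists in the question.

```python
from itertools import chain


def page_prices():
    """Hypothetical stand-in for a callback that yields one price list per page."""
    yield ["$740,000", "$998,000"]
    yield ["$500,000", "$474,000"]
    yield ["$1,095,000", "$475,000"]


# list() materializes the generator, but the result is still one sub-list per page.
per_page = list(page_prices())

# chain.from_iterable() flattens the sub-lists into a single list.
all_prices = list(chain.from_iterable(per_page))
print(all_prices)
```

Calling list alone is therefore not enough here; the per-page lists also have to be flattened.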

I suggest editing your realtor.py file as follows:

import scrapy
import json

class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = ["http://realtor.com/"]
    prices = []
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Connection": "keep-alive",
        "If-None-Match": '"d9b9d-uhdwucnqmaT5gbxbobPzbm uEgs"',
        "Cache-Control": "max-age=0",
        "TE": "trailers",
    }

    def start_requests(self):
        url = "https://www.realtor.com/realestateandhomes-search/Seattle_WA/show-newest-listings"

        for page in range(1, 4):
            next_page = url + "/pg-" + str(page)
            yield scrapy.Request(
                url=next_page, headers=self.headers, callback=self.parse, priority=1
            )

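    def parse(self, response):
        # Extract the prices from each listing container on the page.
        for card in response.css("ul.property-list"):
            item = {"price": card.css("span[data-label=pc-price]::text").getall()}
            self.prices.append(item["price"])
            yield item
        # Flatten the per-page lists collected so far into one list and
        # rewrite data.json; once every page is parsed it holds all prices.
        data = [x for y in self.prices for x in y]
        with open("data.json", "w") as f:
            f.write(json.dumps(data))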

After editing the file this way, running scrapy crawl realtor in a shell generates a file named data.json. This file contains exactly what you want, so you can read it:

import json

# Use a context manager so the file is closed after loading.
with open("data.json") as f:
    data = json.load(f)
data

Output:

['$575,000',
 '$399,950',
 '$620,000',
 '$1,150,000',
 '$1,100,000',
 '$880,000',
 '$735,000',
 '$337,000',
 '$759,800',
 '$330,000',
 '$575,000',
 '$740,000',
 '$639,950',
 '$950,000',
 '$575,000',
 '$895,000',
 '$950,000',
 '$675,000',
 '$629,000',
 '$2,000,000',
 '$1,325,000',
 '$714,900',
 '$699,950',
 '$998,000',
 '$1,150,000',
 '$849,999',
 '$999,000',
 '$1,050,000',
 '$750,000',
 '$2,975,000',
 '$1,300,000',
 '$1,350,000',
 '$400,000',
 '$1,349,000',
 '$1,175,000',
 '$1,049,000',
 '$3,500,000',
 '$849,000',
 '$719,000',
 '$734,950',
 '$1,099,000',
 '$769,000',
 '$489,000',
 '$1,095,000',
 '$700,000',
 '$475,000',
 '$450,000',
 '$625,000',
 '$330,000',
 '$425,000',
 '$685,000',
 '$385,000',
 '$649,950',
 '$815,000',
 '$699,000',
 '$525,000',
 '$1,495,000',
 '$325,000',
 '$835,000',
 '$599,950',
 '$1,150,000',
 '$895,000',
 '$998,900',
 '$775,000',
 '$565,000',
 '$750,000',
 '$879,000',
 '$325,000',
 '$1,000,000',
 '$785,000',
 '$725,000',
 '$899,000',
 '$1,095,000',
 '$1,175,000',
 '$815,000',
 '$2,300,000',
 '$950,000',
 '$929,000',
 '$1,249,900',
 '$1,650,000',
 '$1,500,000',
 '$639,950',
 '$995,000',
 '$750,000',
 '$630,000',
 '$999,000',
 '$474,000',
 '$390,000',
 '$485,000',
 '$725,000',
 '$500,000',
 '$340,000',
 '$689,000',
 '$525,000',
 '$650,000',
 '$589,950',
 '$665,000',
 '$725,000',
 '$460,000',
 '$749,450',
 '$1,088,000',
 '$525,000',
 '$495,000',
 '$830,000',
 '$475,000',
 '$999,000',
 '$849,950',
 '$848,000',
 '$480,000',
 '$538,000',
 '$4,585,000',
 '$1,150,000',
 '$1,045,000',
 '$730,000',
 '$630,000',
 '$1,950,000',
 '$899,000',
 '$1,975,000',
 '$1,179,500',
 '$2,100,000',
 '$829,000',
 '$2,750,000',
 '$895,000',
 '$849,950',
 '$619,500',
 '$1,199,000']
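To get the numbered form shown in the question ("$740,000 - 1", "$998,000 - 2", ...), the loaded list can be passed through enumerate. The following is a rough sketch, not part of the original answer: `number_prices` is a hypothetical helper, the three sample prices are the first entries of the output above, and a temporary file stands in for the data.json that the spider writes.

```python
import json
import tempfile
from pathlib import Path

# Small sample standing in for the full scraped list.
sample = ["$575,000", "$399,950", "$620,000"]


def number_prices(prices):
    """Return lines such as '$575,000 - 1', counting from 1."""
    return [f"{price} - {i}" for i, price in enumerate(prices, start=1)]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location; the spider itself writes data.json
    # into the current working directory.
    path = Path(tmp) / "data.json"
    path.write_text(json.dumps(sample))
    loaded = json.loads(path.read_text())

for line in number_prices(loaded):
    print(line)
```

With the real file, the same helper would be applied to the list returned by json.load.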
