How do I handle large-scale web scraping?

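Situation:

I recently started using selenium and scrapy for web scraping. I am working on a project where I have a csv file containing 42,000 zip codes, and my job is to take each zip code, enter it on this site, and scrape all the results.

Problem:

The problem is that I have to keep clicking the "Load more" button until all the results are displayed, and only then can I collect the data.

That might not be a big deal, but it takes 2 minutes per zip code, and I have 42,000 of them to process.

Code: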

    import scrapy
    from numpy.lib.npyio import load
    from selenium import webdriver
    from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException, ElementNotSelectableException, NoSuchElementException, StaleElementReferenceException
    from selenium.webdriver.common.keys import Keys
    from items import CareCreditItem
    from datetime import datetime
    import os
    
    
    from scrapy.crawler import CrawlerProcess
    global pin_code
    pin_code = input("enter pin code")
    
    class CareCredit1Spider(scrapy.Spider):
        
        name = 'care_credit_1'
        start_urls = ['https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty//?Sort=D&Radius=75&Page=1']
    
        def start_requests(self):
            
            directory = os.getcwd()
            options = webdriver.ChromeOptions()
            options.headless = True
    
            options.add_experimental_option("excludeSwitches", ["enable-logging"])
            path = (directory + r"\\Chromedriver.exe")
            driver = webdriver.Chrome(path,options=options)
    
            #URL of the website
            url = "https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty/" + pin_code + "/?Sort=D&Radius=75&Page=1"
            driver.maximize_window()
            #opening link in the browser
            driver.get(url)
            driver.implicitly_wait(200)
            
            try:
                cookies = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
                cookies.click()
            except:
                pass
    
            i = 0
            loadMoreButtonExists = True
            while loadMoreButtonExists:
                try:
                    load_more =  driver.find_element_by_xpath('//*[@id="next-page"]')
                    load_more.click()    
                    driver.implicitly_wait(30)
                except ElementNotInteractableException:
                    loadMoreButtonExists = False
                except ElementClickInterceptedException:
                    pass
                except StaleElementReferenceException:
                    pass
                except NoSuchElementException:
                    loadMoreButtonExists = False
    
            try:
                previous_page = driver.find_element_by_xpath('//*[@id="previous-page"]')
                previous_page.click()
            except:
                pass
    
            name = driver.find_elements_by_class_name('dl-result-item')
            r = 1
            temp_list=[]
            j = 0
            for element in name:
                link = element.find_element_by_tag_name('a')
                c = link.get_property('href')
                yield scrapy.Request(c)
    
        def parse(self, response):
            item = CareCreditItem()
            item['Practise_name'] = response.css('h1 ::text').get()
            item['address'] = response.css('.google-maps-external ::text').get()
            item['phone_no'] = response.css('.dl-detail-phone ::text').get()
            yield item
    now = datetime.now()
    dt_string = now.strftime("%d-%m-%Y")  # dashes, because "/" would be read as a path separator in the file name
    dt = now.strftime("%H-%M-%S")
    file_name = dt_string + "_" + dt + "zip-code" + pin_code + ".csv"
    process = CrawlerProcess(settings={
        'FEED_URI' : file_name,
        'FEED_FORMAT':'csv'
    })
    process.crawl(CareCredit1Spider)
    process.start()
    print("CSV File is Ready")

items.py


    import scrapy

    class CareCreditItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        Practise_name = scrapy.Field()
        address = scrapy.Field()
        phone_no = scrapy.Field()

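Question:

Basically, my question is simple: is there a way to optimize this code so that it runs faster? Or what other approaches could I use to scrape this data without it taking forever?

Answer:

Because the site loads its data dynamically from an API, you can retrieve the data directly from that API. This will speed things up considerably, but I would still add a wait between requests to avoid hitting a rate limit.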

    import requests
    import time
    import pandas as pd

    zipcode = '00704'
    radius = 75
    url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page=1'
    req = requests.get(url)
    r = req.json()
    data = r['results']

    # Page 1 is already loaded, so fetch pages 2..maxPage inclusive.
    for i in range(2, r['maxPage'] + 1):
        url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page={i}'
        req = requests.get(url)
        r = req.json()
        data.extend(r['results'])
        time.sleep(1)  # pause between requests to avoid hitting a rate limit

    df = pd.DataFrame(data)
    # Dashes in the date: a "/" in the file name would be read as a directory separator.
    df.to_csv(f'{pd.Timestamp.now().strftime("%d-%m-%Y_%H-%M-%S")}zip-code{zipcode}.csv')
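
The script above handles a single zip code. One way to extend it to the whole list is sketched below. The names `build_url`, `fetch_zip`, `scrape_all` and `get_json` are illustrative and not part of the original answer. The sketch assumes the API keeps returning the `results` and `maxPage` keys used above, and that each entry in `results` is a JSON object. Writing one JSON object per line is a choice made here because the field names of the results are not known in advance.

```python
import json
import time
from urllib.parse import urlencode

BASE_URL = "https://www.carecredit.com/sites/ContentServer"


def build_url(zipcode, page, radius=75):
    """Build the locator API URL for one zip code and one result page."""
    params = {
        "d": "", "pagename": "CCGetLocatorService", "Zip": zipcode,
        "City": "", "State": "", "Lat": "", "Long": "", "Sort": "D",
        "Radius": radius, "PracticePhone": "", "Profession": "",
        "location": zipcode, "Page": page,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_zip(zipcode, get_json, delay=1.0):
    """Collect the results of every page for one zip code.

    get_json is any callable that takes a URL and returns the decoded JSON
    (hypothetical hook, so the HTTP client can be swapped or mocked).
    """
    first = get_json(build_url(zipcode, 1))
    rows = list(first["results"])
    for page in range(2, first["maxPage"] + 1):
        time.sleep(delay)  # stay polite between page requests
        rows.extend(get_json(build_url(zipcode, page))["results"])
    return rows


def scrape_all(zipcodes, get_json, out_path, delay=1.0):
    """Append every result to out_path as JSON Lines, tagged with its zip code."""
    with open(out_path, "a", encoding="utf-8") as out:
        for zipcode in zipcodes:
            for row in fetch_zip(zipcode, get_json, delay):
                out.write(json.dumps({"zipcode": zipcode, **row}) + "\n")
            time.sleep(delay)
```

With requests, `get_json` could be `lambda url: requests.get(url, timeout=30).json()`, and `zipcodes` would be the column read from the csv file.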

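Answer:

There are several ways to do this.

1. Build a distributed system in which the spider runs on multiple machines in parallel.

In my opinion this is the better option, because you also end up with a scalable, dynamic solution that you can reuse many times.

There are many ways to do this. Typically you split the seed list (the zip codes) into many separate seed lists and have a separate process handle each one, so the downloads run in parallel. For example, on 2 machines it would be roughly 2 times faster, on 10 machines roughly 10 times faster, and so on.

To do this, I would suggest looking into AWS, namely AWS Lambda, AWS EC2 instances, or even AWS Spot Instances. These are the ones I have used before, and they are not hard to work with.

2. Alternatively, if you want to run it on a single machine, you can look at multithreading with Python, which lets you run the process in parallel on one machine.

3. This is another option, especially if it is a one-off job: you can try simply running it with requests, which may speed things up. With a large number of seeds, though, developing a process that runs in parallel is usually faster.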
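
The seed-list splitting from option 1 and the single-machine threading from option 2 can be sketched roughly as follows. The helper names `split_seeds`, `run_parallel` and `estimated_hours` are hypothetical, and `worker` stands for whatever function scrapes one zip code. The time estimate assumes perfect scaling, which rate limits and network overhead will reduce in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def split_seeds(seeds, n_parts):
    """Split seeds into n_parts lists whose sizes differ by at most one."""
    return [seeds[i::n_parts] for i in range(n_parts)]


def run_parallel(seeds, worker, max_workers=8):
    """Apply worker to every seed in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, seeds))


def estimated_hours(n_seeds, seconds_per_seed, n_workers):
    """Idealised wall-clock time in hours, assuming perfect scaling."""
    return n_seeds * seconds_per_seed / n_workers / 3600


# 42,000 zip codes at 2 minutes each, as described in the question.
print(estimated_hours(42000, 120, 1))   # 1400.0 hours on one worker
print(estimated_hours(42000, 120, 10))  # 140.0 hours on ten workers
```

Each list returned by `split_seeds` could be handed to a separate machine, while `run_parallel` covers the case where everything runs on one machine. Threads suit this job because the work is dominated by waiting on the network rather than by computation.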


Tags: Python, selenium, web-scraping, scrapy
