Why doesn't CrawlSpider collect links?

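I am trying to run my first CrawlSpider, but the program terminates without any errors and without returning anything: it finishes with zero results. What is wrong with my code?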

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FagorelectrodomesticoSpider(CrawlSpider):
    name = 'fagorelectrodomestico.com'
    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com']

rules = (
    Rule(LinkExtractor(allow='product/'), callback='parse_item', follow=True),
)

def parse_item(self, response):
    for doc in response.css('a.file'):
        doclink = doc.css('::attr("href")').get()
        product = Product()
        product['model'] = response.css('h2.data__symbol::text').get()
        product['brand'] = 'Fagor'
        product['file_urls'] = [doclink]
        yield product

Reply from a uj5u.com user:

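The main problem is that this page uses JavaScript to add all elements to the HTML, and Scrapy can't run JavaScript. If you turn off JavaScript in your browser and reload the page, you should see a blank page. There is a module, scrapy_selenium, which uses Selenium to control a real web browser that can run JavaScript (but it will run slower).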
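One way to check this without a browser is to look for links in the raw HTML that a plain HTTP client receives. The sketch below uses only the standard library. The two HTML strings are made-up stand-ins, not the real page source: one for what Scrapy would see before JavaScript runs, one for what a browser would show afterwards.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets present in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical markup: an empty shell that JavaScript fills in later.
raw_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
# Hypothetical markup after a browser has executed the script.
rendered_html = '<html><body><div id="app"><a href="/en/product/x1/">X1</a></div></body></html>'

print(extract_links(raw_html))       # nothing for a link extractor to follow
print(extract_links(rendered_html))  # the product link exists only after rendering
```

If the first call finds no links on the real page source, no rule can match anything, which would explain a crawl that ends with zero results.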

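Another problem: your rule searches for links containing product/, which I don't see on the main page, although I can see them on the category pages. You have no rule that loads those other pages, so the spider can't get the product/ links from the subpages. It needs another rule that collects the other links and sends them to the callback parse (where the spider loads the page, searches for all links and checks them against the rules).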

You may need to add /en/ to the start URL to get links with product/. The Spanish version has links with productos/.
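The effect of the allow and deny patterns can be checked offline. LinkExtractor is documented to treat both as regular expressions applied to the absolute URL, with deny taking precedence over allow. The helper `is_followed` and the sample URLs below are hypothetical: a minimal stand-in for that behaviour, not Scrapy's implementation.

```python
import re


def is_followed(url, allow, deny=None):
    """Return True if url matches allow and does not match deny."""
    if deny and re.search(deny, url):
        return False
    return re.search(allow, url) is not None


# Hypothetical URLs modelled on the site's English and Spanish sections.
urls = [
    "https://fagorelectrodomestico.com/en/product/x1/",
    "https://fagorelectrodomestico.com/en/category/ovens/",
    "https://fagorelectrodomestico.com/productos/x1/",
]

for url in urls:
    print(
        url,
        "| product rule:", is_followed(url, "/en/product/"),
        "| other rule:", is_followed(url, "/en/", deny="/en/product/"),
        "| original rule:", is_followed(url, "product/"),
    )
```

Under these assumptions each English URL is handled by exactly one of the two rules, and the original pattern product/ does not match a productos/ link.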

Some extra code is needed to use SeleniumRequest instead of the standard Request. I took some code from the CrawlSpider source and made a few changes.

I also used CrawlerProcess to run the code without creating a project, so everyone can simply copy it and run python script.py

It downloads the files to the folder full.
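The full folder comes from the default naming of FilesPipeline, which the Scrapy documentation describes as based on a SHA-1 hash of the file URL. A rough sketch of that scheme follows. The function name `guess_stored_path` and the sample URL are hypothetical, and the extension handling is simplified compared with what the pipeline actually does.

```python
import hashlib
import os
from urllib.parse import urlparse


def guess_stored_path(file_url):
    """Approximate the path (relative to FILES_STORE) of a downloaded file."""
    digest = hashlib.sha1(file_url.encode("utf-8")).hexdigest()
    # Simplification: take the extension straight from the URL path.
    extension = os.path.splitext(urlparse(file_url).path)[1]
    return f"full/{digest}{extension}"


# Hypothetical document URL, for illustration only.
print(guess_stored_path("https://fagorelectrodomestico.com/files/manual.pdf"))
```

With 'FILES_STORE': '.' this path is relative to the directory the script is started from.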

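I tested it only without the -headless option, so I could see what it shows in the browser. You may want to test it with -headless because it may work faster, but sometimes it behaves differently.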

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import scrapy_selenium

class FagorelectrodomesticoSpider(CrawlSpider):

    name = 'fagorelectrodomestico.com'

    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com/en/']

    rules = (
        Rule(LinkExtractor(allow='/en/product/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow='/en/', deny='/en/product/'), callback='parse', follow=True),
    )

    def start_requests(self):
        print('[start_requests]')
        for url in self.start_urls:
            print('[start_requests] url:', url)            
            yield scrapy_selenium.SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        print('[parse] url:', response.url)
        
        for rule_index, rule in enumerate(self._rules):
            #print(rule.callback)
            for link in rule.link_extractor.extract_links(response):
                yield scrapy_selenium.SeleniumRequest(
                    url=link.url,
                    callback=rule.callback,
                    errback=rule.errback,
                    meta=dict(rule=rule_index, link_text=link.text),
                )
            
    def parse_item(self, response):
        print('[parse_item] url:', response.url)
        
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = {
                'model': response.css('h2.data__symbol::text').get(),
                'brand': 'Fagor',
                'file_urls': [doclink],
            }
            yield product
        

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1

    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},   # used standard FilesPipeline (download to FILES_STORE/full)
    #'FILES_STORE': '/path/to/valid/dir',  # this folder has to exist before downloading
    'FILES_STORE': '.',                   # this folder has to exist before downloading

    'SELENIUM_DRIVER_NAME': 'firefox',
    'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/furas/bin/geckodriver',
    #'SELENIUM_DRIVER_ARGUMENTS': ['-headless'], # '--headless' if using chrome instead of firefox
    'SELENIUM_DRIVER_ARGUMENTS': [],
    #'SELENIUM_BROWSER_EXECUTABLE_PATH': '',
    #'SELENIUM_COMMAND_EXECUTOR': '',
    
    'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800}
})
c.crawl(FagorelectrodomesticoSpider)
c.start()

Reply from a uj5u.com user:

From reading the documentation, it seems this line may be the problem:

rules = (
    Rule(LinkExtractor(allow='product/'), callback='parse_item', follow=True),
)

The documentation says:

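callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link extractor.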
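The lookup described in that quote can be pictured with a small stand-in. This is not Scrapy's code: `resolve_callback` and `DemoSpider` are hypothetical names, and the sketch only illustrates the difference between passing a string and passing a callable.

```python
def resolve_callback(spider, callback):
    """Return the callable itself, or a spider method looked up by name."""
    if callable(callback):
        return callback
    # A string is looked up as an attribute of the spider; None if it is missing.
    return getattr(spider, callback, None)


class DemoSpider:
    def parse_item(self, response):
        return f"parsed {response}"


spider = DemoSpider()

by_name = resolve_callback(spider, "parse_item")      # bound method of the spider
missing = resolve_callback(spider, "no_such_method")  # None: nothing to call

print(by_name("page"))
print(missing)
```

In this sketch a string only resolves when the spider object really has a method with that name. A function defined outside the class would not be found this way.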

Your parse_item is a callable, not a method from the spider object. So I think you should pass it as a callable:

rules = (
    Rule(LinkExtractor(allow='product/'), callback=parse_item, follow=True),
)

Since Python reads from top to bottom, define parse_item() above the rules line:

def parse_item(self, response):
    for doc in response.css('a.file'):
        doclink = doc.css('::attr("href")').get()
        product = Product()
        product['model'] = response.css('h2.data__symbol::text').get()
        product['brand'] = 'Fagor'
        product['file_urls'] = [doclink]
        yield product


rules = (
    Rule(LinkExtractor(allow='product/'), callback=parse_item, follow=True),
)

