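I'm trying to run my first CrawlSpider, but the program terminates without any error and without returning anything: it finishes with zero results. What is wrong with my code?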
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FagorelectrodomesticoSpider(CrawlSpider):
    name = 'fagorelectrodomestico.com'
    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com']

    rules = (
        Rule(LinkExtractor(allow='product/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = Product()
            product['model'] = response.css('h2.data__symbol::text').get()
            product['brand'] = 'Fagor'
            product['file_urls'] = [doclink]
            yield product
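Answer from a uj5u.com user:
The main problem is that this page uses JavaScript to add all of its elements to the HTML, and Scrapy can't run JavaScript. If you turn off JavaScript in your browser and reload the page, you should see a blank page. There is, however, a module called scrapy_selenium that uses Selenium to control a real web browser, which can run JavaScript (but it will run slower).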
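One way to confirm this kind of problem is to check whether the elements your selectors target exist in the HTML as it arrives from the server, before any script has run. The sketch below uses only the standard library and two hypothetical HTML strings (they are illustrations, not copies of the real site's markup); `ClassCounter` and `count_matches` are names made up for this example.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one type that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None, hence the `or ''`.
        classes = (dict(attrs).get('class') or '').split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_matches(html, tag='a', css_class='file'):
    """Return how many <tag class="... css_class ..."> elements `html` contains."""
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    return parser.count


# Hypothetical markup: what a JavaScript-driven page might send before scripts run...
RAW_HTML = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
# ...and what the browser's DOM might contain after the scripts have run.
RENDERED_HTML = ('<html><body><div id="app">'
                 '<a class="file" href="/docs/manual.pdf">Manual</a>'
                 '</div></body></html>')

print(count_matches(RAW_HTML))       # 0
print(count_matches(RENDERED_HTML))  # 1
```

If the count is zero for the raw response but the element is visible in the browser, the content is most likely being added by JavaScript.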
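Another problem: your rule searches for links containing product/, which I can't see on the main page but can see on the category pages. Your spider has no rule that loads those other pages, so it can never reach the product/ links on the subpages. It needs a second rule that picks up the other links and sends them to the callback parse (in a Spider, that callback loads the page, finds all the links, and checks them against the rules).
You may also need to add /en/ to the start URL to get links containing product/. The Spanish version uses links containing productos/ instead.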
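To see how the two rules in this answer split the links between them, here is a small standard-library sketch. It relies on the fact that LinkExtractor treats `allow` and `deny` as regular expressions searched against each URL. The `classify` helper and the three example URLs are hypothetical; only the patterns come from the answer.

```python
import re

# Patterns taken from the two rules used in this answer.
PRODUCT = re.compile(r'/en/product/')
SECTION = re.compile(r'/en/')


def classify(url):
    """Return the callback name the rules would pick for `url`, or None."""
    if PRODUCT.search(url):
        return 'parse_item'   # first rule: allow='/en/product/'
    if SECTION.search(url):
        return 'parse'        # second rule: allow='/en/', deny='/en/product/'
    return None               # no rule matches, so the link is not followed


# Hypothetical URLs, for illustration only.
PRODUCT_URL = 'https://fagorelectrodomestico.com/en/product/some-model/'
CATEGORY_URL = 'https://fagorelectrodomestico.com/en/some-category/'
SPANISH_URL = 'https://fagorelectrodomestico.com/productos/algun-modelo/'

for url in (PRODUCT_URL, CATEGORY_URL, SPANISH_URL):
    print(classify(url), url)
```

The Spanish URL matches neither rule, which is why starting from the /en/ version of the site matters here.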
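Some extra code is needed to use SeleniumRequest instead of the standard Request. I took some code from the CrawlSpider source and made a few changes.
I also used CrawlerProcess to run the code without creating a project, so anyone can simply copy it and run python script.py
It downloads the files to the folder full.
I only tested it without the -headless option, so that I could see what it does in the browser. You may want to test with -headless because it may run faster, but sometimes it behaves differently.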
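To make switching between the two modes easier, the Selenium-related settings can be built by a small helper. This is only a sketch: `selenium_settings` is a hypothetical function, the driver path is a placeholder, and the setting keys and the Firefox/Chrome flag spellings are the ones that appear in the script in this answer.

```python
def selenium_settings(driver_path, browser='firefox', headless=False):
    """Build the scrapy-selenium settings dict (hypothetical helper)."""
    # Firefox takes '-headless'; Chrome takes '--headless', as noted in the script.
    flag = '-headless' if browser == 'firefox' else '--headless'
    return {
        'SELENIUM_DRIVER_NAME': browser,
        'SELENIUM_DRIVER_EXECUTABLE_PATH': driver_path,
        'SELENIUM_DRIVER_ARGUMENTS': [flag] if headless else [],
        'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},
    }


# Placeholder path: replace it with the location of your own driver binary.
visible = selenium_settings('/path/to/geckodriver')
hidden = selenium_settings('/path/to/geckodriver', headless=True)
print(visible['SELENIUM_DRIVER_ARGUMENTS'])  # []
print(hidden['SELENIUM_DRIVER_ARGUMENTS'])   # ['-headless']
```

The returned dict can be merged into the settings passed to CrawlerProcess, so that comparing a visible run with a headless run is a one-argument change.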
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import scrapy_selenium

class FagorelectrodomesticoSpider(CrawlSpider):
    name = 'fagorelectrodomestico.com'
    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com/en/']

    rules = (
        Rule(LinkExtractor(allow='/en/product/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow='/en/', deny='/en/product/'), callback='parse', follow=True),
    )

    def start_requests(self):
        print('[start_requests]')
        for url in self.start_urls:
            print('[start_requests] url:', url)
            yield scrapy_selenium.SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        print('[parse] url:', response.url)
        for rule_index, rule in enumerate(self._rules):
            #print(rule.callback)
            for link in rule.link_extractor.extract_links(response):
                yield scrapy_selenium.SeleniumRequest(
                    url=link.url,
                    callback=rule.callback,
                    errback=rule.errback,
                    meta=dict(rule=rule_index, link_text=link.text),
                )

    def parse_item(self, response):
        print('[parse_item] url:', response.url)
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = {
                'model': response.css('h2.data__symbol::text').get(),
                'brand': 'Fagor',
                'file_urls': [doclink],
            }
            yield product

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},  # used standard FilesPipeline (download to FILES_STORE/full)
    #'FILES_STORE': '/path/to/valid/dir',  # this folder has to exist before downloading
    'FILES_STORE': '.',  # this folder has to exist before downloading
    'SELENIUM_DRIVER_NAME': 'firefox',
    'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/furas/bin/geckodriver',
    #'SELENIUM_DRIVER_ARGUMENTS': ['-headless'],  # '--headless' if using chrome instead of firefox
    'SELENIUM_DRIVER_ARGUMENTS': [],
    #'SELENIUM_BROWSER_EXECUTABLE_PATH': '',
    #'SELENIUM_COMMAND_EXECUTOR': '',
    'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800}
})
c.crawl(FagorelectrodomesticoSpider)
c.start()
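If you want to find a downloaded file afterwards, it helps to know how the name inside full is chosen. The sketch below approximates the default naming of the standard FilesPipeline, assuming it hashes the request URL with SHA-1 and keeps the URL's extension; this is a simplification (it ignores how unknown or missing extensions are handled), and `expected_file_path` and the example URL are hypothetical.

```python
import hashlib
import os


def expected_file_path(url):
    """Approximate the path, relative to FILES_STORE, used for `url`."""
    # Assumption: default naming is 'full/<sha1 of the URL><extension>'.
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    extension = os.path.splitext(url)[1]
    return f'full/{digest}{extension}'


# Hypothetical document URL, for illustration only.
DOC_URL = 'https://fagorelectrodomestico.com/docs/manual.pdf'
print(expected_file_path(DOC_URL))
```

The same relative path should also appear in the `files` field that the pipeline adds to each item, which is the more reliable place to look it up.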
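Answer from a uj5u.com user:
From reading the documentation, it looks like this line may be the problem:
rules = (
    Rule(LinkExtractor(allow='product/'), callback='parse_item', follow=True),
)
The documentation says:
callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link extractor.
Your parse_item is a callable, not a method from the spider object. So I think you should pass it as a callable instead:
rules = (
    Rule(LinkExtractor(allow='product/'), callback=parse_item, follow=True),
)
Since Python reads from top to bottom, define parse_item() above the rules line: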
def parse_item(self, response):
    for doc in response.css('a.file'):
        doclink = doc.css('::attr("href")').get()
        product = Product()
        product['model'] = response.css('h2.data__symbol::text').get()
        product['brand'] = 'Fagor'
        product['file_urls'] = [doclink]
        yield product

rules = (
    Rule(LinkExtractor(allow='product/'), callback=parse_item, follow=True),
)
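To make the documentation passage quoted in this answer concrete, the sketch below mimics the two callback forms in plain Python. It is an illustration rather than Scrapy's own code: `resolve_callback` and `DemoSpider` are hypothetical names, and the assumption is that a string callback is looked up as an attribute on the spider instance.

```python
def resolve_callback(spider, callback):
    """Return a callable for `callback`, given either a callable or a method name."""
    if callable(callback):
        return callback
    # A string is treated as the name of a method on the spider object.
    return getattr(spider, callback, None)


class DemoSpider:  # hypothetical stand-in for a spider class
    def parse_item(self, response):
        return f'parsed {response}'


spider = DemoSpider()
by_name = resolve_callback(spider, 'parse_item')
by_object = resolve_callback(spider, spider.parse_item)
print(by_name('page'), '|', by_object('page'))  # parsed page | parsed page
```

In this sketch both forms end up calling the same method, and a name that does not exist on the spider resolves to None.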
If you repost this, please credit the source. Link to this article: https://www.uj5u.com/gongcheng/369481.html
