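This spider more or less works and gives me a response, but I have a few problems. The first is the order in which pages are scraped. I want it to start at page 1 and continue up to the range I set, but right now it seems to run in random order and also repeats pages. The second is the output: items are either all duplicated, contain null values, or come out of order. I don't know whether the problem is in the rules or in the spider.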
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesSpider(CrawlSpider):
    name = "catspider"

    # Build the start URLs from the odd values of i in 1..9.
    start_urls = []
    for i in range(1, 10):
        if i % 2 == 1:
            start_urls.append('https://www.worldcat.org/title/rose-in-bloom/oclc/' + str(i) + '&referer=brief_results')

    rules = (
        # Follow every link whose URL contains "title" (no callback).
        Rule(LinkExtractor(allow='title')),
        # Parse every link whose URL contains "oclc".
        Rule(LinkExtractor(allow='oclc'), callback='parse_item')
    )

    def parse_item(self, response):
        yield {
            'title': response.css('h1.title::text').get(),
            'author': response.css('td[id="bib-author-cell"] a::text').getall(),
            'publisher': response.css('td[id="bib-publisher-cell"]::text').get(),
            'format': response.css('span[id="editionFormatType"] span::text').get(),
            'isbn': response.css('tr[id="details-standardno"] td::text').get(),
            'oclc': response.css('tr[id="details-oclcno"] td::text').get()
        }
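Extra info: for someone with more experience with Scrapy, which is better and why, XPath or CSS selectors?
Thanks for any information.
Answer from a uj5u.com community member: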
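You can do the pagination directly in start_urls with a for loop over a range, which is about twice as fast as other kinds of pagination. If every item on the result page contains a link, using an XPath in the rules is one of the best ways to follow those links.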
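As a rough illustration of the start_urls pagination described above, the sketch below builds the list of search-result URLs with a plain range loop. The helper name build_start_urls and its parameters are made up for this example; the URL pattern is copied from the spider further down and may not match what the site serves today.

```python
from typing import List

# URL pattern taken from the answer's spider; whether the site still
# accepts it is an assumption.
BASE = "https://www.worldcat.org/search?q=oclc&fq=&dblist=638&start="
SUFFIX = "1&qt=page_number_link"


def build_start_urls(first: int, last: int) -> List[str]:
    """Hypothetical helper: one search URL per i in first..last, ascending.

    Mirrors the answer's pattern, where the digit 1 is appended to i,
    so i=1 gives start=11, i=2 gives start=21, and so on.
    """
    return [BASE + str(i) + SUFFIX for i in range(first, last + 1)]


urls = build_start_urls(1, 10)
print(urls[0])
print(len(urls))
```

The list itself is in ascending order, but Scrapy downloads concurrently, so responses can still come back in a different order. If strict ordering matters, sorting the exported items afterwards is one option.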
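"Extra info: from someone that has more experience with Scrapy, what is better and why, XPath or CSS tags?"
Regarding your "Extra info" question: XPath and CSS locators are both good, but XPath is a bit richer because it can easily move up and down the HTML tree. You can also mix the two and apply XPath and CSS together. Here is a working example.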
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess


class QuotesSpider(CrawlSpider):
    name = "catspider"

    # One search-result page per i: start=11, 21, ..., 101.
    start_urls = ['https://www.worldcat.org/search?q=oclc&fq=&dblist=638&start=' + str(i) + '1&qt=page_number_link' for i in range(1, 11)]

    # Follow the item links on each result page and parse them.
    # The attribute test inside [@...] is missing in the source and is left as-is.
    rules = (Rule(LinkExtractor(restrict_xpaths='//*[@]/a'), callback='parse_item', follow=True),)

    def parse_item(self, response):
        yield {
            'title': response.css('h1.title::text').get(),
            'author': response.css('td[id="bib-author-cell"] a::text').getall(),
            'publisher': response.css('td[id="bib-publisher-cell"]::text').get(),
            'format': response.css('span[id="editionFormatType"] span::text').get(),
            'isbn': response.css('tr[id="details-standardno"] td::text').get(),
            'oclc': response.css('tr[id="details-oclcno"] td::text').get()
        }


# Run the spider from a plain Python script instead of "scrapy crawl".
process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
