I have the following spider, in which I am trying to combine pagination with a rule that visits the links on each page.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and a _)
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse', follow=True),
    )

    def parse(self, response):
        # just get all the text
        all_text = response.xpath("//text()").getall()
        yield {
            "text": " ".join(all_text),
            "url": response.url
        }
        # visit next page
        # next_page_url = response.xpath('//a[@]').extract_first()
        # if next_page_url is not None:
        #     yield scrapy.Request(response.urljoin(next_page_url))
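I would like to implement the following behavior:
Start at page 1, https://ausschreibungen-deutschland.de/1/, visit all 10 links and get the text. (already implemented)
Go to page 2, https://ausschreibungen-deutschland.de/2/, visit all 10 links and get the text.
Go to page 3, https://ausschreibungen-deutschland.de/3/, visit all 10 links and get the text.
Go to page 4...
How would I combine these two concepts?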
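Reply from a uj5u.com user:
I have handled the pagination in start_urls by generating one listing URL per page number; you can widen or narrow the page range as needed.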
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    # One listing page per number: /1/ through /10/
    start_urls = ['https://ausschreibungen-deutschland.de/' + str(x) + '/' for x in range(1, 11)]

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and a _)
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse', follow=False),
    )

    def parse(self, response):
        # just get all the text
        # all_text = response.xpath("//text()").getall()
        yield {
            # "text": " ".join(all_text),
            # The attribute test inside [@...] is missing in the source; restore it before running.
            'title': response.xpath('//*[@]/h2//text()').get(),
            "url": response.url
        }
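To check the two pieces of this answer without running a crawl, here is a minimal standard-library sketch. It assumes the intended link pattern is `/[0-9]+_` (a slash, one or more digits, then an underscore) and that the listing pages are numbered `/1/` to `/10/`. The helper names `build_start_urls` and `is_detail_link` are made up for illustration, and the detail URL in the last lines is hypothetical, not taken from the site.

```python
import re

BASE_URL = "https://ausschreibungen-deutschland.de/"

# Assumed to be the same pattern given to LinkExtractor(allow=...):
# a slash, one or more digits, then an underscore.
DETAIL_LINK = re.compile(r"/[0-9]+_")


def build_start_urls(first_page, last_page):
    """Return one listing URL per page number, inclusive on both ends."""
    return [f"{BASE_URL}{page}/" for page in range(first_page, last_page + 1)]


def is_detail_link(url):
    """True when the pattern is found anywhere in the URL."""
    return DETAIL_LINK.search(url) is not None


start_urls = build_start_urls(1, 10)
print(start_urls[0])    # https://ausschreibungen-deutschland.de/1/
print(start_urls[-1])   # https://ausschreibungen-deutschland.de/10/
print(len(start_urls))  # 10

# Hypothetical detail URL, used only to exercise the pattern.
print(is_detail_link(BASE_URL + "123456_Beispiel-Ausschreibung"))  # True
# A listing page has digits but no underscore, so the rule skips it.
print(is_detail_link(BASE_URL + "2/"))  # False
```

Because the pattern does not match the listing URLs themselves, the rule only ever sends detail pages to the callback, and the pagination comes entirely from the range used to build `start_urls`. If the site has more than 10 listing pages, the upper bound has to be raised by hand.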
