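Could someone take a look at this? I'm using a rules-based CrawlSpider in Scrapy, but it never follows the pagination links and the scraped data comes back empty. The target is the Tencent recruitment site.
Here is the code: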
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Txzp1Spider(CrawlSpider):
    name = 'txzp1'
    # allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?lid=&tid=&keywords=java&start=0#a']

    rules = (
        Rule(LinkExtractor(allow=r'position.php?lid=&tid=&keywords=java&start=\d#a'), follow=True),
        Rule(LinkExtractor(allow=r'position_detail.php?id=\d+&keywords=java&tid=0&lid=0'),
             callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):
        print("===========")
        title = response.xpath("//tr[@class='h']/td/text()").get()
        region = response.xpath("//tr[@class='c bottomline']/td[1]/text()").get()
        position_type = response.xpath("//tr[@class='c bottomline']/td[2]/text()").get()
        number = response.xpath("//tr[@class='c bottomline']/td[3]/text()").get()
        duty = response.xpath(
            "//table[@class='tablelist textl']//tr[@class='c'][1]//ul[@class='squareli']/li/text()").getall()
        yaoqiu = response.xpath(
            "//table[@class='tablelist textl']//tr[@class='c'][2]//ul[@class='squareli']/li/text()").getall()
        item = {"title": title, "position_type": position_type, "number": number, "region": region,
                "duty": duty, "yaoqiu": yaoqiu}
        print(item)
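A uj5u.com user replied:
The problem is probably this address: LinkExtractor(allow=r'position.php?lid=&tid=&keywords=java&start=\d#a'),follow=True
Shouldn't start be followed by an actual page number? As written, you are effectively just passing \d, so no page turn happens, and that is why no content is retrieved either.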
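One way to check whether the allow patterns are the culprit is to test them with Python's re module, outside Scrapy. The sketch below assumes that each allow pattern is applied as a regular-expression search against the extracted URL. The sample page offset (10) and job id (12345) are made up for illustration, and the FIXED_* patterns are only a guess at what was intended. In the original patterns the unescaped `?` makes the preceding `p` optional instead of matching a literal question mark, and `\d` accepts exactly one digit. The adjusted list pattern also stops requiring the `#a` fragment, so it works whether or not the fragment is present in the URL being tested.

```python
import re

# Patterns exactly as written in the question's rules.
ORIGINAL_LIST = r'position.php?lid=&tid=&keywords=java&start=\d#a'
ORIGINAL_DETAIL = r'position_detail.php?id=\d+&keywords=java&tid=0&lid=0'

# Adjusted patterns (an assumption about the intent, not the asker's code):
# "." and "?" are escaped so they match literally, and \d+ allows
# multi-digit offsets such as start=10.
FIXED_LIST = r'position\.php\?lid=&tid=&keywords=java&start=\d+'
FIXED_DETAIL = r'position_detail\.php\?id=\d+&keywords=java&tid=0&lid=0'

# Sample URLs; the offset 10 and the id 12345 are hypothetical values.
LIST_URL = 'https://hr.tencent.com/position.php?lid=&tid=&keywords=java&start=10#a'
DETAIL_URL = 'https://hr.tencent.com/position_detail.php?id=12345&keywords=java&tid=0&lid=0'


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    checks = [
        ("original list pattern", ORIGINAL_LIST, LIST_URL),
        ("original detail pattern", ORIGINAL_DETAIL, DETAIL_URL),
        ("fixed list pattern", FIXED_LIST, LIST_URL),
        ("fixed detail pattern", FIXED_DETAIL, DETAIL_URL),
    ]
    for label, pattern, url in checks:
        print(label, "->", matches(pattern, url))
```

If the original patterns match nothing, neither rule ever fires, which would explain both symptoms at once: no pagination and no call to parse_detail.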
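A uj5u.com user replied:
The page number should be passed to start dynamically.
A uj5u.com user replied:
The data is empty because yield item is missing.
A uj5u.com user replied:
The rules are probably written incorrectly; a wrong rule is exactly what would leave the result empty.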
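On the missing yield item point, the difference can be illustrated without Scrapy at all. The sketch below is a simplified stand-in, not Scrapy's actual machinery: collect is a hypothetical helper that plays the role of the framework, and the sample field values are invented. It only shows that a callback which prints its item hands nothing back to the caller, while one that yields the item does.

```python
def parse_detail_print_only(fields):
    """Mirrors the question's callback: builds the item, prints it, returns None."""
    item = {"title": fields["title"], "region": fields["region"]}
    print(item)


def parse_detail_yield(fields):
    """Builds the same item, but yields it so the caller receives it."""
    item = {"title": fields["title"], "region": fields["region"]}
    yield item


def collect(callback, fields):
    """Hypothetical stand-in for the framework: gather whatever the callback produces."""
    result = callback(fields)
    return [] if result is None else list(result)


if __name__ == "__main__":
    sample = {"title": "Java Engineer", "region": "Shenzhen"}  # made-up values
    print(collect(parse_detail_print_only, sample))  # item is printed, nothing collected
    print(collect(parse_detail_yield, sample))       # item is collected
```

In the spider from the question, the corresponding change would be to end parse_detail with yield item in place of (or in addition to) print(item), so the item reaches the item pipelines and feed exports.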
