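Hi everyone, I've pasted my code below. My main problem is at the step that collects all the novel links: the selenium browser opens every page one after another, and only on the last novel's page does it start running the code that follows, so each run scrapes just one novel. What is going wrong? (The middleware should be fine, since the pages themselves load normally.)
I really don't know what to do. The teachers on the project team are pressing me quite hard, so any help would be appreciated.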

Spider file:
zongheng.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider, Request
from w3lib.html import remove_tags
from zongheng.items import ZonghengItem
from selenium import webdriver


class PassageSpider(Spider):
    name = 'passage'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One browser instance shared by every request of this spider.
        self.browser = webdriver.Firefox()
        self.browser.set_page_load_timeout(30)

    def closed(self, spider):
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        start_urls = [r'http://book.zongheng.com/store/c0/c0/b0/u4/p48/v9/s1/t0/u0/i1/ALL.html']
        for i in start_urls:
            yield Request(url=i, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Collect the links to all novels on the listing page.
        book_url_list = response.xpath(
            "/html/body/div[2]/em/div[1]/div[1]/div/div[2]/div[1]/a/@href"
        ).extract()
        for book_url in book_url_list:
            yield Request(book_url, callback=self.parse_read, dont_filter=True)

    def parse_read(self, response):
        # Go to the novel's table of contents.
        # Reads from the shared browser rather than from `response`.
        book_catalogue_list = self.browser.find_element_by_xpath(
            '/html/body/div[2]/div[5]/div[1]/div[1]/div[1]/div[2]/div[5]/div[2]/a[1]'
        )
        book_catalogue = book_catalogue_list.get_attribute('href')
        yield Request(book_catalogue, callback=self.parse_chapter)

    def parse_chapter(self, response):
        # Collect the chapter links.
        book_directory = response.xpath(
            '/html/body/div[@class="container"]/div/div[@class="volume-list"]/div/ul[@class="chapter-list clearfix"]/li/a/@href'
        ).extract()
        for chapter in book_directory:
            yield Request(chapter, callback=self.parse_content)

    def parse_content(self, response):
        # Extract the chapter text.
        name = response.xpath("/html/body/div[2]/div[3]/div[2]/a[3]/text()").extract_first()
        print(name)
        chapter_name = response.xpath(
            "/html/body/div[2]/div[3]/div[3]/div/div[2]/div[2]/text()"
        ).extract()
        chapter_content0 = response.xpath(
            "/html/body/div[2]/div[3]/div[3]/div/div[5]//text()"
        ).extract()
        chapter_content1 = []
        for chapter in chapter_content0:
            chapter1 = remove_tags(chapter)
            chapter_content1.append(chapter1)
        chapter_content = "".join(chapter_content1)
        item = ZonghengItem()
        item['name'] = name
        item['chap_name'] = chapter_name[0]
        item['chap_content'] = chapter_content
        yield item
middlewares.py (this is the part I added; I did not touch the rest)
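import time

from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'passage':
            try:
                spider.browser.get(request.url)
            except TimeoutException as e:
                print('Timed out')
                # Stop loading and keep whatever has rendered so far.
                spider.browser.execute_script('window.stop()')
            time.sleep(2)
            # Wrap the rendered page in a response for the spider callbacks.
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)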
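A note on what this middleware hands back, since it may explain the symptom: each returned response is a copy of the page taken at that moment, while spider.browser is one shared object that keeps moving on with every later get(). The sketch below is a stdlib-only simulation, not Scrapy or Selenium code; FakeBrowser, FakeResponse and the book/N URLs are made up for illustration, and it assumes that several requests pass through the middleware before any callback runs, which matches the behaviour described above but is not confirmed in this thread.

```python
from dataclasses import dataclass


class FakeBrowser:
    """Hypothetical stand-in for the single shared webdriver instance."""

    def __init__(self):
        self.current_url = None
        self.page_source = ""

    def get(self, url):
        # Loading a new page overwrites whatever was loaded before.
        self.current_url = url
        self.page_source = f"<html>catalogue link for {url}</html>"


@dataclass(frozen=True)
class FakeResponse:
    """Hypothetical stand-in for HtmlResponse: a frozen copy of one page."""
    url: str
    body: str


def process_request(browser, url):
    # Mirrors the middleware: load the page, then copy it into a response.
    browser.get(url)
    return FakeResponse(url=browser.current_url, body=browser.page_source)


browser = FakeBrowser()
book_urls = ["book/1", "book/2", "book/3"]  # made-up URLs

# Assumption: all three requests go through the middleware before callbacks run.
responses = [process_request(browser, u) for u in book_urls]

# What each callback would see via the shared browser vs. via its own response.
seen_via_browser = [browser.current_url for _ in responses]
seen_via_response = [r.url for r in responses]

print(seen_via_browser)   # ['book/3', 'book/3', 'book/3']
print(seen_via_response)  # ['book/1', 'book/2', 'book/3']
```

If this is what is happening, one thing to try is having parse_read take the catalogue link from its own response with response.xpath(...) and @href, the same way parse and parse_chapter do, instead of calling self.browser.find_element_by_xpath(...).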
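uj5u.com helpful user replied:
I tried opening your start_urls. The p48 parameter in it presumably means page 48, and that page contains only one novel, 《滄海市的戰斗》. Could that be the problem? I haven't looked at the rest of the code yet.
uj5u.com helpful user replied:
The URL is not the problem; I switched to other pages and it still doesn't work.
uj5u.com helpful user replied:
To restate briefly: my main problem is that when selenium opens each novel's sub-page, it does not run parse_read and the functions after it. Instead it goes on to open the next novel's page, and only when it reaches the last novel does it start running parse_read and what follows.
uj5u.com helpful user replied: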
I suspect it may be a problem with yield.
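On the yield guess, it may help to look at what yield does in plain Python: it turns the callback into a generator function, so calling it runs none of the body, and the body runs only when something iterates over the result. It still runs once per call, with the argument it was given. A minimal sketch with made-up page names (this is ordinary Python, not Scrapy code):

```python
log = []


def parse_read(page):
    # Nothing in this body runs until the generator is iterated.
    log.append(f"parse_read body ran for {page}")
    yield f"catalogue request for {page}"


# Calling a generator function only creates generator objects.
pending = [parse_read(page) for page in ["book/1", "book/2"]]
ran_before_iteration = list(log)  # still empty at this point

# Iterating over each generator is what actually executes its body.
results = [item for gen in pending for item in gen]

print(ran_before_iteration)  # []
print(log)      # ['parse_read body ran for book/1', 'parse_read body ran for book/2']
print(results)  # ['catalogue request for book/1', 'catalogue request for book/2']
```

So yield on its own would delay when a callback body runs rather than drop books. What does matter is that anything shared between calls and read inside the body is seen as it is at iteration time, not as it was when the page was fetched.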