Python Scrapy web scraping: problem getting the URL from an onclick element with AJAX content

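I am a beginner at web scraping with Scrapy. I am trying to scrape user reviews for specific books from goodreads.com. I want to scrape all of the reviews for a book, so I have to parse every review page. Each review page has a next_page button at the bottom, and its target is embedded in the onclick attribute, which is where the problem lies. The onclick link contains an AJAX request, and I don't know how to handle this case. Thanks in advance for your help.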

Image of the next-page button

Contents of the onclick attribute

Remainder of the onclick attribute

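I am also new to posting on Stack Overflow, so I apologize for any mistakes. :)

My scraping code is shared below.

Also, here is an example link to one of the books; the review section is near the bottom of the page.

Book link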

import scrapy
from ..items import GoodreadsItem
from scrapy import Request
from urllib.parse import urljoin
from urllib.parse import urlparse



class CrawlnscrapeSpider(scrapy.Spider):
    name = 'crawlNscrape'
    allowed_domains = ['www.goodreads.com']
    start_urls = ['https://www.goodreads.com/list/show/702.Cozy_Mystery_Series_First_Book_of_a_Series']

    def parse(self, response):
        # Collect every book link on this list page and request each one,
        # handing the response to parse_page.
        for href in response.css("a.bookTitle::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_page)

        # Follow the list's next-page link and parse it with this method again.
        next_page = response.xpath("(//a[@class='next_page'])[1]/@href").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_page(self, response):
        # Create a GoodreadsItem and fill it with the title and the reviews
        # found on this single page.
        book = GoodreadsItem()
        book['title'] = response.css("#bookTitle::text").get()
        book['reviews'] = response.css(".readable span:nth-child(2)::text").getall()

        # I want to visit every review page for a book, but the next-page
        # button carries an AJAX request in its onclick attribute, so I
        # can't extract the link to the next page.
        next_page = response.xpath("(//a[@class='next_page'])[1]/@onclick")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url,callback=self.parse_page)

        yield book

Answer:

Instead of the following code:

next_page = response.xpath("(//a[@class='next_page'])[1]/@onclick")
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url,callback=self.parse_page)

Try this.

First, import search from the re module:

from re import search

Then use the following for pagination:

next_page_html = response.xpath("//a[@class='next_page' and @href='#']/@onclick").get()
if next_page_html is not None:
    # Capture everything after the opening quote of Request('...') up to the closing quote.
    next_page_href = search(r"Request\(.([^\']+)", next_page_html)
    if next_page_href:
        url = response.urljoin(next_page_href.group(1))
        yield scrapy.Request(url, callback=self.parse_page)
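
The extraction step can be tried outside a spider. Below is a minimal sketch, assuming the onclick attribute has the general shape `new Ajax.Request('<relative url>', {...}); return false;`. The sample string, the book URL, and the helper name `extract_ajax_url` are hypothetical and chosen only for illustration; the real attribute on the site may differ.

```python
import re
from urllib.parse import urljoin

# Matches the first quoted argument of a Request(...) call in an onclick string.
AJAX_URL_RE = re.compile(r"Request\(\s*['\"]([^'\"]+)['\"]")


def extract_ajax_url(onclick, base_url=None):
    """Return the URL passed to Request(...) in an onclick string, or None.

    If base_url is given, the extracted (possibly relative) URL is resolved
    against it, similar to what response.urljoin does inside a spider.
    """
    if not onclick:
        return None
    match = AJAX_URL_RE.search(onclick)
    if match is None:
        return None
    path = match.group(1)
    return urljoin(base_url, path) if base_url else path


# Hypothetical onclick value, not copied from the real page.
sample = "new Ajax.Request('/book/reviews/1?page=2', {asynchronous:true}); return false;"

print(extract_ajax_url(sample))
# /book/reviews/1?page=2
print(extract_ajax_url(sample, "https://www.goodreads.com/book/show/1"))
# https://www.goodreads.com/book/reviews/1?page=2
```

Returning None when nothing matches lets the spider skip pagination quietly on the last review page, where the button has no AJAX call to extract.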

Tags: javascript python ajax scrapy web-crawler