How do I get to the next page when scraping if the link stays the same?

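I've been working on web scraping recently and I'm stuck. I need to scrape data from the next page, but there is only a clickable button and the link stays the same. So my question is: if the URL doesn't change, how do I get to the next page? The page I'm scraping is http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp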

My code so far:

import scrapy
import json


class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    # allowed_domains takes bare domain names, not full paths
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']

    def start_requests(self):
        # Send a POST request to the site
        return [scrapy.FormRequest("http://esg.krx.co.kr/contents/99/ESG99000001.jspx",
                                   formdata={'sch_com_nm': '',
                                             'sch_yy': '2021',
                                             'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                                             'code': '02/02020000/esg02020000',
                                             'pageFirstCall': 'Y'},
                                   callback=self.parse)]

    def parse(self, response):
        dict_data = json.loads(response.text)

        # Loop over the results and read the company name and share ID
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            print(company_name, company_share_id)

So right now I only get the information from the first page. Now I need to move on to the next page. Can someone explain how I should do that?

Answer from a uj5u.com user:

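The site you are scraping exposes an API that you can call directly instead of using splash. If you check the network tab in your browser's developer tools, you will see the POST request being sent to the server.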

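See the sample code below. I have hardcoded the total number of pages, but you can find a way to get the total automatically instead of hardcoding the value.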

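Note the use of response.follow. The request is issued from the start page's response, so Scrapy passes along any cookies the site has already set.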

import scrapy

class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        "USER_AGENT": 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
    }

    def parse(self, response):
        #send a post request to the api
        url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"
        
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }

        total_pages = 77
        for page in range(total_pages):
            payload = f"sch_com_nm=&sch_yy=2021&pagePath=/contents/02/02020000/ESG02020000.jsp&code=02/02020000/esg02020000&curPage={page + 1}"
            yield response.follow(url=url, method='POST', callback=self.parse_result, headers=headers, body=payload)

    def parse_result(self, response):

        # Loop over the results and yield the company name and share ID
        for item in response.json().get('result'):
            yield {
                'company_name': item.get('com_abbrv'),
                'company_share_id': item.get('isu_cd')
            }
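
One way to avoid hardcoding total_pages, sketched here without Scrapy so it can be run on its own: keep requesting pages until one comes back with an empty result list. This assumes the API returns an empty result once you go past the last page, which I have not verified for this site. The names iter_results, fetch_page and fake_fetch are hypothetical; fake_fetch only stands in for the real POST call and returns made-up rows.

```python
import json
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int], str], max_pages: int = 1000) -> Iterator[dict]:
    """Yield rows page by page, stopping at the first page with no rows.

    fetch_page is a hypothetical callable that takes a 1-based page number
    and returns the JSON text of the response for that page.
    """
    for page in range(1, max_pages + 1):
        data = json.loads(fetch_page(page))
        rows = data.get('result') or []
        if not rows:  # assumption: an empty 'result' means we ran past the last page
            return
        yield from rows


def fake_fetch(page: int) -> str:
    """Stand-in for the real request: two pages of made-up rows, then nothing."""
    pages = {
        1: [{'com_abbrv': 'A', 'isu_cd': '000001'}, {'com_abbrv': 'B', 'isu_cd': '000002'}],
        2: [{'com_abbrv': 'C', 'isu_cd': '000003'}],
    }
    return json.dumps({'result': pages.get(page, [])})


rows = list(iter_results(fake_fetch))
print(len(rows))  # 3
```

In a spider, the same idea would mean yielding the request for the next page from parse_result only when the current page contained rows, rather than queuing every page up front.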

Answer from a uj5u.com user:

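I find JavaScript-heavy sites like the one you are working with easier to handle with scrapy_splash, because they usually take a while to load after the request is sent. So I wrote a simple Lua script that loads the site, and then I parse the information I need.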

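You will find that the payload includes the page you are currently on; by incrementing this number up to the last page of the site, you can scrape each following page.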
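
To make the payload point concrete, here is a small sketch that builds the POST body for a given page with the standard library's urlencode. The form fields are the ones from the question; BASE_FORM and build_payload are names I made up for illustration, and page numbers are assumed to start at 1 as in the spiders in this thread.

```python
from urllib.parse import parse_qs, urlencode

# Form fields taken from the question; only curPage changes between pages.
BASE_FORM = {
    'sch_com_nm': '',
    'sch_yy': '2021',
    'pagePath': '/contents/02/02020000/ESG02020000.jsp',
    'code': '02/02020000/esg02020000',
}


def build_payload(page: int) -> str:
    """Return the URL-encoded POST body for the given 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return urlencode({**BASE_FORM, 'curPage': str(page)})


# Bodies for the first three pages; they differ only in curPage.
payloads = [build_payload(page) for page in range(1, 4)]

# Decode one body again to confirm which page it asks for.
print(parse_qs(payloads[1])['curPage'])  # ['2']
```

Using urlencode also percent-encodes the slashes in pagePath and code, which a hand-written string would leave raw.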

Sites like this will block you quickly, so it is important to add wait times and a download delay to keep that from happening.
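
As a rough sketch of what a slower setup could look like, the helper below returns a dict of standard Scrapy settings that you could assign to custom_settings. The setting names are real Scrapy settings; the function name polite_settings and the specific values are assumptions on my part, not something measured against this site.

```python
def polite_settings(download_delay: float = 3.0) -> dict:
    """Return Scrapy settings that slow the crawl down (values are guesses)."""
    return {
        # Base wait between requests to the same site, in seconds
        'DOWNLOAD_DELAY': download_delay,
        # Vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        # Send one request at a time to the domain
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        # Let AutoThrottle raise the delay when the server responds slowly
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': download_delay,
        'AUTOTHROTTLE_MAX_DELAY': 30.0,
    }


settings = polite_settings()
print(settings['DOWNLOAD_DELAY'])  # 3.0
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound for the delay, so the crawl never runs faster than the value you pass in.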

Here is a working scraper:

import scrapy
from scrapy_splash import SplashRequest
import json

script = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(7))
  return splash:html()
end
"""
class KorenSiteSpider(scrapy.Spider):
    name = 'k-site'
    start_urls = ['https://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        'USER_AGENT':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
        'DOWNLOAD_DELAY':3
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url = url,
                callback = self.parse, 
                endpoint='execute',
                args = {'lua_source':script}
            )

    def parse(self, response):
        for i in range(1, 78, 1):
            yield scrapy.FormRequest(
                url = 'https://esg.krx.co.kr/contents/99/ESG99000001.jspx',
                method = 'POST',
                formdata = {
                            'sch_com_nm': '',
                            'sch_yy': '2021',
                            'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                            'code': '02/02020000/esg02020000',
                            'curPage': str(i)
                            },
                callback = self.parse_json
            )

    def parse_json(self, response):
        dict_data = json.loads(response.text)

        # Loop over the results and yield the company name and share ID
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            yield {
                'company:name':company_name,
                'company_share_id':company_share_id
            }

Output:

2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '??????', 'company_share_id': '001020'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '????', 'company_share_id': '090080'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '?????', 'company_share_id': '010770'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '???', 'company_share_id': '005490'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '?????', 'company_share_id': '058430'}
