How can Scrapy download PDFs whose links have no .pdf extension?

I am trying to download PDFs, but on https://ratsinformation.stadt-koeln.de/si0057.asp?__ksinr=23723 I cannot find any .pdf links for Scrapy to scrape. For example, the download URL https://ratsinformation.stadt-koeln.de/getfile.asp?id=850608&type=do has no .pdf extension.

Can Scrapy also handle getfile.asp links and detect the file itself?

This is how I collect all PDF links on a given page:

import scrapy
from scrapy.pipelines.files import FilesPipeline


class PdfPipeline(FilesPipeline):
    # to save with the name of the pdf from the website instead of hash
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return file_name


class StadtKoelnAmtsblattSpider(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://ratsinformation.stadt-koeln.de/si0057.asp?__ksinr=23723']

    custom_settings = {
        "ITEM_PIPELINES": {
            PdfPipeline: 100
        },
        "FILES_STORE": "downloaded_files"
    }

    def parse(self, response):
        links = response.xpath("//a[@class='btn btn-blue']/@href").getall()
        links = [response.urljoin(link) for link in links]  # to make them absolute urls

        yield {
            "file_urls": links
        }

Every time a file download is attempted, I get this error:

OSError: [Errno 22] Invalid argument: 'downloaded_files\\getfile.asp?id=821665&type=do'

A uj5u.com user replied:

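The error comes from PdfPipeline. The URL does not end in a file name, so file_path returns getfile.asp?id=821665&type=do, and the ? character is not allowed in a Windows file name (hence Errno 22). Instead, take the file name from the link text in the parse method, put it on the item, and read it back in the pipeline's file_path method, as the spider below does.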
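To see where the invalid path comes from, the file_path logic from the question can be run on its own, without Scrapy. This is a minimal sketch that assumes the URL from the traceback; `name_from_url` and `WINDOWS_FORBIDDEN` are hypothetical names introduced here for illustration only.

```python
# Characters that Windows does not allow in file names.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def name_from_url(url):
    """Mirror the question's file_path: keep everything after the last slash."""
    return url.split('/')[-1]


url = "https://ratsinformation.stadt-koeln.de/getfile.asp?id=821665&type=do"
file_name = name_from_url(url)

# Which characters in the resulting name would be rejected on Windows?
bad_chars = sorted(set(file_name) & WINDOWS_FORBIDDEN)

print(file_name)
print(bad_chars)
```

The query string ends up in the file name, which is why the name has to come from somewhere other than the URL.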

import scrapy
from scrapy.pipelines.files import FilesPipeline


class PdfPipeline(FilesPipeline):
    # to save with the name of the pdf from the website instead of hash
    def file_path(self, request, response=None, info=None, *, item=None):
        return item["filename"]


class StadtKoelnAmtsblattSpider(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://ratsinformation.stadt-koeln.de/si0057.asp?__ksinr=23723']

    custom_settings = {
        "ITEM_PIPELINES": {
            PdfPipeline: 100
        },
        "FILES_STORE": "downloaded_files",
        "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36"
    }

    def parse(self, response):
        for i, item in enumerate(response.xpath("//a[contains(@title, 'Dokument Download')]")):
            title = item.xpath("./text()").get()
            urls = item.xpath("./@href").getall()
            if title:
                yield {
                    "filename": title + str(i) + ".pdf",  # index suffix avoids duplicate file names
                    "file_urls": [response.urljoin(url) for url in urls]
                }
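
The link text itself may contain characters that are not valid in file names, a case the answer above does not cover. As a hedged sketch, a small helper could clean the title before it is yielded as "filename". The name `safe_pdf_name`, the replacement rule, and the sample title are assumptions made for illustration, not part of the original answer.

```python
import re


def safe_pdf_name(title, index):
    """Build a file name from link text plus an index that keeps names unique."""
    # Collapse runs of whitespace (including newlines) and trim the ends.
    cleaned = re.sub(r"\s+", " ", title).strip()
    # Replace characters that Windows rejects, as well as control characters.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", cleaned)
    # Fall back to a generic stem if nothing usable remains.
    if not cleaned:
        cleaned = "document"
    return f"{cleaned}_{index}.pdf"


# Hypothetical link text, used only to show the effect.
print(safe_pdf_name("Einladung: Sitzung 12/2022", 3))
```

In the parse method, the "filename" value could then be written as safe_pdf_name(title, i) instead of the plain string concatenation.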
