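I am trying to download PDFs, but on https://ratsinformation.stadt-koeln.de/si0057.asp?__ksinr=23723 I cannot see any .pdf links that Scrapy could scrape. For example, the URL https://ratsinformation.stadt-koeln.de/getfile.asp?id=850608&type=do has no .pdf in it.
Can Scrapy also handle getfile.asp links and detect the file itself?
This is how I get all the PDF links on a given page: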
import scrapy
from scrapy.pipelines.files import FilesPipeline


class PdfPipeline(FilesPipeline):
    # to save with the name of the pdf from the website instead of hash
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return file_name


class StadtKoelnAmtsblattSpider(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://ratsinformation.stadt-koeln.de/si0057.asp?__ksinr=23723']
    custom_settings = {
        "ITEM_PIPELINES": {
            PdfPipeline: 100
        },
        "FILES_STORE": "downloaded_files"
    }

    def parse(self, response):
        links = response.xpath("//a[@class='btn btn-blue']/@href").getall()
        links = [response.urljoin(link) for link in links]  # to make them absolute urls
        yield {
            "file_urls": links
        }
Every time I try to download a file, I get this error:
OSError: [Errno 22] Invalid argument: 'downloaded_files\\getfile.asp?id=821665&type=do'
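Answer from a uj5u.com user:
The error is raised in PdfPipeline because the URL does not contain a file name. You therefore have to extract a file name in the parse method and then pick it up in the pipeline, like this: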
import scrapy
from scrapy.pipelines.files import FilesPipeline


class PdfPipeline(FilesPipeline):
    # save under the name taken from the page instead of a hash of the URL
    def file_path(self, request, response=None, info=None, *, item=None):
        return item["filename"]


class StadtKoelnAmtsblattSpider(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://ratsinformation.stadt-koeln.de/si0057.asp?__ksinr=23723']
    custom_settings = {
        "ITEM_PIPELINES": {
            PdfPipeline: 100
        },
        "FILES_STORE": "downloaded_files",
        "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36"
    }

    def parse(self, response):
        # each download link carries a title attribute containing 'Dokument Download'
        for i, link in enumerate(response.xpath("//a[contains(@title, 'Dokument Download')]")):
            title = link.xpath("./text()").get()
            urls = link.xpath("./@href").getall()
            if title:
                yield {
                    # append the index to take care of duplicated file names
                    "filename": title + str(i) + ".pdf",
                    "file_urls": [response.urljoin(url) for url in urls]
                }
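For context on the original error: request.url.split('/')[-1] yields 'getfile.asp?id=821665&type=do', and Windows does not allow '?' in a file name, hence the OSError. If the link text turns out to be unsuitable as a name, one alternative is to build the name from the id query parameter, or to strip unsafe characters from the title first. The sketch below is only an illustration of that idea, not part of the answer above; the helper names name_from_url and safe_name, and the fallback "download", are hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs

# Characters that Windows rejects in file names, plus control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def name_from_url(url, default="download"):
    """Build a name such as '850608.pdf' from a getfile.asp?id=... URL.

    Assumes the document id is carried in the 'id' query parameter;
    falls back to `default` when that parameter is missing.
    """
    query = parse_qs(urlparse(url).query)
    file_id = query.get("id", [default])[0]
    return f"{file_id}.pdf"


def safe_name(title, index):
    """Replace characters that are invalid on Windows and append an index."""
    cleaned = _UNSAFE.sub("_", title).strip()
    return f"{cleaned}_{index}.pdf"
```

Either helper could be called where the file name is produced, for example return name_from_url(request.url) inside file_path, or safe_name(title, i) when building the item in parse. Note that this assumes every download really is a PDF; the URL alone does not confirm that.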
