Downloading PDFs behind HTTPS with requests/BeautifulSoup does not work

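I am trying to accomplish the following:
- Find all .PDF files on a web page that requires a login
- Rename the .PDF files to just the file name instead of the full URL
- Create a folder on the local user's desktop
- Only download files that do not already exist in the created folder
- Download the given .PDF files to the new folder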

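The code below logs in to the website, retrieves all the .PDF links, strips each name down to the file name, and downloads the files into the folder. However, all the downloaded files appear to be corrupted (they cannot be opened).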

Any feedback or suggestions on how to fix this would be much appreciated. (The payload has been changed so that no credentials are leaked.)

Additional information:

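Sampleurl is the home page of the website after login. loginurl is the page where the user is authenticated. secure_url is the page that contains all the .PDFs I want to download.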
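As a side note on how these URLs relate: the href values found on secure_url may be absolute or relative, and `urllib.parse.urljoin` (imported in the code below but never used) can resolve both cases against the page they came from. A minimal sketch, where the helper name `absolute_links` and the sample hrefs are made up for illustration:

```python
from urllib.parse import urljoin

# Page the links were scraped from (secure_url in the question).
secure_url = 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca'


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical href values: one root-relative, one already absolute.
sample_hrefs = [
    '/downloads/privacylabel.pdf',
    'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
]
resolved = absolute_links(secure_url, sample_hrefs)
for link in resolved:
    print(link)
```

This would replace the manual `startswith('https:')` check, under the assumption that every href is either absolute or relative to the page it was found on.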

Code:

# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request

# Fetch username
username = os.getlogin()    

# Set folder location to local users desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = ('https://www.tict.io')
loginurl =('https://www.tict.io/auth/login')
secure_url = ('https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca')



payload = {
    'username': 'xxxx',
    'password': 'xxx',
    'ltfejs': 'xx'
    
}



  
with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))


    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if(el['href'].startswith('https:')):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    
    pprint(url_list)


    
    print("Downloading .PDF files")
        
    # download the pdfs to a specified location
    for url in url_list:
        print(url)
        fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
        if not os.path.exists(folder_location):os.mkdir(folder_location)    
        print(fullfilename)
        request.urlretrieve(Sampleurl,fullfilename)

     
            
print("This program will automatically close in 5 seconds ")
time.sleep(5)

Output

Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
 'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\quickscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\fullscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\improvementscan.pdf
https://www.tict.io/downloads/privacylabel.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\privacylabel.pdf
This program will automatically close in 5 seconds

When I manually click one of the hyperlinks in the output, it downloads a valid .PDF.
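One way to see what was actually saved is to look at the first bytes of a downloaded file: a PDF normally begins with the marker `%PDF-`, whereas an HTML page (a login page or the home page, for example) does not. A minimal sketch; `looks_like_pdf` is an illustrative helper, not part of the original code, and the byte strings are stand-ins rather than real downloads:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(b'%PDF-')


# Illustrative byte strings, not real downloads.
pdf_bytes = b'%PDF-1.7\n...'
html_bytes = b'<!DOCTYPE html><html><body>Login</body></html>'

print(looks_like_pdf(pdf_bytes))
print(looks_like_pdf(html_bytes))
```

Note that the download loop above passes Sampleurl, not url, to urlretrieve, and that urlretrieve does not go through the logged-in session, so the saved files are likely HTML pages rather than PDFs.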

Edit

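I have adjusted my code, and it now downloads a working PDF into the assigned folder. However, it only takes the last file in the list and does not repeat the loop for the other files.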

    print("Downloading .PDF files")
        
    # download the pdfs to a specified location
    for PDF in url_list:
        fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
        if not os.path.exists(folder_location):os.mkdir(folder_location)    
        myfile = requests.get(PDF) 
        open(fullfilename, 'wb').write(myfile.content)
        

print("This program will automatically close in 5 seconds ")
time.sleep(5)

Only privacylabel.pdf (the last file in url_list) is downloaded. The others do not appear in the folder. When printing PDF, it also only returns privacylabel.pdf.
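A possible contributing factor, judging from the snippet alone: the loop variable is `PDF`, but the file name is still built from `url`. If `url` is left over from an earlier loop or an earlier run in the same interpreter (an assumption, since the full script is not shown), every response is written to the same path and only one file remains. The sketch below reproduces that effect without any network access:

```python
url_list = [
    'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
    'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
    'https://www.tict.io/downloads/privacylabel.pdf',
]

# Simulate the stale variable left behind by a previous "for url in url_list" loop.
url = url_list[-1]

stale_names = []
fresh_names = []
for PDF in url_list:
    stale_names.append(url.split('/')[-1])  # built from the stale variable
    fresh_names.append(PDF.split('/')[-1])  # built from the loop variable

print(stale_names)
print(fresh_names)
```

Using the same name for the loop variable and the file name expression avoids the overwrite.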

Answer:

Working

I forgot to make the request through the session s

myfile = requests.get(PDF)

should have been

myfile = s.get(PDF)
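The reason this matters is that a `requests.Session` keeps the cookies received during login and sends them with later requests, while a bare `requests.get` starts with no cookies. The sketch below shows this offline by preparing requests without sending them. The cookie name `sessionid` and its value are placeholders, since the site's real cookie is not shown in the question:

```python
import requests

pdf_url = 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf'

with requests.Session() as s:
    # Stand-in for the cookie the server would set after s.post(loginurl, ...).
    s.cookies.set('sessionid', 'abc123', domain='www.tict.io', path='/')

    # Prepared through the session: the stored cookie is attached.
    with_session = s.prepare_request(requests.Request('GET', pdf_url))

# Prepared on its own, as a bare requests.get would do: no cookie is attached.
without_session = requests.Request('GET', pdf_url).prepare()

print(with_session.headers.get('Cookie'))
print(without_session.headers.get('Cookie'))
```

Without the cookie, the server presumably treats the request as unauthenticated, which would explain why only the public privacylabel.pdf came through.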

Working code for anyone interested:

# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request


# Fetch username
username = os.getlogin()    

# Set folder location to local users desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = ('https://www.tict.io')
loginurl =('https://www.tict.io/auth/login')
secure_url = ('https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca')


    

Username = input("Username: ")
Password = input("Password: ")

payload = {
    'username': (Username),
    'password': (Password),
    'ltfejs': 'xxx'
    
}

  
with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))

    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if(el['href'].startswith('https:')):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    
    pprint(url_list)

   
    
    print("Downloading .PDF files")
    
    # download the pdfs to a specified location, reusing the logged-in session
    for url in url_list:
        fullfilename = os.path.join(folder_location, url.split("/")[-1])
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        myfile = s.get(url)
        print(url)
        print("Myfile response:",myfile)
        # write inside a with-block so the file is closed after each download
        with open(fullfilename, 'wb') as f:
            f.write(myfile.content)
                

print("This program will automatically close in 5 seconds ")
time.sleep(5)

Output

Username: xxxx
Password: xxxx
Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
 'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/downloads/privacylabel.pdf
Myfile response: <Response [200]>
This program will automatically close in 5 seconds

Conclusion

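I had to make the requests through the session s; because I forgot to, the files could not be accessed.
I also had to change the download code slightly, because the original attempt used urlretrieve rather than requests to download.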
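Combining this conclusion with the original goal of skipping files that already exist, the download step could be sketched as follows. The function name `download_missing_pdfs` is made up for this example, and `session` is assumed to be the logged-in `requests.Session` (`s` in the code above); any object with a compatible `get` method would work.

```python
import os


def download_missing_pdfs(session, url_list, folder_location):
    """Download each URL via the session unless the file is already in the folder.

    Returns the list of file paths that were newly written.
    """
    os.makedirs(folder_location, exist_ok=True)
    written = []
    for url in url_list:
        fullfilename = os.path.join(folder_location, url.split('/')[-1])
        if os.path.exists(fullfilename):
            continue  # already downloaded earlier, skip it
        response = session.get(url)
        response.raise_for_status()  # fail on 4xx/5xx instead of saving an error page
        with open(fullfilename, 'wb') as f:
            f.write(response.content)
        written.append(fullfilename)
    return written
```

With the variables from the working code, the call would be `download_missing_pdfs(s, url_list, folder_location)` inside the `with requests.session() as s:` block.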

Tags: Python, pdf, BeautifulSoup, https, python-requests
