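I am trying to accomplish the following:
- Find all .PDF files on a web page that requires a login
- Rename the .PDF files to just the file name instead of the full URL
- Create a folder on the local user's desktop
- Only download files that do not already exist in the created folder
- Download the given .PDF files to the new folder
The code below logs in to the website, retrieves all .PDF links, strips each name down to the file name, and downloads the files to the folder. However, all of the downloaded files appear to be corrupt (they cannot be opened).
Any feedback or suggestions on how to fix this would be much appreciated. (The payload has been changed so that no credentials are leaked.)
Additional information:
Sampleurl is the home page of the website after login. loginurl is the page where the user is authenticated. secure_url is the page that contains all the .PDF files I want to download.
Code: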
# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request

# Fetch username
username = os.getlogin()

# Set folder location to the local user's desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = ('https://www.tict.io')
loginurl = ('https://www.tict.io/auth/login')
secure_url = ('https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca')

payload = {
    'username': 'xxxx',
    'password': 'xxx',
    'ltfejs': 'xx'
}

with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))

    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if(el['href'].startswith('https:')):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    pprint(url_list)

    print("Downloading .PDF files")
    # download the pdfs to a specified location
    for url in url_list:
        print(url)
        fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        print(fullfilename)
        request.urlretrieve(Sampleurl,fullfilename)

print("This program will automatically close in 5 seconds ")
time.sleep(5)
Output:
Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\quickscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\fullscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\improvementscan.pdf
https://www.tict.io/downloads/privacylabel.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\privacylabel.pdf
This program will automatically close in 5 seconds
When I manually click one of the hyperlinks in the output, it downloads a valid .PDF.
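A quick way to see what actually ended up on disk is to look at the first bytes of a saved file: a PDF normally begins with the marker `%PDF-`, whereas a saved web page begins with HTML markup. The sketch below is only a diagnostic aid; the helper names `looks_like_pdf` and `file_looks_like_pdf` are hypothetical and not part of the original script.

```python
PDF_MAGIC = b"%PDF-"  # header marker that PDF files normally start with


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the given bytes start with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def file_looks_like_pdf(path) -> bool:
    """Read only the first few bytes of a file and check them."""
    with open(path, "rb") as fh:
        return looks_like_pdf(fh.read(len(PDF_MAGIC)))


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n..."))            # a PDF-style header
    print(looks_like_pdf(b"<!DOCTYPE html><html>"))    # an HTML page
```

If a "corrupt" file fails this check and opens as text in an editor, the script most likely stored a web page (for example a home or login page) under a .pdf name.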
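Edit
I have adjusted my code, and it now does download a working PDF to the assigned folder. However, it only fetches the last file in the list and does not repeat the loop for the other files.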
print("Downloading .PDF files")
# download the pdfs to a specified location
for PDF in url_list:
    fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
    if not os.path.exists(folder_location):
        os.mkdir(folder_location)
    myfile = requests.get(PDF)
    open(fullfilename, 'wb').write(myfile.content)

print("This program will automatically close in 5 seconds ")
time.sleep(5)
Only privacylabel.pdf (the last file in url_list) is downloaded. The others do not show up in the folder. When printing PDF, it also only returns privacylabel.pdf.
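A likely contributor to this symptom: the edited loop iterates with `PDF`, but the file name is still built from `url`. If `url` is left over from an earlier loop, it keeps pointing at the last entry, so every download is written to the same path. The sketch below reproduces only that naming behaviour, using the URLs from the output above; the variable names `stale_names` and `fixed_names` are illustrative.

```python
url_list = [
    "https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf",
    "https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf",
    "https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf",
    "https://www.tict.io/downloads/privacylabel.pdf",
]

# An earlier loop leaves `url` bound to the last element of the list.
for url in url_list:
    pass

# Buggy naming: the loop variable is PDF, but the name comes from `url`.
stale_names = []
for PDF in url_list:
    stale_names.append(url.split("/")[-1])

# Fixed naming: derive the name from the loop variable itself.
fixed_names = []
for PDF in url_list:
    fixed_names.append(PDF.split("/")[-1])

print(stale_names)
print(fixed_names)
```

With the stale variable, all four iterations target privacylabel.pdf, which would leave a single file in the folder.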
Answer:
It is working now.
I forgot to make the request through the session s.
myfile = requests.get(PDF)
should have been
myfile = s.get(PDF)
Working code for anyone who is interested:
# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request

# Fetch username
username = os.getlogin()

# Set folder location to the local user's desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = ('https://www.tict.io')
loginurl = ('https://www.tict.io/auth/login')
secure_url = ('https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca')

# Ask for the credentials instead of hard-coding them
Username = input("Username: ")
Password = input("Password: ")

payload = {
    'username': Username,
    'password': Password,
    'ltfejs': 'xxx'
}

with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))

    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if(el['href'].startswith('https:')):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    pprint(url_list)

    print("Downloading .PDF files")
    # download the pdfs to a specified location
    for url in url_list:
        fullfilename = os.path.join(folder_location, url.split("/")[-1])
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        # use the logged-in session, not a bare requests.get
        myfile = s.get(url)
        print(url)
        print("Myfile response:",myfile)
        open(fullfilename, 'wb').write(myfile.content)

print("This program will automatically close in 5 seconds ")
time.sleep(5)
Output:
Username: xxxx
Password: xxxx
Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/downloads/privacylabel.pdf
Myfile response: <Response [200]>
This program will automatically close in 5 seconds
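The link-cleaning step above builds absolute URLs by concatenating Sampleurl with the href. The script already imports `urljoin` from `urllib.parse` without using it; that function covers both the absolute and the root-relative case in a single call. A minimal sketch follows, where the href values are assumed examples rather than values taken from the real page:

```python
from urllib.parse import urljoin

secure_url = "https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca"

# Assumed example hrefs: one root-relative, one already absolute.
hrefs = [
    "/downloads/privacylabel.pdf",
    "https://www.tict.io/downloads/privacylabel.pdf",
]

# urljoin resolves each href against the page it was found on.
absolute = [urljoin(secure_url, href) for href in hrefs]
print(absolute)
```

Note that an href without a leading slash is resolved relative to the last path segment of the base URL, so the page URL the links came from is the appropriate base.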
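Conclusion
- I had to make the download requests through the session s; because I had forgotten to do so, the files could not be accessed
- I had to change the download code slightly, because the original attempt used urlretrieve instead of downloading with requests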
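One of the original goals, downloading only files that are not yet in the folder, is not covered by the working code, which overwrites existing files on every run. The sketch below shows one way to filter the URL list before downloading. The helper names `filename_from_url` and `pending_downloads` are hypothetical, and the sketch only decides what to fetch; the download itself would still go through the logged-in session.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def pending_downloads(url_list, folder):
    """Create the folder if needed and return (url, target_path) pairs
    for files that are not present in it yet."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    pending = []
    for url in url_list:
        target = folder / filename_from_url(url)
        if not target.exists():
            pending.append((url, target))
    return pending
```

In the working code, the download loop could then iterate over `pending_downloads(url_list, folder_location)` and write `s.get(url).content` to each returned target path.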
