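I'm trying to download a book from Fadedpage, such as this one. If you click the HTML file link there, it displays the HTML file. The URL appears to be https://www.fadedpage.com/books/20170817/html.php. However, if you try to download that URL by any of the usual means, you only get the metadata HTML, not the HTML containing the book's full text. For example, running wget https://www.fadedpage.com/books/20170817/html.php from the command line does return HTML, but it is the metadata HTML file from https://www.fadedpage.com/showbook.php?pid=20170817, not the full text of the book.
Here is what I have tried so far: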
    import requests

    def downloadFile(bookID, fileType="html"):
        url = f"https://www.fadedpage.com/books/{bookID}/{fileType}.php"
        #url = f'https://www.fadedpage.com/link.php?file={bookID}.{fileType}'
        # Headers copied from the browser's own request.
        headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                   "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.3 Chrome/87.0.4280.144 Safari/537.36",
                   # f-string so that {bookID} is actually substituted
                   "referer": f"https://www.fadedpage.com/showbook.php?pid={bookID}",
                   "sec-fetch-dest": "document",
                   "sec-fetch-mode": "navigate",
                   "sec-fetch-site": "same-origin",
                   "sec-fetch-user": "?1",
                   "upgrade-insecure-requests": "1",
                   "cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"
                   }
        print("Getting ", url)
        resp = requests.get(url, headers=headers, cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"})
        if resp.ok:
            return resp.text
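I'm trying to give it the same headers my web browser sends, hoping it will return the same content. But it doesn't work.
Is there something else I need to do to download this HTML file? Because it is served by PHP on the server side, I'm having a hard time reverse engineering it.
For reference, the full HTML file contains a sentence beginning "The First Part of this book is intended for pupils" (it goes on to describe pupils advanced enough to distinguish the parts of speech). That text does not appear in the metadata HTML file.
Test
Here is another way to test it: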
    def isValidDownload(bookID, fileType="html"):
        """
        A download of `downloadFile("20170817", "html")` should produce
        a file 20170817.html which contains the text "It was a woodland
        slope behind St. Pierre-les-Bains". If it doesn't, it isn't getting
        the full text file.
        """
        with open(f"{bookID}.{fileType}") as f:
            raw = f.read()
        test = "woodland slope behind St. Pierre-les-Bains"
        return test in raw
This should return True:
    downloadFile("20170817", "html")
    isValidDownload("20170817", "html")
    False
Another attempt
A simpler version based on the answer below doesn't work either. Here it is all together:
    import requests

    def downloadFile(bookID, fileType):
        headers = {"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}
        url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
        print("Getting ", url)
        with requests.get(url, headers=headers) as resp:
            with open(f"{bookID}.{fileType}", 'wb') as f:
                f.write(resp.content)

    def isValidDownload(bookID, fileType="html"):
        """
        A download of `downloadFile("20170817", "html")` should produce
        a file 20170817.html which contains the text "It was a woodland
        slope behind St. Pierre-les-Bains". If it doesn't, it isn't getting
        the full text file.
        """
        with open(f"{bookID}.{fileType}") as f:
            raw = f.read()
        test = "woodland slope behind St. Pierre-les-Bains"
        return test in raw

    downloadFile("20170817", "html")
    isValidDownload("20170817", "html")
That returns False.
Answer from a uj5u.com user:
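- Pass cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"} instead of headers={"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}.
  This is because the requests library runs headers.pop('Cookie', None) when it follows a redirect, so a cookie supplied only as a header is lost after the first hop.
- Retry if resp.url is not f"https://www.fadedpage.com/books/{bookID}/{fileType}.php".
  This is because the server at first redirects link.php to showbook.php, with a different bookID.
- The download from downloadFile("20170817", "html") contains the text "The First Part of this book is intended for pupils", not "woodland slope behind St. Pierre-les-Bains". The latter is contained in the download from downloadFile("20130603", "html").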
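The first point can be checked without touching Fadedpage at all. The following is a minimal sketch, not part of the original answer: it starts a throwaway local server whose `/start` path redirects to `/final`, and `/final` echoes back whatever `Cookie` header it received. The handler class, the helper name `cookie_seen_after_redirect`, and the cookie value `abc` are all made up for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class EchoCookieHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /start redirects, anything else echoes the Cookie header."""

    def do_GET(self):
        if self.path == "/start":
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = (self.headers.get("Cookie") or "").encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def cookie_seen_after_redirect(**request_kwargs):
    """Return the Cookie header that the redirect target received."""
    server = HTTPServer(("127.0.0.1", 0), EchoCookieHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        session = requests.Session()
        session.trust_env = False  # ignore any proxy settings for this local request
        url = f"http://127.0.0.1:{server.server_port}/start"
        return session.get(url, timeout=5, **request_kwargs).text
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(repr(cookie_seen_after_redirect(headers={"cookie": "PHPSESSID=abc"})))
    print(repr(cookie_seen_after_redirect(cookies={"PHPSESSID": "abc"})))
```

With the header variant the redirect target should see no cookie, while the `cookies=` variant should still carry `PHPSESSID` after the redirect, which matches the behaviour described in the first bullet.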
Putting these points together:
    import requests

    def downloadFile(bookID, fileType, retry=1):
        # Pass the session cookie via cookies= so it survives redirects.
        cookies = {"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"}
        url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
        print("Getting ", url)
        with requests.get(url, cookies=cookies) as resp:
            # If we did not end up on the book's own page, retry once.
            if resp.url != f"https://www.fadedpage.com/books/{bookID}/{fileType}.php":
                if retry:
                    return downloadFile(bookID, fileType, retry=retry-1)
                else:
                    raise Exception(f"unexpected final URL: {resp.url}")
            with open(f"{bookID}.{fileType}", 'wb') as f:
                f.write(resp.content)

    def isValidDownload(bookID, fileType="html"):
        """
        A download of `downloadFile("20170817", "html")` should produce
        a file 20170817.html which contains the text "The First Part of
        this book is intended for pupils". If it doesn't, it isn't getting
        the full text file.
        """
        with open(f"{bookID}.{fileType}") as f:
            raw = f.read()
        test = ""
        if bookID == "20130603":
            test = "woodland slope behind St. Pierre-les-Bains"
        if bookID == "20170817":
            test = "The First Part of this book is intended for pupils"
        return test in raw
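One caveat about the check above: for any book ID other than the two listed, `test` stays an empty string, and `"" in raw` is always True, so the function reports success without checking anything. The sketch below is one possible stricter variant, not the answerer's code. The names `MARKERS`, `expected_url`, `landed_on_book`, and `is_full_text` are hypothetical, and the marker strings are the two quoted in the answer. It works on strings rather than files so it can be tried without any network access.

```python
# Marker sentences quoted in the answer, keyed by Fadedpage book ID.
MARKERS = {
    "20130603": "woodland slope behind St. Pierre-les-Bains",
    "20170817": "The First Part of this book is intended for pupils",
}


def expected_url(book_id, file_type="html"):
    """URL the download should finally land on, per the answer's retry check."""
    return f"https://www.fadedpage.com/books/{book_id}/{file_type}.php"


def landed_on_book(final_url, book_id, file_type="html"):
    """True if the final URL (e.g. resp.url) is the book page, not showbook.php."""
    return final_url == expected_url(book_id, file_type)


def is_full_text(raw, book_id):
    """True if the text contains the known marker for this book.

    Raises KeyError for books with no recorded marker instead of
    silently returning True.
    """
    marker = MARKERS.get(book_id)
    if marker is None:
        raise KeyError(f"no marker text recorded for book {book_id}")
    return marker in raw
```

A caller could pass `resp.url` to `landed_on_book` and `resp.text` to `is_full_text` after a download to decide whether a retry is needed.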
