我試圖抓取這個網站,但沒有得到表格資料。我什至從 Chrome 開發工具中獲得了請求資料,但我無法找出我做錯了什么。
這是我的腳本:
import requests,json
url='https://www.assetmanagement.hsbc.de/api/v1/nav/funds'
payload={"appliedFilters":[[{"active":True,"id":"Yes"}]],"paging":{"fundsPerPage":-1,"currentPage":1},"view":"Documents","searchTerm":[],"selectedValues":[],"pageInformation":{"country":"DE","language":"DE","investorType":"INST","tokenIssue":{"url":"/api/v1/token/issue"},"dataUrl":{"url":"/api/v1/nav/funds","id":"e0FFNDg5MTJELUFEMzEtNEQ5RC04MzA4LTdBQzZERTgyQTc4Rn0="},"shareClassUrl":{"url":"/api/v1/nav/shareclass","id":"ezUxODdjODJiLWY1YmItNDIzOC1hM2Y0LWY5NzZlY2JmMTU3OX0="},"filterUrl":{"url":"/api/v1/nav/filters","id":"ezRFREYxQTU3LTVENkYtNDBDRC1CMjJDLTQ0NDc4Nzc1NTlFQn0="},"presentationUrl":{"url":"/api/v1/nav/presentation","id":"e0E1NEZDODZGLUE5MDctNDUzQi04RTYyLTIxNDNBMEM1MEVGQ30="},"liveDataUrl":{"id":"ezlEMjA2MDk5LUNCRTItNENGMy1BRThBLUM0RTMwMEIzMjlDQ30="},"fundDetailPageUrl":"/de/institutional-investors/fund-centre","forceHttps":True}}
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
r = requests.post(url,headers=headers,data=payload)
print(r.content)
uj5u.com熱心網友回復:
雖然它最初缺少IFC-Cache-Headerhttp 標頭,但還有一個通過Authorization標頭傳遞的 JWT 令牌。
要檢索此令牌,您首先需要從根頁面提取值:
GET https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre
它具有以下 javacript 物件:
window.HSBC.dpas = {
"pageInformation": {
"country": "X", <========= HERE
"language": "X", <========= HERE
"tokenIssue": {
"url": "/api/v1/token/issue",
},
"dataUrl": {
"url": "/api/v1/nav/funds",
"id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXX" <========= HERE
},
....
}
}
您可以window.HSBC.dpas使用正則運算式提取javascript 物件值,然后重新格式化字串,使其成為有效的 JSON
然后將這些值傳遞到 http 標頭中,例如X-COUNTRY,X-COMPONENT并X-LANGUAGE傳遞給以下呼叫:
GET https://www.assetmanagement.hsbc.de/api/v1/token/issue
它直接回傳 JWT 令牌并將Authorization標頭添加到請求中Authorization: Bearer {token}:
GET https://www.assetmanagement.hsbc.de/api/v1/nav/funds
例子:
import requests
import re
import json
api_url = "https://www.assetmanagement.hsbc.de/api/v1"
funds_url=f"{api_url}/nav/funds"
token_url = f"{api_url}/token/issue"
# call the /fund-centre url to get the documentID value in the javascript
url = "https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre?f=Yes&n=-1&v=Documents"
r = requests.get(url,
params = {
"f":"Yes",
"n": "-1",
"v": "Documents"
})
# this gets the javascript object
res = re.search(r"^.*window\.HSBC\.dpas\s*=\s*([^;]*);", r.text, re.DOTALL)
group = res.group(1)
# convert to valid JSON: remove trailing commas: https://stackoverflow.com/a/56595068 (added "e")
regex = r'''(?<=[}\]"'e]),(?!\s*[{["'])'''
result_json = re.sub(regex, "", group, 0)
result = json.loads(result_json)
print(result["pageInformation"]["dataUrl"])
# call /token/issue API to get a token
r = requests.post(token_url,
headers= {
"X-Country": result["pageInformation"]["country"],
"X-Component": result["pageInformation"]["dataUrl"]["id"],
"X-Language": result["pageInformation"]["language"]
}, data={})
token = r.text
print(token)
# call /nav/funds API
payload={
"appliedFilters":[[{"active":True,"id":"Yes"}]],
"paging":{"fundsPerPage":-1,"currentPage":1},
"view":"Documents",
"searchTerm":[],
"selectedValues":[],
"pageInformation": result["pageInformation"]
}
headers={
"IFC-Cache-Header": "de,de,inst,documents,yes,1,n-1",
"Authorization": f"Bearer {token}"
}
r = requests.post(funds_url,headers=headers,json=payload)
print(r.content)
在 repl.it 上試試這個
轉載請註明出處,本文鏈接:https://www.uj5u.com/qukuanlian/321006.html
上一篇:如何使用美麗的湯從標簽中提取內容
下一篇:網頁抓取R中的進度條
