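I am trying to scrape information from the Investing website based on a stock's ISIN code.
When I fill in the form at the top of the site with the ISIN code, an XHR request is sent as a POST request. This is the JSON content I get back: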
{"total":{"articles":10,"allResults":16,"quotes":6},"score":{"articles":25.00122},"articles":[...],
"quotes":[
{"pairId":386,"name":"Accor SA","flag":"France","link":"\/equities\/accor","symbol":"ACCP","type":"Action - Paris","pair_type_raw":"Equities","pair_type":"equities","countryID":22,"sector":2,"region":6,"industry":55,"isCrypto":false,"exchange":"Paris","exchangeID":9},
{"pairId":948559,"name":"Accor SA","flag":"UK","link":"\/equities\/accor?cid=948559","symbol":"0H59","type":"Action - Londres","pair_type_raw":"Equities","pair_type":"equities","countryID":4,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"Londres","exchangeID":3},
{"pairId":33386,"name":"Accor SA","flag":"France","link":"\/equities\/accor?cid=33386","symbol":"ACp","type":"Action - BATS Europe","pair_type_raw":"Equities","pair_type":"equities","countryID":22,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"BATS Europe","exchangeID":121},
{"pairId":963294,"name":"Accor SA","flag":"Germany","link":"\/equities\/accor?cid=963294","symbol":"ACCP","type":"Action - Francfort","pair_type_raw":"Equities","pair_type":"equities","countryID":17,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"Francfort","exchangeID":104},
{"pairId":963914,"name":"Accor SA","flag":"Germany","link":"\/equities\/accor?cid=963914","symbol":"ACCP","type":"Action - TradeGate","pair_type_raw":"Equities","pair_type":"equities","countryID":17,"sector":0,"region":6,"industry":0,"isCrypto":false,"exchange":"TradeGate","exchangeID":105},
{"pairId":993697,"name":"Accor SA","flag":"Mexico","link":"\/equities\/accor?cid=993697","symbol":"ACCN","type":"Action - Mexico","pair_type_raw":"Equities","pair_type":"equities","countryID":7,"sector":16,"region":2,"industry":129,"isCrypto":false,"exchange":"Mexico","exchangeID":53}]}
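To make the structure of that response concrete, here is a minimal sketch of how the "quotes" array could be turned into a data frame with jsonlite and filtered by exchange. The JSON string is a trimmed excerpt of the response (two listings, four fields each), and the variable names `json_txt`, `quotes` and `paris` are only illustrative.

```r
library(jsonlite)

# Trimmed excerpt of the response above: two listings, a few fields each.
json_txt <- '{"quotes":[
  {"pairId":386,"name":"Accor SA","symbol":"ACCP","exchange":"Paris"},
  {"pairId":948559,"name":"Accor SA","symbol":"0H59","exchange":"Londres"}]}'

# fromJSON() simplifies an array of flat objects into a data frame.
quotes <- fromJSON(json_txt)$quotes

# Keep only the Paris listing and read its identifier.
paris <- quotes[quotes$exchange == "Paris", ]
print(paris$pairId)
```

The same filtering should apply to the full response once it has been retrieved as JSON text.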
From the browser's inspection tool I derived a POST request, so that I retrieve only the JSON I need rather than the whole page:
library(httr)
codeIsin <- 'FR0000120404'
investing_url <- list(scheme="https",
host="fr.investing.com",
filename="/search/service/searchTopBar")
investing_url <- modify_url(url="",
scheme=investing_url$scheme,
hostname=investing_url$host,
path=investing_url$filename)
investing_query <- paste0("search_text=",codeIsin)
investing_headers <- list("Host" = "fr.investing.com",
"User-Agent" = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
"Accept" = "application/json, text/javascript, */*; q=0.01",
"Accept-Language" = "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
"Accept-Encoding" = "gzip, deflate, br",
"Content-Type" = "application/x-www-form-urlencoded",
"X-Requested-With" = "XMLHttpRequest",
"Content-Length" = "23",
"Origin" = "https://fr.investing.com",
"Connection" = "keep-alive",
"Pragma" = "no-cache",
"Cache-Control" = "no-cache",
"TE" = "Trailers"
)
response <- POST(url = investing_url,
query = investing_query,
header = investing_headers)
I get raw content back:
typeof(response$content)
[1] "raw"
response$content
[1] 20 3c 21 44 4f 43 54 59 50 45 20 48 54 4d 4c 3e 0a 3c 68 74 6d 6c 20 64 69 72 3d 22 6c 74 72 22 20
[34] 78 6d 6c 6e 73 3d 22 68 74 74 70 3a 2f 2f 77 77 77 2e 77 33 2e 6f 72 67 2f 31 39 39 39 2f 78 68 74
...
[958] 65 35 2a 64 29 3b 65 2b 3d 27 3b 65 78 70 69 72 65 73 3d 22 27 3b 65 2b 3d 6e 2e 74 6f 47 4d 54 53
[991] 74 72 69 6e 67 28 29 3b 65 2b
[ reached getOption("max.print") -- omitted 688441 entries ]
After decoding it with content(response, "text"), it appears to be the site's home page.
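The raw dump already hints at this before any decoding. As a small base-R sketch, the first 17 bytes printed above can be converted back to text by hand; the variable names are illustrative only.

```r
# First 17 bytes of response$content, copied from the dump above.
hex <- "20 3c 21 44 4f 43 54 59 50 45 20 48 54 4d 4c 3e 0a"

# Parse each hex pair as an integer, then convert to a raw vector.
bytes <- as.raw(strtoi(strsplit(hex, " ")[[1]], base = 16L))

# Raw bytes to a character string.
decoded <- rawToChar(bytes)
print(decoded)

# An HTML doctype at the start means the body is a web page, not JSON.
is_html <- startsWith(trimws(decoded), "<!DOCTYPE")
print(is_html)
```

A JSON body would instead begin with the byte 7b, the opening brace.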
response$request shows that not all of the headers were sent, in particular "Content-Type" = "application/x-www-form-urlencoded":
> response$request
<request>
POST https://fr.investing.com/search/service/searchTopBar?search_text=FR0000120404
Output: write_memory
Options:
* useragent: libcurl/7.74.0 r-curl/4.3 httr/1.4.2
* post: TRUE
* postfieldsize: 0
Headers:
* Accept: application/json, text/xml, application/xml, */*
* Content-Type:
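What is wrong with my request?
Answer from a uj5u.com user:
If you are not too attached to the syntax you used, you can switch to the following. Note that I added a cookie header to allow the onward redirect within httr: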
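library(httr)
library(jsonlite)
library(rvest)  # provides html_element() and html_text() used below

# Headers are passed as a named character vector through add_headers()
headers = c(
'user-agent' = 'Safari/537.36',
'x-requested-with' = 'XMLHttpRequest',
'cookie' = 'adBlockerNewUserDomains=on')

# Form fields go in the request body, not in the query string
data = list(
'search_text' = 'FR0000120404'
)

# The native pipe |> requires R 4.1 or later
r <- httr::POST(url = 'https://fr.investing.com/search/service/searchTopBar',
httr::add_headers(.headers = headers),
body = data, encode = 'form') |>
content() |>
html_element('p') |>
html_text() |>
jsonlite::parse_json()
r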
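For readers who prefer to stay closer to the question's layout, the same two changes can be sketched as follows: headers are supplied through add_headers(), and the search term travels in a form-encoded body rather than in `query`, which in the question ended up in the URL with a postfieldsize of 0. The helper name `search_isin` is hypothetical, the cookie value is taken from the answer, and the sketch assumes the endpoint still behaves as described in this thread.

```r
library(httr)

# Headers must be wrapped in add_headers(); the request dump in the question
# shows that the list passed as `header =` was never sent.
search_headers <- add_headers(
  "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
  "X-Requested-With" = "XMLHttpRequest",
  "cookie" = "adBlockerNewUserDomains=on"  # value taken from the answer
)

# Hypothetical helper: POST the ISIN as a form field and return the body as text.
search_isin <- function(isin) {
  resp <- POST(
    url = "https://fr.investing.com/search/service/searchTopBar",
    search_headers,
    body = list(search_text = isin),  # sent in the body, not the URL
    encode = "form"                   # sets the form-urlencoded Content-Type
  )
  stop_for_status(resp)
  content(resp, as = "text", encoding = "UTF-8")
}

# Usage (performs a network request):
# txt <- search_isin("FR0000120404")
```

With encode = "form", httr sets Content-Type and Content-Length itself, so they need not be listed by hand. If the returned text is wrapped in HTML, as the pipeline in the answer suggests, it would still need the html_element('p') extraction before JSON parsing.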
