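I am scraping a dynamic web page and would like to optimize the process, because it is very slow. The page shows a list of sales with their details, and more sales appear as you scroll down, although the total number of sales is finite. What I do is enlarge the window so that nearly all sales load without scrolling. This still takes a long time, because there is a lot of information and many images. The information I extract is the price, the asset name, and the link associated with the asset (the one you follow when you click the image).
My goal is to make this process as fast as possible. One way would be to skip loading the images, since I don't need them, but I couldn't find a way to do that with Firefox.
Any improvement would be appreciated.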
library(RSelenium)
library(rvest)

url <- "https://cnft.io/marketplace?project=Boss Cat Rocket Club&sort=_id:-1&type=listing,offer"

# Run Firefox headless so no browser window is shown
exCap <- list("moz:firefoxOptions" = list(args = list('--headless')))
rD <- rsDriver(browser = "firefox", port = as.integer(sample(4000:4700, 1)),
               verbose = FALSE, extraCapabilities = exCap)
remDr <- rD[["client"]]

# Use a huge window so that most sales load without scrolling
remDr$setWindowSize(30000, 30000)
remDr$navigate(url)
Sys.sleep(300)  # wait for the page to finish loading
html <- remDr$getPageSource()[[1]]
remDr$close()
html <- read_html(html)
Reply from a uj5u.com user:
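After digging into the site a bit, I found the API behind all the listings: https://api.cnft.io/market/listings. It expects a POST request and returns a paginated JSON string. We can use httr to send such requests. Here is a small script for your scraping task.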
api_link <- "https://api.cnft.io/market/listings"
project <- "Boss Cat Rocket Club"

# Request one page of listings and parse the JSON response
query <- function(page, url, project) {
  httr::content(httr::POST(
    url = url,
    body = list(
      search = "",
      types = c("listing", "offer"),
      project = project,
      sort = list(`_id` = -1L),
      priceMin = NULL,
      priceMax = NULL,
      page = page,
      verified = TRUE,
      nsfw = FALSE,
      sold = FALSE,
      smartContract = FALSE
    ),
    encode = "json"
  ), simplifyVector = TRUE)
}

# Fetch pages one by one until a page comes back empty;
# "count" from the first response serves as an upper bound on the loop
query_all <- function(url, project) {
  n <- query(1L, url, project)[["count"]]
  out <- vector("list", n)
  for (i in seq_len(n)) {
    out[[i]] <- query(i, url, project)[["results"]]
    if (length(out[[i]]) < 1L)
      return(out[seq_len(i - 1L)])
  }
  out
}

# Keep only the asset id, the price, and the link to the token page
collect_data <- function(results) {
  dplyr::tibble(
    asset_id = results[["asset"]][["assetId"]],
    price = results[["price"]],
    link = paste0("https://cnft.io/token/", results[["_id"]])
  )
}

system.time(
  dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()
)
dt
Output (takes about 12 seconds to complete)
> system.time(
+   dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()
+ )
   user  system elapsed
   0.78    0.00   12.33
> dt
# A tibble: 2,161 x 3
   asset_id                     price link
   <chr>                        <dbl> <chr>
 1 BossCatRocketClub1373    222000000 https://cnft.io/token/61ce22eb4185f57d50190079
 2 BossCatRocketClub4639    380000000 https://cnft.io/token/61ce229b9163f2db80db98fe
 3 BossCatRocketClub5598    505000000 https://cnft.io/token/61ce22954185f57d5018e2ff
 4 BossCatRocketClub2673    187000000 https://cnft.io/token/61ce2281ceed93ea12ae32ec
 5 BossCatRocketClub1721    350000000 https://cnft.io/token/61ce2281398627cc52c5844c
 6 BossCatRocketClub673     300000000 https://cnft.io/token/61ce22724185f57d5018d645
 7 BossCatRocketClub5915 200000000000 https://cnft.io/token/61ce2241398627cc52c56eae
 8 BossCatRocketClub5699    350000000 https://cnft.io/token/61ce21fa398627cc52c55644
 9 BossCatRocketClub4570    350000000 https://cnft.io/token/61ce21ef4185f57d5018a9d4
10 BossCatRocketClub6125    250000000 https://cnft.io/token/61ce21e49163f2db80db58dd
# ... with 2,151 more rows
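
The stop-when-a-page-is-empty loop used in query_all can be tried without touching the network. The following is a minimal sketch and not part of the script above: fetch_page_fake and collect_pages are hypothetical names, and the fake fetcher simply serves three pages of made-up rows followed by empty pages.

```r
# Hypothetical stand-in for a paginated API:
# pages 1..3 hold two made-up rows each, later pages are empty.
fetch_page_fake <- function(page) {
  if (page > 3L) return(list())
  data.frame(id = paste0("item", (page - 1L) * 2L + 1:2), page = page)
}

# Request successive pages until one comes back empty
# (or until max_pages is reached, as a safety limit).
collect_pages <- function(fetch_page, max_pages = 100L) {
  out <- list()
  for (i in seq_len(max_pages)) {
    res <- fetch_page(i)
    if (length(res) < 1L) break
    out[[length(out) + 1L]] <- res
  }
  out
}

pages <- collect_pages(fetch_page_fake)
all_rows <- do.call(rbind, pages)
nrow(all_rows)
```

Growing the list only as pages arrive avoids having to know the page count up front; for a real API, fetch_page would wrap a request function such as query.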
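
On the question's original idea of keeping Firefox from loading images: if the browser route is still needed, one option is the Firefox preference permissions.default.image, where the value 2 blocks all images. It can be passed through the prefs entry of moz:firefoxOptions. The sketch below only builds the capability list. It has not been tested against this site, and handing it to rsDriver in the same way as the question's code is an assumption.

```r
# Capabilities for headless Firefox with image loading disabled.
# permissions.default.image: 1 = load images, 2 = block all images.
exCap <- list(
  "moz:firefoxOptions" = list(
    args  = list("--headless"),
    prefs = list("permissions.default.image" = 2L)
  )
)

# Assumed usage, mirroring the question's code (commented out so that the
# snippet runs without starting a browser; the port number is arbitrary):
# rD <- RSelenium::rsDriver(browser = "firefox", port = 4567L,
#                           verbose = FALSE, extraCapabilities = exCap)
```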
