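I am currently trying to scrape a website using a combination of RSelenium, rvest, and tidyverse.
The goal is to visit this website, click one of the links (for example, "Promo"), and then use rvest to extract the entire data table (for example, the cards and their graded prices).
I was able to extract the table without much trouble using the following code: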
library(RSelenium)
library(rvest)
library(tidyverse)
pokemon <- read_html("https://www.pricecharting.com/console/pokemon-promo")
price_table <- pokemon %>%
html_elements("#games_table") %>%
html_table()
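However, this has a couple of problems: 1) I cannot step through all the different card sets listed on the initial page (https://www.pricecharting.com/category/pokemon-cards), and 2) I cannot extract the whole table this way, only the rows that are loaded initially.
To get around these problems, I have been looking into RSelenium. What I decided to do is go to the initial page, click the link for a card set (for example, "Promo"), and then load the entire page. That workflow looks like this: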
## open driver
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
## navigate to primary page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
## click on the link I want
remDr$findElement(using = "link text", "Promo")$clickElement()
## find the table
table <- remDr$findElement(using = "id", "games_table")
## load the entire table
table$sendKeysToElement(list(key = "end"))
## get the entire source
full_table <- remDr$getPageSource()[[1]]
## read in the table
html_page <- read_html(full_table)
## Do the `rvest` technique I had above.
html_page %>%
html_elements("#games_table") %>%
html_table()
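However, my problem is that I again get the same 51 elements rather than the whole table.
I would like to know whether it is possible to combine my two techniques, and where my code goes wrong.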
Answer:
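I figured out the issue. There were two things going on. The first is that the page loads with the cursor inside the search bar. I got rid of that by clicking into the page body with remDr$findElement(using = "css", "body")$clickElement(). Next, as a helpful question/answer pointed out, if the scroll or arrow keys do not work with sendKeysToElement(list(key = "up_arrow")), you should try remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);").
So a small sample of my script looks like this: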
library(RSelenium)
library(rvest)
library(tidyverse)
## opens the driver
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
link_texts <- c("Base Set", "Promo", "Fossil")
## navigates to the correct page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
for (name in link_texts) {
  ## finds the link and clicks on it
  remDr$findElement(using = "link text", name)$clickElement()
  ## clicks on the page body to move focus out of the search bar
  remDr$findElement(using = "css", "body")$clickElement()
  ## finds the page body - this line may be extraneous
  table <- remDr$findElement(using = "css", "body")
  ## scrolls to the bottom of the page six times, pausing so more rows can load
  for (i in 1:6) {
    remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
    Sys.sleep(1)
  }
  ## get the entire page source that's been loaded
  html <- remDr$getPageSource()[[1]]
  ## read in the page source
  page <- read_html(html)
  ## build a variable name such as "base_set" from the link text
  data_name <- str_to_lower(str_replace_all(name, " ", "_"))
  ## extract the table and keep its first four columns
  df <- page %>%
    html_elements("#games_table") %>%
    html_table() %>%
    pluck(1) %>%
    select(1:4)
  ## store the data frame under the generated name
  assign(data_name, df)
  Sys.sleep(3)
  ## go back to the category page for the next link
  remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
}
## close driver
remDr$close()
rD$server$stop()
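The fixed number of scrolls above works when you know roughly how long the page is. As a rough sketch of an alternative, the scrolling could instead repeat until the page height stops growing. The helper name scroll_until_stable and its arguments are hypothetical and not part of RSelenium. It takes the "scroll" and "measure height" steps as plain functions, so the same loop can be tried without a browser.

```r
# Hypothetical helper (not an RSelenium function): keep calling `scroll_fn()`
# until the value returned by `height_fn()` stops changing, or until
# `max_rounds` attempts have been made.
scroll_until_stable <- function(scroll_fn, height_fn, max_rounds = 50, pause = 1) {
  last_height <- height_fn()
  for (round in seq_len(max_rounds)) {
    scroll_fn()
    Sys.sleep(pause)              # give the page time to load more rows
    new_height <- height_fn()
    if (new_height == last_height) {
      return(round)               # number of scroll attempts that were made
    }
    last_height <- new_height
  }
  max_rounds
}

# Assumed usage with the `remDr` client from the script above (not run here):
# scroll_until_stable(
#   scroll_fn = function() {
#     remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
#   },
#   height_fn = function() {
#     remDr$executeScript("return document.body.scrollHeight;")[[1]]
#   }
# )
```

One caveat with this approach: if the site takes longer than `pause` seconds to append new rows, the height can look unchanged and the loop stops early, so a longer pause may be needed.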
Answer:
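The page was not scrolling down because the cursor sits in the search bar by default. I made a few changes to your code so that it scrolls all the way down.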
#Launch browser
rD <- rsDriver(browser="firefox", port=9545L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
remDr$findElement(using = "link text", "Promo")$clickElement()
#clicking outside the search bar
remDr$findElement(using = "xpath", value = '//*[@id="console-page"]')$clickElement()
webElem <- remDr$findElement("css", "body")
#looping to get at the end of the page.
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
#extract table
full_table <- remDr$getPageSource()[[1]]
html_page <- read_html(full_table)
html_page %>%
html_elements("#games_table") %>%
html_table()
[[1]]
# A tibble: 888 x 5
Card Ungraded `Grade 9` `PSA 10` ``
<chr> <chr> <chr> <chr> <chr>
1 Mew #8 $3.99 $38.79 $75.62 " Collection\n In One Click\n ~
2 Mewtwo #3 $8.28 $65.91 $227.50 " Collection\n In One Click\n ~
3 Charizard GX #SM211 $7.85 $23.64 $53.50 " Collection\n In One Click\n ~
4 Charizard V #SWSH050 $8.00 $34.99 $79.98 " Collection\n In One Click\n ~
5 Pikachu #24 $138.31 $362.72 $2,919.69 " Collection\n In One Click\n ~
6 Entei #34 $8.50 $52.21 $153.63 " Collection\n In One Click\n ~
7 Ancient Mew $23.79 $99.99 $382.50 " Collection\n In One Click\n ~
8 Charizard EX #XY121 $27.16 $135.00 $727.00 " Collection\n In One Click\n ~
9 Mewtwo EX #XY107 $5.54 $77.50 $98.71 " Collection\n In One Click\n ~
10 Charizard GX #SM60 $28.57 $113.98 $492.00 " Collection\n In One Click\n ~
# ... with 878 more rows
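The table comes back with every price stored as text and with a fifth, unnamed column of button labels. A minimal sketch of a cleanup step follows, using a few rows copied from the output above. The helper parse_price and the column name extra are made up for illustration, and base R is used here instead of the tidyverse so that the snippet runs on its own.

```r
# Sample rows copied from the printed output; `extra` stands in for the
# unnamed fifth column.
price_table <- data.frame(
  Card      = c("Mew #8", "Pikachu #24", "Ancient Mew"),
  Ungraded  = c("$3.99", "$138.31", "$23.79"),
  `Grade 9` = c("$38.79", "$362.72", "$99.99"),
  `PSA 10`  = c("$75.62", "$2,919.69", "$382.50"),
  extra     = c("Collection", "Collection", "Collection"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip "$" and thousands separators, then convert.
parse_price <- function(x) as.numeric(gsub("[$,]", "", x))

# Keep the first four columns, as select(1:4) does in the first answer,
# then convert the three price columns to numbers.
clean <- price_table[, 1:4]
price_cols <- c("Ungraded", "Grade 9", "PSA 10")
clean[price_cols] <- lapply(clean[price_cols], parse_price)

str(clean)
```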
