R code to download all the PDFs offered on a website: web scraping

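I want to write R code that downloads all the PDFs linked from this URL: https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook of Statistics on Indian Economy and saves them all to a folder. I tried the following code, adapted with the help of https://towardsdatascience.com, but it throws errors.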

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx? 
head=Handbook of Statistics on Indian Economy") %>%

raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>%  # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://rbi.org.in", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.rbi.org.in", .) %>% # prepend the website again to get a full url
for (url in raw_list)
{ download.file(url, destfile = basename(url), mode = "wb") 
}

I can't work out why the code fails. I would appreciate it if someone could help.

Reply from a uj5u.com user:

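There are a few small mistakes. The site uses an upper-case PDF extension, and you don't need str_c("https://rbi.org.in", .). Finally, I think purrr's walk2 function is smoother (as it probably was in the original code).

I haven't run the code, because I don't need that many PDFs, so please report back whether it works.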

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook of Statistics on Indian Economy")

raw_list <- page %>% # takes the page above for which we've read the html
  html_nodes("a") %>%  # find all links in the page
  html_attr("href") %>% # get the url for these links
  str_subset("\\.PDF") %>% # keep only the links with the upper-case .PDF extension
  walk2(., basename(.), download.file, mode = "wb") # save each URL under its own file name
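The two fixes described in this answer (matching the extension regardless of case, and not prepending the host by hand) can be sketched without touching the network. This is only an illustration: the `hrefs` vector and `base_url` below are made-up stand-ins for what `html_attr("href")` might return, and resolving links with `xml2::url_absolute()` is a swapped-in technique, not something the answer itself uses.

```r
library(stringr)
library(xml2)

# Hypothetical hrefs, standing in for the output of html_attr("href")
hrefs <- c(
  "https://example.org/docs/REPORT1.PDF",
  "/docs/report2.pdf",
  "AnnualPublications.aspx?head=Other",
  NA
)
# Hypothetical page address, used only to resolve relative links
base_url <- "https://example.org/scripts/page.aspx"

# Drop <a> tags that had no href attribute
hrefs <- hrefs[!is.na(hrefs)]

# Keep links ending in .pdf, .PDF, .Pdf, ...
pdf_links <- str_subset(hrefs, regex("\\.pdf$", ignore_case = TRUE))

# Relative links are resolved against base_url; absolute ones pass through
pdf_urls <- url_absolute(pdf_links, base_url)
print(pdf_urls)
```

If the links on the real page are already absolute, as this answer suggests, the `url_absolute()` step leaves them unchanged.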

Reply from a uj5u.com user:

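When I tried to run your code, I got "verify you are human" and "please make sure your browser has Javascript enabled" dialogs. This suggests that you can't open the page with rvest and need RSelenium browser automation instead.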
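One way to notice this situation in a script is to check whether the parsed page contains any PDF links at all. The sketch below is an assumption-laden illustration: `has_pdf_links` is a hypothetical helper, and both HTML snippets are invented stand-ins, not the site's actual responses.

```r
library(rvest)
library(stringr)

# Hypothetical helper: TRUE when the parsed document has at least one PDF link
has_pdf_links <- function(doc) {
  hrefs <- html_attr(html_nodes(doc, "a"), "href")
  hrefs <- hrefs[!is.na(hrefs)]
  any(str_detect(hrefs, regex("\\.pdf$", ignore_case = TRUE)))
}

# Made-up snippets standing in for a challenge page and for the real page
challenge_page <- read_html(
  "<html><body><p>Please enable JavaScript</p></body></html>"
)
real_page <- read_html(
  '<html><body><a href="https://example.org/docs/REPORT1.PDF">Report</a></body></html>'
)

print(has_pdf_links(challenge_page))
print(has_pdf_links(real_page))
```

A FALSE result for a page that should list PDFs is a hint, not proof, that the server returned something other than the expected content.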

Here is a modified version that uses RSelenium:

library(tidyverse)
library(stringr)
library(purrr)
library(rvest)

library(RSelenium)

rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]

remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook of Statistics on Indian Economy")
page <- remDr$getPageSource()[[1]]
read_html(page) -> html

html %>%
html_nodes("a") %>%  
html_attr("href") %>% 
str_subset("\\.PDF") -> urls
urls %>% str_split(.,'/') %>% unlist() %>% str_subset("\\.PDF") -> filenames

for (u in seq_along(urls)) {
  cat(paste('downloading: ', u, ' of ', length(urls), '\n'))
  download.file(urls[u], filenames[u], mode = 'wb')
  Sys.sleep(1) # pause between requests
}
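The loop above stops at the first download that fails. A minimal sketch of a more defensive variant follows, assuming the URLs have already been collected into a character vector. `download_pdfs`, its `dest_dir`, `pause`, and `downloader` arguments are hypothetical names introduced here; the `downloader` argument exists only so the function can be tried without network access.

```r
# Hypothetical helper: download each URL into dest_dir, skip files that are
# already present, and record failures instead of aborting the whole run.
download_pdfs <- function(urls, dest_dir = ".", pause = 1,
                          downloader = utils::download.file) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    dest <- file.path(dest_dir, basename(urls[i]))
    if (file.exists(dest)) {
      ok[i] <- TRUE   # already downloaded on an earlier run
      next
    }
    message(sprintf("downloading %d of %d: %s", i, length(urls), urls[i]))
    ok[i] <- tryCatch({
      status <- downloader(urls[i], destfile = dest, mode = "wb", quiet = TRUE)
      isTRUE(status == 0)   # download.file returns 0 on success
    }, error = function(e) {
      message("  failed: ", conditionMessage(e))
      FALSE
    })
    Sys.sleep(pause)   # be polite to the server
  }
  invisible(data.frame(url = urls, ok = ok))
}
```

With the `urls` vector built above, a call such as `result <- download_pdfs(urls, dest_dir = "pdfs")` would return one row per URL, so the failed ones can be retried later.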

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/338797.html
