Polite web scraping in R with rvest

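I have code that scrapes a website, but after a certain number of requests I get a 403 Forbidden error. I understand there is an R package called polite that works out how to run a scrape according to the host's requirements, so that a 403 does not occur. I did my best to adapt it to my code, but I'm stuck. Any help would be much appreciated. Here is a reproducible example with just a few links: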

library(tidyverse)
library(httr) 
library(rvest)
library(curl)

urls <- c(
  "https://www.pro-football-reference.com/teams/pit/2021.htm",
  "https://www.pro-football-reference.com/teams/pit/2020.htm",
  "https://www.pro-football-reference.com/teams/pit/2019.htm"
)


pitt <- map_dfr(
  .x = urls,
  .f = function(x) {
    Sys.sleep(2)  # pause between requests
    cat(1)        # print a progress marker
    read_html(
      curl(x, handle = curl::new_handle("useragent" = "chrome"))
    ) %>%
      html_nodes("table") %>%
      html_table(header = TRUE) %>%
      simplify() %>%
      .[[2]] %>%  # keep the second table on the page
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names(.) %>%
      select(week, day, date, result = x_2, record = rec, opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      mutate(year = str_extract(string = x, pattern = "\\d{4}"))
  }
)

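This example should run without problems, but the full run covers every season from 1933 to 2021, not just the three links shown here. I'm open to solving this responsibly with the polite package, or in any other way that people with more experience would recommend.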

A uj5u.com user replied:

Here is my suggestion for how to use polite in this situation. The code builds a grid of teams and seasons and scrapes the data politely.

The parser is taken from your example.

library(magrittr)

# Create polite session
host <- "https://www.pro-football-reference.com/"
session <- polite::bow(host, force = TRUE)

# Create grid of teams and seasons that shall be scraped
seasons <- 2020:2021
teams <- c("pit", "nor")
grid_to_scrape <- tidyr::expand_grid(team = teams, season = seasons)
grid_to_scrape
#> # A tibble: 4 × 2
#>   team  season
#>   <chr>  <int>
#> 1 pit     2020
#> 2 pit     2021
#> 3 nor     2020
#> 4 nor     2021

responses <- purrr::pmap_dfr(grid_to_scrape, function(team, season, session){
  # For some verbose status updates
  cli::cli_process_start("Scrape {.val {team}}, {.val {season}}")
  # Create full url and scrape
  full_url <- polite::nod(session, glue::glue("teams/{team}/{season}.htm"))
  scrape <- polite::scrape(full_url)
  # Parse the response, suppress Janitor warnings. This is a problem of the parser
  suppressWarnings({
    response <- scrape %>% 
      rvest::html_elements("table") %>% 
      rvest::html_table(header = TRUE) %>% 
      purrr::simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>% 
      janitor::clean_names() %>% 
      dplyr::select(week, day, date, result = x_2, record = rec, opponent = opp, team_score = tm, opponent_score = opp_2) %>% 
      dplyr::mutate(year = season, team = team)
  })
  # Update status
  cli::cli_process_done()
  # return parsed data
  response
}, session = session)
#> ℹ Scrape "pit", 2020
#> ✔ Scrape "pit", 2020 ... done
#> 
#> ℹ Scrape "pit", 2021
#> ✔ Scrape "pit", 2021 ... done
#> 
#> ℹ Scrape "nor", 2020
#> ✔ Scrape "nor", 2020 ... done
#> 
#> ℹ Scrape "nor", 2021
#> ✔ Scrape "nor", 2021 ... done
#> 

responses
#> # A tibble: 77 × 10
#>    week  day   date       result record opponent team_score opponent_score  year
#>    <chr> <chr> <chr>      <chr>  <chr>  <chr>    <chr>      <chr>          <int>
#>  1 1     "Mon" "Septembe… "boxs… "1-0"  New Yor… "26"       "16"            2020
#>  2 2     "Sun" "Septembe… "boxs… "2-0"  Denver … "26"       "21"            2020
#>  3 3     "Sun" "Septembe… "boxs… "3-0"  Houston… "28"       "21"            2020
#>  4 4     ""    ""         ""     ""     Bye Week ""         ""              2020
#>  5 5     "Sun" "October … "boxs… "4-0"  Philade… "38"       "29"            2020
#>  6 6     "Sun" "October … "boxs… "5-0"  Clevela… "38"       "7"             2020
#>  7 7     "Sun" "October … "boxs… "6-0"  Tenness… "27"       "24"            2020
#>  8 8     "Sun" "November… "boxs… "7-0"  Baltimo… "28"       "24"            2020
#>  9 9     "Sun" "November… "boxs… "8-0"  Dallas … "24"       "19"            2020
#> 10 10    "Sun" "November… "boxs… "9-0"  Cincinn… "36"       "10"            2020
#> # … with 67 more rows, and 1 more variable: team <chr>
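
The run above covers only two teams and two seasons. For the full 1933-2021 range mentioned in the question, it may help to estimate how many requests the grid produces and how long a polite crawl would take at minimum. The sketch below assumes a delay of 5 seconds between requests, which is the default of the `delay` argument of `polite::bow()`; the site's robots.txt may require a longer delay, so treat the result as a lower bound. The names `full_grid`, `delay_seconds`, `n_requests`, and `min_minutes` are illustrative only.

```r
# Sketch: size of the full crawl for one team, seasons 1933 to 2021.
seasons <- 1933:2021
teams <- "pit"
full_grid <- tidyr::expand_grid(team = teams, season = seasons)

# Assumed delay between requests, in seconds (default of polite::bow()).
# The delay actually applied may be longer if robots.txt asks for it.
delay_seconds <- 5

# One request per row of the grid.
n_requests <- nrow(full_grid)

# Lower bound on the total waiting time, in minutes.
min_minutes <- n_requests * delay_seconds / 60
```

Adding more teams to `teams` multiplies the number of requests accordingly, so a full-league crawl is best left running unattended.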

Created on 2022-02-22 by the reprex package (v2.0.1)
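
On a long run, a single failing page (for example, a season with a different table layout) would abort the whole loop and discard everything collected so far. One possible way to guard against that is to wrap the per-page function in `purrr::possibly()`, which returns a fallback value instead of raising an error. In the sketch below, `fake_scrape()` is a hypothetical stand-in for the real nod/scrape/parse step so that the example runs without network access; the simulated failure for 2020 is made up for illustration.

```r
# Hypothetical stand-in for the real scrape-and-parse step.
# In practice this would call polite::nod(), polite::scrape() and the parser.
fake_scrape <- function(team, season) {
  if (season == 2020) stop("simulated failure")
  data.frame(team = team, year = season, stringsAsFactors = FALSE)
}

# possibly() returns `otherwise` (here NULL) when the wrapped function errors,
# so the remaining rows of the grid are still processed.
safe_scrape <- purrr::possibly(fake_scrape, otherwise = NULL, quiet = TRUE)

grid <- data.frame(team = "pit", season = 2019:2021, stringsAsFactors = FALSE)

# pmap() passes the columns of each row as the named arguments team and season.
results <- purrr::pmap(grid, safe_scrape)

# Rows that produced NULL can be retried or inspected later.
failed <- grid[purrr::map_lgl(results, is.null), ]

# Drop the NULL entries and bind the successful results into one data frame.
combined <- dplyr::bind_rows(purrr::compact(results))
```

Here `combined` holds the seasons that parsed cleanly and `failed` lists the grid rows that need another look, which avoids re-requesting pages that already succeeded.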
