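I have code that scrapes a website, but after a certain number of requests I get a 403 Forbidden error. I know there is an R package called polite that works out how to scrape according to the host's requirements, so that a 403 does not occur. I did my best to adapt it to my code, but I'm stuck. I would really appreciate some help. Here is some reproducible sample code with only a few links: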
library(tidyverse)
library(httr)
library(rvest)
library(curl)
urls = c("https://www.pro-football-reference.com/teams/pit/2021.htm",
         "https://www.pro-football-reference.com/teams/pit/2020.htm",
         "https://www.pro-football-reference.com/teams/pit/2019.htm")
pitt <- map_dfr(
  .x = urls,
  .f = function(x) {
    Sys.sleep(2)
    cat(1)
    read_html(
      curl(x, handle = curl::new_handle("useragent" = "chrome"))
    ) %>%
      html_nodes("table") %>%
      html_table(header = TRUE) %>%
      simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names(.) %>%
      select(week, day, date, result = x_2, record = rec, opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      mutate(year = str_extract(string = x, pattern = "\\d{4}"))
  }
)
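This sample should run without problems, but the full run covers every season from 1933 to 2021, not just the three links in the example. I'm open to using the polite package, or any other approach that experts are more familiar with, to handle this responsibly.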
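For reference, the full list of links could be generated rather than typed out. This is only a minimal sketch: it assumes every season page from 1933 to 2021 follows the same URL pattern as the three sample links, which has not been checked here, and the names `years` and `all_urls` are just illustrative.

```r
# Assumption: each season page uses the same pattern as the sample links above.
years <- 1933:2021

# sprintf() is vectorised, so this yields one URL per season.
all_urls <- sprintf(
  "https://www.pro-football-reference.com/teams/pit/%d.htm",
  years
)

# Quick sanity check before starting a long scrape.
length(all_urls)
head(all_urls, 3)
```

The resulting vector can be passed to `map_dfr()` in place of `urls`.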
Answer:
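Here is my suggestion for how to use polite in this situation. The code builds a grid of teams and seasons and scrapes the data politely.
The parser is taken from your example.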
library(magrittr)
# Create polite session
host <- "https://www.pro-football-reference.com/"
session <- polite::bow(host, force = TRUE)
# Create grid of teams and seasons that shall be scraped
seasons <- 2020:2021
teams <- c("pit", "nor")
grid_to_scrape <- tidyr::expand_grid(team = teams, season = seasons)
grid_to_scrape
#> # A tibble: 4 × 2
#>   team  season
#>   <chr>  <int>
#> 1 pit     2020
#> 2 pit     2021
#> 3 nor     2020
#> 4 nor     2021
responses <- purrr::pmap_dfr(grid_to_scrape, function(team, season, session){
  # For some verbose status updates
  cli::cli_process_start("Scrape {.val {team}}, {.val {season}}")
  # Create full url and scrape
  full_url <- polite::nod(session, glue::glue("teams/{team}/{season}.htm"))
  scrape <- polite::scrape(full_url)
  # Parse the response, suppress Janitor warnings. This is a problem of the parser
  suppressWarnings({
    response <- scrape %>%
      rvest::html_elements("table") %>%
      rvest::html_table(header = TRUE) %>%
      purrr::simplify() %>%
      .[[2]] %>%
      janitor::row_to_names(row_number = 1) %>%
      janitor::clean_names() %>%
      dplyr::select(week, day, date, result = x_2, record = rec, opponent = opp, team_score = tm, opponent_score = opp_2) %>%
      dplyr::mutate(year = season, team = team)
  })
  # Update status
  cli::cli_process_done()
  # return parsed data
  response
}, session = session)
#> ℹ Scrape "pit", 2020
#> ✔ Scrape "pit", 2020 ... done
#>
#> ℹ Scrape "pit", 2021
#> ✔ Scrape "pit", 2021 ... done
#>
#> ℹ Scrape "nor", 2020
#> ✔ Scrape "nor", 2020 ... done
#>
#> ℹ Scrape "nor", 2021
#> ✔ Scrape "nor", 2021 ... done
#>
responses
#> # A tibble: 77 × 10
#>    week  day   date       result record opponent team_score opponent_score  year
#>    <chr> <chr> <chr>      <chr>  <chr>  <chr>    <chr>      <chr>          <int>
#>  1 1     "Mon" "Septembe… "boxs… "1-0"  New Yor… "26"       "16"            2020
#>  2 2     "Sun" "Septembe… "boxs… "2-0"  Denver … "26"       "21"            2020
#>  3 3     "Sun" "Septembe… "boxs… "3-0"  Houston… "28"       "21"            2020
#>  4 4     ""    ""         ""     ""     Bye Week ""         ""              2020
#>  5 5     "Sun" "October … "boxs… "4-0"  Philade… "38"       "29"            2020
#>  6 6     "Sun" "October … "boxs… "5-0"  Clevela… "38"       "7"             2020
#>  7 7     "Sun" "October … "boxs… "6-0"  Tenness… "27"       "24"            2020
#>  8 8     "Sun" "November… "boxs… "7-0"  Baltimo… "28"       "24"            2020
#>  9 9     "Sun" "November… "boxs… "8-0"  Dallas … "24"       "19"            2020
#> 10 10    "Sun" "November… "boxs… "9-0"  Cincinn… "36"       "10"            2020
#> # … with 67 more rows, and 1 more variable: team <chr>
Created on 2022-02-22 by the reprex package (v2.0.1)
