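Long-time lurker, first-time poster. I'm new to web scraping and to R, and my code is mostly pieced together from Stack Overflow and YouTube, so I'm hoping someone can help me with a problem I've run into. Thanks in advance.
Lately I've been practicing scraping links. For the Union of Concerned Scientists blog posts this went swell, see below (apologies for any inefficiency, I'm new).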
library(rvest)
library(dplyr)
library(readr)
library(stringr)
UCS_blog_links = data.frame()
# Loop over the first three listing pages of the blog
for (page_result in seq(from = 1, to = 3, by = 1)) {
  link = paste0("https://blog.ucsusa.org/page/", page_result)
  page = read_html(link)
  # Collect the href of every post thumbnail on this page
  url_links = page %>%
    html_nodes(".post-thumbnail") %>%
    html_attr("href")
  # Append to the running result and drop duplicate links
  UCS_blog_links = rbind(UCS_blog_links, data.frame(url_links, stringsAsFactors = FALSE)) %>%
    distinct()
  print(paste("Page:", page_result))
}
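But when I try the same approach on the Union of Concerned Scientists press releases, the links aren't on the main page; they sit "behind" .dialog-off-canvas-main-canvas. So I'm wondering whether anyone has tips for modifying my code so that it first goes into the .dialog-off-canvas-main-canvas node and then scrapes the links, or whether a different approach is needed.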
Answer:
We can get the links like this:
url = 'https://www.ucsusa.org/about/news/press-releases' %>% read_html() %>% html_nodes('.view-content') %>% html_nodes('a') %>% html_attr('href')
url = unique(url)
[1] "/about/news/experts-tell-epa-follow-science-protect-communities-ethylene-oxide"
[2] "/about/news/new-sec-rule-vital-transparent-accounting-mounting-climate-risks-businesses-protecting-1"
[3] "/about/news/union-concerned-scientists-applauds-repeal-trump-era-agency-action-scrapping-californias"
[4] "/about/news/proposed-epa-truck-pollution-standard-falls-short-whats-needed-healthier-safer-future"
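
The hrefs returned above are site-relative, and the question asked about going "into" a wrapper node first. Here is a minimal sketch of both points, run against a small inline HTML string so it works offline. The markup and the example-release paths are made up for illustration; only the class names and the site root come from this thread. It assumes a descendant CSS selector is enough to reach links nested inside the wrapper.

```r
library(rvest)

# Hypothetical stand-in for the press-release page (not the real markup),
# so the sketch runs without a network connection.
html <- '
<div class="dialog-off-canvas-main-canvas">
  <div class="view-content">
    <a href="/about/news/example-release-one">One</a>
    <a href="/about/news/example-release-two">Two</a>
    <a href="/about/news/example-release-one">One again</a>
  </div>
</div>
<div class="site-footer"><a href="/donate">Donate</a></div>'

page <- read_html(html)

# A descendant selector reaches links nested inside the wrapper node,
# so there is no separate "enter the node" step. Links outside
# .view-content (the footer link here) are not matched.
rel_links <- page %>%
  html_nodes(".dialog-off-canvas-main-canvas .view-content a") %>%
  html_attr("href") %>%
  unique()

# The hrefs are site-relative, so prepend the site root to get full URLs.
base_url <- "https://www.ucsusa.org"
full_links <- paste0(base_url, rel_links)
print(full_links)
```

Chaining html_nodes('.view-content') %>% html_nodes('a'), as in the answer, should select the same nodes as the single descendant selector used here. For relative links that don't start with a slash, xml2::url_absolute() is probably a safer choice than paste0().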
