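I am using rvest to scrape an IMDB list and want to access the full cast and crew list for each title. Unfortunately, the title links in the list point to IMDB's summary page for each title, which is not the page I need.
This is the page I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt
This is the page I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl
Note the /fullcredits added to the URL.
How can I insert /fullcredits into the middle of the URLs I build?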
#install.packages("rvest")
#install.packages("dplyr")
library(rvest) #webscraping package
library(dplyr) #piping
link = "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
credits = "fullcredits/"
page = read_html(link)
name <- page %>% rvest::html_nodes(".lister-item-header a") %>% rvest::html_text()
movie_link = page %>% rvest::html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste("https://www.imdb.com", ., sep="")
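A uj5u.com user replied:
Here is one option: take the dirname and basename of each link, replace the substring "ttls_li_tt" in the basename with the new substring ("tt_ql_cl"), and then join the pieces again with file.path, inserting "fullcredits" in between.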
library(stringr)
movie_link2 <- file.path(dirname(movie_link), "fullcredits",
                         str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl"))
Output:
> head(movie_link2)
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
> tail(movie_link2)
[1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
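The same rewrite can also be sketched with a single base R regular expression, which avoids treating the URL as a file path. This is only an illustrative alternative, not part of the answer above: the two sample links are hypothetical stand-ins shaped like the movie_link values built in the question, and the name movie_link3 is made up for this example.

```r
# Hypothetical sample links in the same shape as movie_link from the question
movie_link <- c(
  "https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt",
  "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"
)

# Capture everything up to and including the title ID and its trailing slash,
# then append "fullcredits/" plus the new ref tag in place of the old query string.
# Links that do not match the pattern are returned unchanged by sub().
movie_link3 <- sub(
  "^(.*/title/tt[0-9]+/).*$",
  "\\1fullcredits/?ref_=tt_ql_cl",
  movie_link
)

print(movie_link3)
```

Because the pattern anchors on the /title/tt<digits>/ segment, it does not depend on what the old ref value was, whereas the file.path approach replaces the specific string "ttls_li_tt".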
