library(rvest)
imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
title <- imdb_page %>% html_nodes(".lister-item-header a") %>% html_text()
rating <- imdb_page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
movies <- data.frame(title)
movies2 <- data.frame(rating)
基本上,上面的代碼用于抓取 50 部電影的標題和評分。我希望也將缺失值作為 NA 包含在內。
但是,這不會發生,因為 IMDB 沒有將它們包含在 HTML 標記中,而 HTML 標記僅存在實際值(我已用于SelectorGadget獲取標記)。因此,標題的觀察計數為 50,評分僅為 33,這不是我想要的。我一直在使用html_node()連同嘗試html_nodes(),但R已適時提供一個錯誤,說不能使用css和xpath在一起。我也試過trim=TRUE 和replace( !nzchar(.), NA) 但它們也不起作用。
有沒有辦法解決這個問題并確保我獲得 50 個評分(包括 NA 或空值)?
uj5u.com熱心網友回復:
您需要分 2 個步驟執行此決議。首先收集所有 50 部電影的父節點html_nodes()。然后,您使用html_node()(不帶 s)決議此節點集合以獲得所有 50 個節點的結果,包括缺少該屬性的節點。
library(rvest)
library(dplyr)
imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
#get the parent node of the each movie
movies <- imdb_page %>% html_elements( "div.lister-item")
#now parse each movie node for the desired subnode
title <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text()
注意從rvest 1.0html_node(s)到html_element(s)當前樣式的更新
uj5u.com熱心網友回復:
我們可以ratings-user-rating用來獲取整個評分串列,

library(rvest)
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv"
url %>% read_html() %>% html_nodes('.ratings-user-rating') %>% html_text2()
[1] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X "
[4] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X "
[7] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[10] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[13] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.5/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X "
[16] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.7/10 X "
[19] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[22] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.7/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[25] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.3/10 X "
[28] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[31] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[34] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 8.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.1/10 X "
[37] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X "
[40] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.2/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[43] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "
[46] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X "
[49] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 1.5/10 X "
我們還需要清理資料以獲得評級。
df %>% gsub(".*9 10", "", .) %>% str_sub(start=1, end=-7) %>% str_replace_all('-', replacement = NA_character_)
[1] NA NA " 3.6" NA " 4.9" " 4.1" " 7.4" " 4.6" NA " 7.9" " 3.3" NA " 6.5" " 6.6" " 5" " 3.6" NA " 4.7" " 3.1" " 5.4" NA " 5.7"
[23] " 5.1" NA NA " 6.9" " 4.3" " 6.6" NA NA " 4.1" " 4.6" NA " 5.1" " 8.3" " 7.1" " 5.8" " 3.4" " 3.3" " 3.2" NA NA NA " 4.6"
[45] NA " 6.8" NA " 6.6" " 4.4" " 1.5"
獲取電影名稱,
movie = url %>% read_html() %>% html_nodes(".lister-item-header a") %>% html_text()
data.frame(Movie = movie, ratings = df)
Movie Ratings
1 #1915House <NA>
2 #Bodygoals <NA>
3 #Followme 3.6
4 #FullMethod <NA>
5 #Like 4.9
6 #SquadGoals 4.1
轉載請註明出處,本文鏈接:https://www.uj5u.com/ruanti/392225.html
