我們如何從R中的IMDB中抓取缺失值？-有解無憂

library(rvest)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
title <- imdb_page %>% html_nodes(".lister-item-header a") %>% html_text()
rating <- imdb_page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
movies <- data.frame(title)
movies2 <- data.frame(rating)

基本上，上面的代碼用于抓取 50 部電影的標題和評分。我希望也將缺失值作為 NA 包含在內。

但是，這不會發生，因為 IMDB 沒有將它們包含在 HTML 標記中，而 HTML 標記僅存在實際值（我已用于SelectorGadget獲取標記）。因此，標題的觀察計數為 50，評分僅為 33，這不是我想要的。我一直在使用html_node（）連同嘗試html_nodes()，但R已適時提供一個錯誤，說不能使用css和xpath在一起。我也試過trim=TRUE 和replace( !nzchar(.), NA) 但它們也不起作用。

有沒有辦法解決這個問題并確保我獲得 50 個評分（包括 NA 或空值）？

uj5u.com熱心網友回復：

您需要分 2 個步驟執行此決議。首先收集所有 50 部電影的父節點html_nodes()。然后，您使用html_node()（不帶 s）決議此節點集合以獲得所有 50 個節點的結果，包括缺少該屬性的節點。

library(rvest)
library(dplyr)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")

#get the parent node of the each movie
movies <- imdb_page %>% html_elements( "div.lister-item")

#now parse each movie node for the desired subnode
title <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text()

注意從rvest 1.0html_node(s)到html_element(s)當前樣式的更新

uj5u.com熱心網友回復：

我們可以ratings-user-rating用來獲取整個評分串列，

我們如何從 R 中的 IMDB 中抓取缺失值？

library(rvest)
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv" 

url %>% read_html() %>% html_nodes('.ratings-user-rating') %>% html_text2()

 [1] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X "
 [4] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X "
 [7] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[10] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[13] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.5/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X "  
[16] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.7/10 X "
[19] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[22] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.7/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[25] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.3/10 X "
[28] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[31] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[34] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 8.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.1/10 X "
[37] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X "
[40] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.2/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[43] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[46] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X "
[49] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 1.5/10 X "

我們還需要清理資料以獲得評級。

df %>% gsub(".*9 10", "", .) %>% str_sub(start=1, end=-7) %>% str_replace_all('-', replacement = NA_character_)

 [1] NA     NA     " 3.6" NA     " 4.9" " 4.1" " 7.4" " 4.6" NA     " 7.9" " 3.3" NA     " 6.5" " 6.6" " 5"   " 3.6" NA     " 4.7" " 3.1" " 5.4" NA     " 5.7"
[23] " 5.1" NA     NA     " 6.9" " 4.3" " 6.6" NA     NA     " 4.1" " 4.6" NA     " 5.1" " 8.3" " 7.1" " 5.8" " 3.4" " 3.3" " 3.2" NA     NA     NA     " 4.6"
[45] NA     " 6.8" NA     " 6.6" " 4.4" " 1.5"

獲取電影名稱，

movie = url %>% read_html() %>%  html_nodes(".lister-item-header a") %>% html_text()

data.frame(Movie = movie, ratings = df)
                                                    Movie Ratings
1                                              #1915House    <NA>
2                                              #Bodygoals    <NA>
3                                               #Followme     3.6
4                                             #FullMethod    <NA>
5                                                   #Like     4.9
6                                             #SquadGoals     4.1

轉載請註明出處，本文鏈接：https://www.uj5u.com/ruanti/392225.html

標籤：html r 网页抓取背心数据库

上一篇：為什么這個htmljavascript檔案不起作用，我怎樣才能讓用戶輸入不是正數？

下一篇：Django錯誤的檔案路徑/重定向和傳遞資料