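I'm new to web scraping in R and need help with this task. I'm trying to scrape data from a specific web page, but I'm stuck at one particular point.
Here is the URL: web page
Basically, I'm trying to capture three elements from the page:
(1) Room type (CSS selector: .room h3)
(2) Meal plan (CSS selector: .meal-plan-title)
(3) Price (CSS selector: .price)
I have been able to extract these values from the page. However, I'm struggling to match them up the way they are displayed on the page.
Here is where my R code stands: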
library(rvest)
library(dplyr)
library(stringr)
library(tables)
MealPlan <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
#html_nodes(".meal-plan-text") %>%
html_nodes(".meal-plan-title") %>%
html_text()
MealPlan
Price <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
html_nodes(".price") %>%
html_text()
Price
RoomType <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=") %>%
html_nodes(".room h3") %>%
html_text()
RoomType
I would like the output in a data frame like this:
RoomType          MealPlan          Price
Chambre Standard  Petit Dej. Diner  584 € / pers
Chambre Standard  All inclusive     864 € / pers
Chambre Confort   Petit Dej. Diner  715 € / pers
Chambre Confort   All inclusive     995 € / pers
Bungalow          Petit Dej. Diner  781 € / pers
Bungalow          All inclusive     1061 € / pers
Chambre Deluxe    Petit Dej. Diner  847 € / pers
Chambre Deluxe    All inclusive     1127 € / pers
Any help would be greatly appreciated.
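uj5u.com user reply:
A slightly late answer. I added the trim = TRUE argument to html_text to strip extra whitespace.
One problem with MealPlan is that several entries carry the class .noprice. Another way to exclude them would be to use xpath in html_nodes instead of a CSS selector. I don't know whether there is a way to do this with CSS selectors. What I did below is extract both sets and then take the set difference between them.
For the price, I used a regular expression to remove the extra space inside the number.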
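To make the price clean-up concrete, here is a minimal sketch of what that regular expression does. The sample strings are assumptions modelled on the desired output in the question, not values copied from the live page.

```r
library(stringr)

# Assumed sample values; "\u20ac" is the euro sign
raw_prices <- c("584 \u20ac / pers", "1 061 \u20ac / pers", "1 127 \u20ac / pers")

# Remove the first whitespace character that sits between two digits
clean <- str_replace(raw_prices, "(\\d)\\s(\\d)", "\\1\\2")

# Optionally pull out the leading number for arithmetic
amount <- as.numeric(str_extract(clean, "\\d+"))
```

Note that str_replace only replaces the first match in each string, which is enough for one thousands separator. For amounts with more than one separator, str_replace_all would be the safer choice.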
library(rvest)
library(dplyr)
library(stringr)
url <- "https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]="
# Read the page once and reuse the parsed document
page <- read_html(url)
# Prices: trim whitespace, then drop the space between digits
Price <- page %>%
  html_nodes(".price") %>%
  html_text(trim = TRUE) %>%
  str_replace("(\\d)\\s(\\d)", "\\1\\2")
RoomType <- page %>%
  html_nodes(".room h3") %>%
  html_text(trim = TRUE)
# All meal plan labels, including those shown without a price
AllMealPlans <- page %>%
  html_nodes(".meal-plan-text") %>%
  html_text(trim = TRUE)
# Meal plan labels that sit inside a .noprice element
MealPlansNoPrice <- page %>%
  html_nodes(".noprice .meal-plan-text") %>%
  html_text(trim = TRUE)
# setdiff() also removes duplicates, so each priced meal plan is kept once
MealPlan <- setdiff(AllMealPlans, MealPlansNoPrice)
NumberMealPlans <- length(MealPlan)
NumberRoomTypes <- length(RoomType)
# Assumes every room offers every priced meal plan, in the same order
MealPlanColumn <- rep(MealPlan, times = NumberRoomTypes)
RoomTypeColumn <- rep(RoomType, each = NumberMealPlans)
bind_cols(RoomType = RoomTypeColumn, MealPlan = MealPlanColumn, Price = Price)
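As an alternative to the set-difference step, the .noprice entries can probably be excluded in the selector itself. The sketch below runs against a small inline HTML snippet. The class names follow the question, but the nesting (a hypothetical .plan wrapper that may also carry .noprice) is an assumption, as is the has_class helper, so the selector would need adjusting to the real markup.

```r
library(rvest)

# Hypothetical markup; only the class names come from the question
html <- '
<div class="room">
  <h3>Chambre Standard</h3>
  <div class="plan"><h4 class="meal-plan-text">Petit Dej. Diner</h4></div>
  <div class="plan noprice"><h4 class="meal-plan-text">Demi-pension</h4></div>
  <div class="plan"><h4 class="meal-plan-text">All inclusive</h4></div>
</div>'
page <- minimal_html(html)

# CSS: :not() is attached to the wrapper that would carry .noprice
css_plans <- page %>%
  html_elements(".plan:not(.noprice) .meal-plan-text") %>%
  html_text(trim = TRUE)

# XPath: hypothetical helper that tests for one class in a class list
has_class <- function(cls) {
  sprintf("contains(concat(' ', normalize-space(@class), ' '), ' %s ')", cls)
}
xp <- sprintf("//*[%s][not(ancestor::*[%s])]",
              has_class("meal-plan-text"), has_class("noprice"))
xpath_plans <- page %>%
  html_elements(xpath = xp) %>%
  html_text(trim = TRUE)
```

The :not() has to be attached to the specific wrapper element. A bare ":not(.noprice) .meal-plan-text" would match everything, because some other ancestor (for example body) never has the .noprice class.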
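uj5u.com user reply:
You can use map_dfr from purrr to build a wide data frame with a separate column for each meal plan, then use pivot_longer to gather those columns into a MealPlan column, with the price information in the values column. The initial list you pass to map_dfr is the set of parent elements representing each room listing, collected with the CSS selector .room.
All rooms at the provided URL have the same combination of price entries, namely Petit déj. diner and All inclusive. To cater for anything else on other pages, you would need to identify all the cases up front, or first gather every .room element from all pages into one list and then use something like read.dcf to map out all possible cases and enter NA where a value is missing for a given listing. You would need to make sure you insert ":" for the key:value pairing that the Debian control file format requires.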
library(rvest)
library(purrr)
library(dplyr)
library(tidyr)
page <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=")
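df <- map_dfr(page |> html_elements(".room"), ~
  data.frame(
    RoomType = .x |> html_element("h3") |> html_text(),
    # First .price inside the room, taken to be the Petit Dej. Diner price
    `Petit Dej. Diner` = .x |> html_element(".price") |> html_text() |> trimws(),
    # Relies on the All inclusive price sitting in the 5th child div
    `All inclusive` = .x |> html_element("div:nth-child(5) .price") |> html_text() |> trimws(),
    check.names = FALSE # keep the column names with spaces as written
  )) |>
  pivot_longer(!RoomType, names_to = "MealPlan", values_to = "Price")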
For older R versions (before 4.1, which lack the native |> pipe):
library(rvest)
library(purrr)
library(dplyr)
library(tidyr)
page <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=")
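df <- map_dfr(page %>% html_elements(".room"), ~
  data.frame(
    RoomType = .x %>% html_element("h3") %>% html_text(),
    # First .price inside the room, taken to be the Petit Dej. Diner price
    `Petit Dej. Diner` = .x %>% html_element(".price") %>% html_text() %>% trimws(),
    # Relies on the All inclusive price sitting in the 5th child div
    `All inclusive` = .x %>% html_element("div:nth-child(5) .price") %>% html_text() %>% trimws(),
    check.names = FALSE # keep the column names with spaces as written
  )) %>%
  pivot_longer(!RoomType, names_to = "MealPlan", values_to = "Price")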
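Because a positional selector such as div:nth-child(5) depends on the exact page layout, it may help to see a variant that pairs each meal plan label with the price in the same row. This is a sketch under assumptions, not the method used above: the inline HTML and its .plan wrapper are hypothetical, and base R lapply with rbind stands in for map_dfr and pivot_longer to keep the example dependency-light.

```r
library(rvest)

# Hypothetical markup; class names follow the question, nesting is assumed
html <- '
<div class="room">
  <h3>Chambre Standard</h3>
  <div class="plan"><h4 class="meal-plan-text">Petit Dej. Diner</h4><span class="price">584</span></div>
  <div class="plan"><h4 class="meal-plan-text">All inclusive</h4><span class="price">864</span></div>
</div>
<div class="room">
  <h3>Bungalow</h3>
  <div class="plan noprice"><h4 class="meal-plan-text">Petit Dej. Diner</h4></div>
  <div class="plan"><h4 class="meal-plan-text">All inclusive</h4><span class="price">1061</span></div>
</div>'
page <- minimal_html(html)

rooms <- html_elements(page, ".room")

# One small long-format data frame per room, built from its priced rows only
per_room <- lapply(rooms, function(room) {
  priced <- html_elements(room, ".plan:not(.noprice)")
  room_name <- html_text(html_element(room, "h3"), trim = TRUE)
  data.frame(
    RoomType = rep(room_name, length(priced)),
    MealPlan = html_text(html_element(priced, ".meal-plan-text"), trim = TRUE),
    Price    = html_text(html_element(priced, ".price"), trim = TRUE),
    stringsAsFactors = FALSE
  )
})

listings <- do.call(rbind, per_room)
```

Since label and price are read from the same row, a room that lacks one of the meal plans simply contributes fewer rows, and no reshaping step is needed afterwards.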
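Example of using read.dcf to handle differing price listings.
For read.dcf, I adapted the method used by @akrun in their answer here, where read.dcf is used to map out all the meal plans that currently have prices and to put NA where a meal plan does not exist for a given entry. For the xpath, I used the example given by @tomalak in their answer here.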
library(tidyverse)
library(rvest)
urls <- c(
"https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea beach&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]=",
"https://www.hotelissima.fr/s/h/ile-maurice/bel-ombre/hotel-outrigger-mauritius.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=hotel outrigger mauritius&startdate=08/11/2021&stopdate=15/11/2021&duration=7&travelers=En couple&travelType=&rooms[0].nbAdults=2&rooms[0].nbChilds=0&rooms[0].birthdates[0]=&rooms[0].birthdates[1]=&rooms[0].birthdates[2]=&rooms[0].birthdates[3]=&rooms[0].birthdates[4]="
)
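# Collect every .room node from all pages into one flat list
entries <- purrr::map(urls, ~ read_html(.x) |> html_elements(".room")) |>
  unlist(recursive = FALSE)
meal_df <- map_dfr(entries, ~ {
  prices <- .x |>
    html_elements(".price") |>
    html_text(trim = TRUE)
  # For each price, walk up to the nearest ancestor div whose class
  # contains 'row' and read the meal plan heading inside it
  meal_text <- .x |>
    html_elements(".price") |>
    html_elements(xpath = "./ancestor::div[contains(concat(' ', @class, ' '), 'row')][1]//h4[@class='meal-plan-text']") |>
    html_text(trim = TRUE)
  # Build "key:value" lines, the shape read.dcf expects
  new <- paste(meal_text, prices, sep = ":")
  if (length(new) > 0) {
    as.data.frame(read.dcf(textConnection(new)))
  } else {
    NULL
  }
})
df <- map_df(entries, ~
  data.frame(
    RoomType = .x |> html_element("h3") |> html_text()
  ))
# A room without prices returns NULL above and adds no row to meal_df,
# so this cbind assumes every room has at least one price
listings <- cbind(df, meal_df)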
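To show in isolation how read.dcf fills in missing fields, here is a small offline sketch. The records are made-up values in the spirit of the listings above. Putting RoomType into each record is an illustrative choice, not part of the answer's code; it keeps the room name and its prices in the same row.

```r
# Made-up "key: value" records; a blank line separates one room from the next
records <- c(
  "RoomType: Chambre Standard",
  "Petit Dej. Diner: 584",
  "All inclusive: 864",
  "",
  "RoomType: Suite",
  "All inclusive: 1127"
)

con <- textConnection(records)
m <- read.dcf(con)   # character matrix, one row per record
close(con)

listings <- as.data.frame(m, stringsAsFactors = FALSE)
```

The second record has no Petit Dej. Diner line, so read.dcf puts NA in that column for the Suite row. A label that itself contains ":" would be split at the first colon, which is worth checking before relying on this format.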
