Web scraping in R: no applicable method for 'read_xml' applied to an object of class "list"

I am working with this website (the full URL is the one assigned to `url` in the code below).

For example, by manually inspecting the <div class = "cardcon"> section, I found the following links, which are the ones I need:

# desired results
- https://www.realtor.ca/real-estate/25050003/lot-1-norcross-rd-duncan-west-duncan
- https://www.realtor.ca/real-estate/25050002/39-legacy-lane-hamilton-ancaster
- https://www.realtor.ca/real-estate/25049996/53-16-fourth-st-orangeville-orangeville
- etc.

I noticed that all of these desired links sit inside the following kind of HTML structure: <a href="*****INSERT LINK HERE****" data-binding="href=DetailsURL" target="_blank">

My question: using the R programming language, is it possible to save every link on this page that is contained in this <a href = .... target="_blank"> structure?

For example, I tried this code:

library(rvest)
library(httr)
library(XML)

url<-"https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434,-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"

# making http request
resource <- GET(url)

# converting all the data to HTML format
parse <- htmlParse(resource)

# scraping all the href attributes
links <- xpathSApply(parse, path="//a", xmlGetAttr, "href")

page <-read_html(links)

Error in UseMethod("read_xml") : 
  no applicable method for 'read_xml' applied to an object of class "list"

But I am not sure how to finish this.

Can someone tell me what to do next?

Thanks!

Answer:
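
On the error itself: `xpathSApply()` returns one `href` value per `<a>` node, and because `xmlGetAttr()` gives `NULL` for anchors without an `href`, the result typically cannot be simplified and comes back as a list. `read_html()` expects a single document (a URL, a file path, or an HTML string), so it has no method for a list. The sketch below shows the extraction with rvest alone. It runs on a small hand-written HTML string, which is a hypothetical stand-in modelled on the structure described in the question, not the live page. The live map page appears to build its listing cards with JavaScript, so the HTML returned by a plain `GET()` may not contain these anchors at all.

```r
library(rvest)

# Hypothetical snippet imitating the anchor structure from the question;
# the paths are made up for illustration.
html <- '
<div class="cardcon">
  <a href="/real-estate/1/example-a" data-binding="href=DetailsURL" target="_blank">A</a>
  <a href="/real-estate/2/example-b" data-binding="href=DetailsURL" target="_blank">B</a>
  <a name="no-href">C</a>
</div>'

# read_html() takes ONE document (URL, path, or string), not a list of links
page <- read_html(html)

# Keep only the anchors that carry the data-binding attribute of interest
hrefs <- page %>%
  html_elements(xpath = "//a[@data-binding='href=DetailsURL']") %>%
  html_attr("href")

# Turn the relative paths into absolute links
full_links <- paste0("https://www.realtor.ca", hrefs)
full_links
```

Selecting on the `data-binding` attribute rather than on every `<a>` avoids anchors without an `href`, so the result is a plain character vector.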

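Even though the search parameters are URL-encoded in the link, it is better to call their API directly. Open the Network tab in your browser's developer tools and you will find the request the page sends to its search endpoint (the URL used in the code below).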

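The parameters encoded in your URL can be found in the Payload tab of that request. With httr2 you can retrieve the same information that the website does.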
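
To see which parameters the URL from the question carries, the part after `#` can be split into name/value pairs. This is a minimal base R sketch; it assumes that no value contains `&` or `=`, which holds for this URL. The variable names (`fragment`, `pairs`, `params`) are only illustrative.

```r
url <- "https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434,-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"

# Everything after the first '#' is the fragment holding the search parameters
fragment <- sub("^[^#]*#", "", url)

# Split into "name=value" pieces, then into name and value
pairs <- strsplit(strsplit(fragment, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Build a named list; URLdecode() undoes any percent-encoding in the values
params <- setNames(
  lapply(pairs, function(p) utils::URLdecode(p[2])),
  vapply(pairs, function(p) p[1], character(1))
)

str(params)
```

The names in `params` line up with most of the fields passed to `req_body_form()`; the remaining fields in the request below, such as `CurrentPage` and `RecordsPerPage`, are not part of this URL.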

library(tidyverse)
library(httr2)

content <- "https://api2.realtor.ca/Listing.svc/PropertySearch_Post" %>%
  request() %>%
  req_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36") %>% 
  req_body_form(
    ZoomLevel = 4,
    LatitudeMax = '67.16743',
    LongitudeMax = '-56.40166',
    LatitudeMin = '-5.70993',
    LongitudeMin = '-139.10674',
    CurrentPage = 2,
    Sort = '6-D',
    PropertyTypeGroupID = 1,
    PropertySearchTypeId = 1,
    TransactionTypeId = 2,
    Currency = 'CAD',
    RecordsPerPage = 36,
    ApplicationId = 1,
    CultureId = 1,
    Version = '7.0'
  ) %>%
  req_headers('referer' = 'https://www.realtor.ca/') %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE)

content %>%
  getElement('Results') %>%
  as_tibble()

# A tibble: 12 x 21
   Id       MlsNum~1 Publi~2 Build~3 Indiv~4 Prope~5 Busin~6 Land$~7 Posta~8 Relat~9 Statu~*
   <chr>    <chr>    <chr>   <chr>   <list>  <chr>   <df[,0> <chr>   <chr>   <chr>   <chr>  
 1 25049990 W5821020 "Rare ~ 3       <df>    $1,275~         30.2 x~ L9E1J1  /real-~ 1      
 2 25049994 W5821034 "The B~ 3       <df>    $1,499~         29.5 x~ L6M2Z8  /real-~ 1      
 3 25049980 N5821033 "Immac~ 3       <df>    $999,0~         20.01 ~ L4S2K9  /real-~ 1      
 4 25049978 N5821026 "Ravin~ 3       <df>    $990,0~         24.64 ~ L6B0G6  /real-~ 1      
 5 25049977 N5821022 "A Rea~ 2       <df>    $599,9~         NA      L3T4S3  /real-~ 1      
 6 25049976 N5821019 "**** ~ 4       <df>    $1,468~         40.03 ~ L3X2H9  /real-~ 1      
 7 25049973 E5821030 "7 Yea~ 4       <df>    $1,799~         151.71~ L0B1A0  /real-~ 1      
 8 25049971 E5821014 "This ~ 3       <df>    $849,0~         27.1 x~ M4J4C3  /real-~ 1      
 9 25049966 C5821039 "Brigh~ 1       <df>    $568,8~         NA      M2N0L2  /real-~ 1      
10 25049967 C5821042 "Come ~ 1       <df>    $599,0~         NA      M5V0G8  /real-~ 1      
11 25049965 C5821029 "Wow! ~ 1       <df>    $514,9~         NA      M3C1S5  /real-~ 1      
12 25049963 C5821025 "High ~ 2       <df>    $1,199~         NA      M5C0A6  /real-~ 1      
# ... with 26 more variables: Building$Bedrooms <chr>, $StoriesTotal <chr>, $Type <chr>,
#   $Ammenities <chr>, Property$Type <chr>, $Address <df[,5]>, $Photo <list>,
#   $Parking <list>, $ParkingSpaceTotal <chr>, $TypeId <chr>, $OwnershipType <chr>,
#   $ConvertedPrice <chr>, $OwnershipTypeGroupIds <list>, $ParkingType <chr>,
#   $PriceUnformattedValue <chr>, $AmmenitiesNearBy <chr>, PhotoChangeDateUTC <chr>,
#   HasNewImageUpdate <lgl>, Distance <chr>, RelativeURLEn <chr>, RelativeURLFr <chr>,
#   Media <list>, InsertedDateUTC <chr>, TimeOnRealtor <chr>, Tags <list>, ...
# i Use `colnames()` to see all variable names
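
The result lists a `RelativeURLEn` column, and the relative URL values visible in the printout start with `/real-`. Assuming `RelativeURLEn` holds site-relative paths of that form, the full links asked for in the question could be built by prefixing the site's base URL. The sketch below uses a small stand-in data frame with made-up paths instead of the real `content$Results`, so it runs without a network call.

```r
# Stand-in for content$Results; the paths are hypothetical examples
results <- data.frame(
  Id = c("25049990", "25049994"),
  RelativeURLEn = c(
    "/real-estate/25049990/example-listing-a",
    "/real-estate/25049994/example-listing-b"
  ),
  stringsAsFactors = FALSE
)

# Prefix the base URL to get absolute listing links
links <- paste0("https://www.realtor.ca", results$RelativeURLEn)
links
```

With the real response, `results` would be replaced by `content$Results`.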
