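I am working with the realtor.ca map search page (the full URL is in the code below).
For example, by manually inspecting the <div class = "cardcon"> section, I found the following links, which are the ones I need: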
# desired results
- https://www.realtor.ca/real-estate/25050003/lot-1-norcross-rd-duncan-west-duncan
- https://www.realtor.ca/real-estate/25050002/39-legacy-lane-hamilton-ancaster
- https://www.realtor.ca/real-estate/25049996/53-16-fourth-st-orangeville-orangeville
- etc.
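I noticed that all of these links sit inside the following kind of HTML structure: <a href="*****INSERT LINK HERE****" data-binding="href=DetailsURL" target="_blank">
My question: using the R programming language, is it possible to save every link on this page that is contained in this <a href = .... target="_blank"> structure?
For example, I tried this code: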
library(rvest)
library(httr)
library(XML)
url<-"https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434,-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"
# making http request
resource <- GET(url)
# converting all the data to HTML format
parse <- htmlParse(resource)
# scraping the href attribute of every <a> tag
links <- xpathSApply(parse, path="//a", xmlGetAttr, "href")
page <-read_html(links)
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
But I am not sure how to finish this.
Can someone tell me what to do next?
Thanks!
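Answer from a uj5u.com user:
Even though the link is URL-encoded, you are better off calling their API directly. Look at the Network tab of your browser's developer tools and you will find the API request that the page itself makes.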

The parameters encoded in your URL can be found in the Payload tab. With httr2 you can retrieve the same information that the website does.
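As a side note, the parameters can also be read straight out of the URL from the question, since everything after the `#` is a list of `key=value` pairs. The following is only a minimal base-R sketch; the variable names (`fragment`, `pairs`, `params`) are illustrative, and it assumes that every pair contains exactly one `=`.

```r
# URL from the question; everything after '#' encodes the map/search state
url <- "https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434,-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"

# Drop everything up to and including the first '#'
fragment <- sub("^[^#]*#", "", url)

# Split into "key=value" strings, then split each one at '='
pairs <- strsplit(strsplit(fragment, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Build a named list: names are the keys, elements are the decoded values
params <- setNames(
  lapply(pairs, function(p) utils::URLdecode(p[2])),
  vapply(pairs, function(p) p[1], character(1))
)

str(params)
```

These are the same kind of fields that the httr2 request below sends as its form body.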
library(tidyverse)
library(httr2)

# Query the search endpoint and parse the JSON response
content <- "https://api2.realtor.ca/Listing.svc/PropertySearch_Post" %>%
  request() %>%
  req_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36") %>%
  # search parameters, sent as a form-encoded body
  req_body_form(
    ZoomLevel = 4,
    LatitudeMax = '67.16743',
    LongitudeMax = '-56.40166',
    LatitudeMin = '-5.70993',
    LongitudeMin = '-139.10674',
    CurrentPage = 2,
    Sort = '6-D',
    PropertyTypeGroupID = 1,
    PropertySearchTypeId = 1,
    TransactionTypeId = 2,
    Currency = 'CAD',
    RecordsPerPage = 36,
    ApplicationId = 1,
    CultureId = 1,
    Version = '7.0'
  ) %>%
  req_headers('referer' = 'https://www.realtor.ca/') %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE)

# the listings are in the `Results` element of the response
content %>%
  getElement('Results') %>%
  as_tibble()
# A tibble: 12 x 21
Id MlsNum~1 Publi~2 Build~3 Indiv~4 Prope~5 Busin~6 Land$~7 Posta~8 Relat~9 Statu~*
<chr> <chr> <chr> <chr> <list> <chr> <df[,0> <chr> <chr> <chr> <chr>
1 25049990 W5821020 "Rare ~ 3 <df> $1,275~ 30.2 x~ L9E1J1 /real-~ 1
2 25049994 W5821034 "The B~ 3 <df> $1,499~ 29.5 x~ L6M2Z8 /real-~ 1
3 25049980 N5821033 "Immac~ 3 <df> $999,0~ 20.01 ~ L4S2K9 /real-~ 1
4 25049978 N5821026 "Ravin~ 3 <df> $990,0~ 24.64 ~ L6B0G6 /real-~ 1
5 25049977 N5821022 "A Rea~ 2 <df> $599,9~ NA L3T4S3 /real-~ 1
6 25049976 N5821019 "**** ~ 4 <df> $1,468~ 40.03 ~ L3X2H9 /real-~ 1
7 25049973 E5821030 "7 Yea~ 4 <df> $1,799~ 151.71~ L0B1A0 /real-~ 1
8 25049971 E5821014 "This ~ 3 <df> $849,0~ 27.1 x~ M4J4C3 /real-~ 1
9 25049966 C5821039 "Brigh~ 1 <df> $568,8~ NA M2N0L2 /real-~ 1
10 25049967 C5821042 "Come ~ 1 <df> $599,0~ NA M5V0G8 /real-~ 1
11 25049965 C5821029 "Wow! ~ 1 <df> $514,9~ NA M3C1S5 /real-~ 1
12 25049963 C5821025 "High ~ 2 <df> $1,199~ NA M5C0A6 /real-~ 1
# ... with 26 more variables: Building$Bedrooms <chr>, $StoriesTotal <chr>, $Type <chr>,
# $Ammenities <chr>, Property$Type <chr>, $Address <df[,5]>, $Photo <list>,
# $Parking <list>, $ParkingSpaceTotal <chr>, $TypeId <chr>, $OwnershipType <chr>,
# $ConvertedPrice <chr>, $OwnershipTypeGroupIds <list>, $ParkingType <chr>,
# $PriceUnformattedValue <chr>, $AmmenitiesNearBy <chr>, PhotoChangeDateUTC <chr>,
# HasNewImageUpdate <lgl>, Distance <chr>, RelativeURLEn <chr>, RelativeURLFr <chr>,
# Media <list>, InsertedDateUTC <chr>, TimeOnRealtor <chr>, Tags <list>, ...
# i Use `colnames()` to see all variable names
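To get from this table back to the full links asked for in the question, the relative paths can be prefixed with the site's base URL. The sketch below does not call the API: it uses a small hand-made data frame as a stand-in for `Results`, filled with the paths from the question's example links. That the `RelativeURLEn` column (listed in the output above) holds paths of this form is an assumption, and `results`, `base_url` and `links` are illustrative names.

```r
# Hand-made stand-in for the `Results` table (NOT real API output).
# The paths come from the example links in the question; assuming
# RelativeURLEn contains site-relative paths like these.
results <- data.frame(
  Id = c("25050003", "25050002", "25049996"),
  RelativeURLEn = c(
    "/real-estate/25050003/lot-1-norcross-rd-duncan-west-duncan",
    "/real-estate/25050002/39-legacy-lane-hamilton-ancaster",
    "/real-estate/25049996/53-16-fourth-st-orangeville-orangeville"
  ),
  stringsAsFactors = FALSE
)

# Prefix each relative path with the site root to get an absolute link
base_url <- "https://www.realtor.ca"
links <- paste0(base_url, results$RelativeURLEn)

print(links)
```

With the real response, the same idea would presumably be `paste0("https://www.realtor.ca", content$Results$RelativeURLEn)`, provided the column behaves as assumed here.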
