Scraping <li> elements with rvest

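I want to scrape https://www.deutsche-biographie.de/. Specifically, I am interested in the following information about each person:

Name
Year of birth
Year of death
Profession
Place of birth ("geburt" in the source code) and its coordinates
Place of death ("tod" in the source code) and its coordinates
Places of activity ("wirk" in the source code) and their coordinates

With the code below, I scraped the name, year of birth, year of death, and profession.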

library(rvest)
library(dplyr)

page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("#secondColumn p") %>% html_text()
result = data.frame(name, information, stringsAsFactors = FALSE)

#manipulate data in columns
result$yearofbirth = sub("(^[^-]+)-.*", "\\1", result$information) #extract characters before dash
result$yearofdeath = sub(',.*$','', result$information)
result$yearofdeath = sub('.*-','', result$yearofdeath) #extract characters after dash
result$profession = sub("^.*?,", "", result$information) #extract characters after comma
result$profession = trimws(result$profession, whitespace = "[ \t\r\n]") #trim leading and trailing white space
result$information = NULL
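
The regular expressions above can be checked offline on a single string before running them against the live page. The sample text below is hypothetical: it assumes the paragraph has the shape "birth - death, newline, profession", which may differ from what the site actually returns.

```r
# Hypothetical sample; the real text on the page may be formatted differently
information <- "1568 - 1622, \n        Alchemist"

# Everything before the first dash
year_of_birth <- trimws(sub("(^[^-]+)-.*", "\\1", information))

# Drop everything from the first comma on, then everything up to the dash
year_of_death <- trimws(sub(".*-", "", sub(",.*$", "", information)))

# Everything after the first comma
profession <- trimws(sub("^.*?,", "", information))

print(c(year_of_birth, year_of_death, profession))
```

The calls to trimws() remove the spaces and line breaks left around each piece.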

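However, I am struggling to scrape the places of birth/death/activity from the <li> elements. The source code looks like this: data-orte holds the places of birth/death/activity (geburt/tod/wirk), and data-name holds the person's name.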

 <li class="media treffer-liste-elem" id="treffer-sfz55763" data-orte="[email protected],9.6596678@geburt;[email protected],9.6596678@wirk;[email protected],10.1371858@wirk;[email protected],11.6399609@wirk;[email protected],12.109015599915@wirk;Frankfurt/[email protected],14.5544166@wirk;[email protected],9.54054973309832@wirk;[email protected],11.8767269@wirk;[email protected],11.3430347@wirk;[email protected],7.5969912@wirk;[email protected],20.5105165@wirk;[email protected],18.6542829@wirk;[email protected],14.4212126@wirk;[email protected],4.9001115@wirk;[email protected],8.6805975@wirk;[email protected],12.109015599915@wirk;[email protected],11.6399609@tod" data-name="Maier, Michael">

I would be very grateful for any hints on how to scrape these places! Best, Natalie

Answer:

I hope this solution helps:

library(stringr) # provides str_split()

page %>%
  html_elements("#secondColumn > ul") %>%
  html_children() %>%
  html_attr("data-orte") %>%
  str_split(";")
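
Each string produced by the split appears to follow the pattern place@latitude,longitude@type. As a minimal sketch, assuming every entry has exactly these three fields, one data-orte value could be turned into a small data frame with base R. The helper name parse_orte and the sample value are made up for illustration; they are not part of rvest or of the site.

```r
# Hypothetical data-orte value with made-up place names and coordinates
orte <- "Place A@50.1,8.6@geburt;Place B@51.2,9.7@wirk;Place C@52.3,10.8@tod"

# Illustrative helper (not part of rvest): one row per place
parse_orte <- function(x) {
  entries <- strsplit(x, ";", fixed = TRUE)[[1]]                  # one entry per place
  parts <- do.call(rbind, strsplit(entries, "@", fixed = TRUE))   # columns: name, coords, type
  coords <- do.call(rbind, strsplit(parts[, 2], ",", fixed = TRUE))
  data.frame(
    place = parts[, 1],
    lat = as.numeric(coords[, 1]),
    lon = as.numeric(coords[, 2]),
    type = parts[, 3],
    stringsAsFactors = FALSE
  )
}

parse_orte(orte)
```

The sketch does not handle a missing attribute (NA) or an entry with fewer than three fields, so such cases would need to be filtered out first.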

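Answer:

Another option to achieve the desired result might look like this:

The first step is similar to the solution proposed by @Kafe: get the place information from the data-orte attribute and split it on ; to get a list of places.
In the second step, I use lapply to put the places of birth, death, and activity into separate columns of the result data frame.
In the third step, I make heavy use of tidyr::extract, which makes it easy to extract multiple pieces of information from one string and put them into separate columns in a single step.

Note: I also used a different approach to extract the years of birth and death.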

library(rvest)
library(dplyr)

page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
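# names_sep gives the unnamed list elements column names (newer tidyr versions require it)
result <- tidyr::unnest_wider(result, information, names_sep = "_") %>%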
  rename(years = 2, profession = 3) %>% 
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")

places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")

result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])

result <- result %>% 
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")

result
#> # A tibble: 10 × 9
#>    name   year_of_birth year_of_death profession place_of_birth place_of_birth_…
#>    <chr>  <chr>         <chr>         <chr>      <chr>          <chr>           
#>  1 Meier… 1718          1777          Philosoph  Ammendorf bei… 51.4265204,11.9…
#>  2 Meyer… 1772          1849          Jurist; B… Frankfurt/Main 50.1432793,8.68…
#>  3 Meier… 1809          1898          Bremer Ka… Bremen         53.0758099,8.80…
#>  4 Major… 1502          1574          lutherisc… Nürnberg       49.4538501,11.0…
#>  5 Meyer… 1810          1874          schweizer… Sursee Kanton… 47.1774826,8.10…
#>  6 Maier… 1568          1622          Alchemist… Rendsburg      54.3012661,9.65…
#>  7 Meier… 1692          1745          Jurist; A… Bayreuth       49.9427202,11.5…
#>  8 Mejer… 1818          1893          Jurist; P… Zellerfeld (H… 51.804126,10.33…
#>  9 Meyer… 1474          1548          Bürgermei… Basel          47.5429886,7.59…
#> 10 Hirsc… 1770          1851          Mathemati… Friesack (Mit… 52.7395263,12.5…
#> # … with 3 more variables: place_of_death <chr>, place_of_death_coord <chr>,
#> #   place_of_activity <list>

Created on 2021-11-21 by the reprex package (v2.0.1)
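
One caveat with the unlist() step: if a person has no geburt or tod entry, the unlisted vector is shorter than the data frame and the assignment fails because of the length mismatch. A minimal sketch of a more defensive variant follows, using a hypothetical stand-in for places and an illustrative helper named first_of_type; neither comes from the original answer.

```r
# Hypothetical stand-in for `places`: one character vector per person
places <- list(
  c("Place A@50.1,8.6@geburt", "Place B@51.2,9.7@wirk", "Place C@52.3,10.8@tod"),
  c("Place D@48.0,11.5@wirk") # no birth or death place recorded
)

# Illustrative helper: first entry of the given type, or NA if there is none
first_of_type <- function(x, type) {
  hit <- x[grepl(paste0("@", type, "$"), x)]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# vapply always returns exactly one value per person
place_of_birth <- vapply(places, first_of_type, character(1), type = "geburt")
place_of_death <- vapply(places, first_of_type, character(1), type = "tod")

print(place_of_birth)
print(place_of_death)
```

Because vapply returns one value per person, the results can be assigned to result$place_of_birth and result$place_of_death without a length mismatch, with NA marking the missing places.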
