Extracting and evaluating words in a text string against another dataset

I have two datasets that I want to evaluate against each other. A greatly simplified example follows:

library(dplyr)
library(tidyverse)
library(sqldf)
library(dbplyr)
library(httr)
library(purrr)
library(jsonlite)
library(magrittr)
library(tidyr)
library(tidytext)

    people_records_ex <- structure(list(
      id = c(123L, 456L, 789L),
      name = c("Anna Wilson", "Jeff Smith", "Craig Mills"),
      biography = c(
        "Student at Ohio State University. Class of 2024.",
        "Second year law student at Stanford. Undergrad at William & Mary",
        "University of North Texas Volleyball!"
      )
    ), class = "data.frame", row.names = c(NA, -3L))

    college_records_ex <- structure(list(
      college_id = c(234L, 567L, 891L, 345L),
      college_name = c("Ohio State University", "Stanford", "William & Mary", "University of North Texas"),
      college_city = c("Columbus", "Stanford", "Williamsburg", "Denton"),
      college_state = c("OH", "CA", "VA", "TX")
    ), class = "data.frame", row.names = c(NA, -4L))

I am trying to match the contents of the biography text string in people_records_ex against college_name in college_records_ex, so that the final output looks like this:

    final_records_ex <- structure(list(
      id = c(123L, 456L, 456L, 789L),
      name = c("Anna Wilson", "Jeff Smith", "Jeff Smith", "Craig Mills"),
      college_name = c("Ohio State University", "Stanford", "William & Mary", "University of North Texas"),
      college_city = c("Columbus", "Stanford", "Williamsburg", "Denton"),
      college_state = c("OH", "CA", "VA", "TX")
    ), class = "data.frame", row.names = c(NA, -4L))

Or, as a more visual example of my desired final output:

[Image: table of the desired final output]

But when I run the following code, it returns zero rows, which is not what I want:

college_extract <- people_records_ex %>%
  left_join(college_records_ex, by = c("biography" = "college_name")) %>%
  filter(!is.na(college_state)) %>% dplyr::select(id, name, college_name, college_city, college_state) %>% distinct()

What am I doing wrong, and what would the correct version look like?

Reply from a uj5u.com user:

Here is a very neat and straightforward solution using fuzzy_join:

library(fuzzyjoin)
library(stringr)
library(dplyr)
fuzzy_join(
  people_records_ex, college_records_ex,
  by  = c("biography" = "college_name"),
  match_fun = str_detect,
  mode = "left"
) %>%
  select(-biography)
#>    id        name college_id              college_name college_city college_state
#> 1 123 Anna Wilson        234     Ohio State University     Columbus            OH
#> 2 456  Jeff Smith        567                  Stanford     Stanford            CA
#> 3 456  Jeff Smith        891            William & Mary Williamsburg            VA
#> 4 789 Craig Mills        345 University of North Texas       Denton            TX
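
For context on why the original left_join returned nothing: an equality join only pairs rows whose key values are identical, and no biography is identical to a college name. The fuzzy join works because it tests containment instead. A minimal base R sketch of that difference, using the example values from the question (the variable names `exact` and `contains` are my own, and `fixed = TRUE` is my choice to treat each name as literal text so that punctuation in a college name is not read as regex syntax):

```r
biography <- c(
  "Student at Ohio State University. Class of 2024.",
  "Second year law student at Stanford. Undergrad at William & Mary",
  "University of North Texas Volleyball!"
)
college_name <- c(
  "Ohio State University", "Stanford",
  "William & Mary", "University of North Texas"
)

# Equality test for every biography/college pair, which is what a plain join compares
exact <- outer(biography, college_name, "==")
any(exact)

# Containment test: one column per college, one row per biography
contains <- sapply(college_name, function(nm) grepl(nm, biography, fixed = TRUE))
rowSums(contains)
```

Here `any(exact)` is FALSE, while `rowSums(contains)` counts how many colleges each biography mentions.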

Reply from a uj5u.com user:

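Assuming the college names in the biographies are spelled exactly as they appear in the college table, and the datasets are relatively small, you can build a single regular expression from all the college names and generate every match, as follows: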

library(dplyr)

people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c(
  "Anna Wilson",
  "Jeff Smith", "Craig Mills"
), biography = c(
  "Student at Ohio State University. Class of 2024.",
  "Second year law student at Stanford. Undergrad at William & Mary",
  "University of North Texas Volleyball!"
)), class = "data.frame", row.names = c(
  NA,
  -3L
)) %>% tibble::tibble()


college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c(
  "Ohio State University",
  "Stanford", "William & Mary", "University of North Texas"
), college_city = c(
  "Columbus",
  "Stanford", "Williamsburg", "Denton"
), college_state = c(
  "OH",
  "CA", "VA", "TX"
)), class = "data.frame", row.names = c(NA, -4L)) %>%
  tibble::tibble()

# join college names in a regex pattern
colleges_regex <- paste0(college_records_ex$college_name, collapse = "|")

colleges_regex
#> [1] "Ohio State University|Stanford|William & Mary|University of North Texas"

# match all against bio, giving a list-column of matches
people_records_ex %>%
  mutate(matches = stringr::str_match_all(biography, colleges_regex))
#> # A tibble: 3 × 4
#>      id name        biography                                           matches 
#>   <int> <chr>       <chr>                                               <list>  
#> 1   123 Anna Wilson Student at Ohio State University. Class of 2024.    <chr[…]>
#> 2   456 Jeff Smith  Second year law student at Stanford. Undergrad at … <chr[…]>
#> 3   789 Craig Mills University of North Texas Volleyball!               <chr[…]>

# unnest the list column longer to give 1 row per person per match
people_records_ex %>%
  mutate(matches = stringr::str_match_all(biography, colleges_regex)) %>%
  tidyr::unnest_longer(matches)
#> # A tibble: 4 × 4
#>      id name        biography                                            match…¹
#>   <int> <chr>       <chr>                                                <chr>  
#> 1   123 Anna Wilson Student at Ohio State University. Class of 2024.     Ohio S…
#> 2   456 Jeff Smith  Second year law student at Stanford. Undergrad at W… Stanfo…
#> 3   456 Jeff Smith  Second year law student at Stanford. Undergrad at W… Willia…
#> 4   789 Craig Mills University of North Texas Volleyball!                Univer…
#> # … with abbreviated variable name ¹matches[,1]

Created on 2022-10-26 with reprex v2.0.2

The result can then be joined back to the college table to annotate each match with the college's information.
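
That join-back step might look roughly like the sketch below. Note two choices that are mine rather than part of the answer above: it uses stringr::str_extract_all (which returns plain character vectors) in place of str_match_all, and it stores the matches in a column named `college_name` so the join key lines up directly. The shortened data frame names `people` and `colleges` are also just for illustration.

```r
library(dplyr)

people <- tibble::tibble(
  id = c(123L, 456L, 789L),
  name = c("Anna Wilson", "Jeff Smith", "Craig Mills"),
  biography = c(
    "Student at Ohio State University. Class of 2024.",
    "Second year law student at Stanford. Undergrad at William & Mary",
    "University of North Texas Volleyball!"
  )
)

colleges <- tibble::tibble(
  college_id = c(234L, 567L, 891L, 345L),
  college_name = c("Ohio State University", "Stanford",
                   "William & Mary", "University of North Texas"),
  college_city = c("Columbus", "Stanford", "Williamsburg", "Denton"),
  college_state = c("OH", "CA", "VA", "TX")
)

# One alternation pattern covering every college name
colleges_regex <- paste0(colleges$college_name, collapse = "|")

annotated <- people %>%
  # list-column holding every college name found in each biography
  mutate(college_name = stringr::str_extract_all(biography, colleges_regex)) %>%
  # one row per person per matched college
  tidyr::unnest_longer(college_name) %>%
  # the extracted text is now an exact key, so an ordinary join works
  left_join(colleges, by = "college_name") %>%
  select(id, name, college_name, college_city, college_state)

annotated
```

People whose biography mentions no listed college drop out at the unnest step by default, which matches the filter in the question.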

Reply from a uj5u.com user:

In base R, you could do the following:

do.call(rbind, lapply(college_records_ex$college_name, 
                      \(x) people_records_ex[grep(x, people_records_ex$biography),1:2])) |> 
  cbind(college_records_ex[-1])

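This matches each college name against the biographies, keeps the first two columns (id and name) of the matching rows, binds the result to the second data.frame, and drops that data frame's first column (college_id).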

    id        name              college_name college_city college_state
1  123 Anna Wilson     Ohio State University     Columbus            OH
2  456  Jeff Smith                  Stanford     Stanford            CA
21 456  Jeff Smith            William & Mary Williamsburg            VA
3  789 Craig Mills University of North Texas       Denton            TX
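
One caveat, as far as I can tell: the cbind() above lines up only because every college in the example is mentioned in exactly one biography. If a college is never mentioned, the row counts differ and the bind fails. A hedged base R sketch that pairs rows explicitly instead is shown below; the extra "Rice University" row and its id are hypothetical, added only to show an unmatched college, and `fixed = TRUE` is my choice to match names as literal text.

```r
people <- data.frame(
  id = c(123L, 456L, 789L),
  name = c("Anna Wilson", "Jeff Smith", "Craig Mills"),
  biography = c(
    "Student at Ohio State University. Class of 2024.",
    "Second year law student at Stanford. Undergrad at William & Mary",
    "University of North Texas Volleyball!"
  ),
  stringsAsFactors = FALSE
)

colleges <- data.frame(
  college_id = c(234L, 567L, 891L, 345L, 999L),  # 999L is a made-up id
  college_name = c("Ohio State University", "Stanford", "William & Mary",
                   "University of North Texas", "Rice University"),
  college_city = c("Columbus", "Stanford", "Williamsburg", "Denton", "Houston"),
  college_state = c("OH", "CA", "VA", "TX", "TX"),
  stringsAsFactors = FALSE
)

# Logical matrix: rows are people, columns are colleges
hits <- sapply(colleges$college_name,
               function(nm) grepl(nm, people$biography, fixed = TRUE))

# Row/column index of every TRUE cell, i.e. every person-college pair
idx <- which(hits, arr.ind = TRUE)

matched <- cbind(
  people[idx[, "row"], c("id", "name")],
  colleges[idx[, "col"], c("college_name", "college_city", "college_state")]
)
matched <- matched[order(matched$id), ]
rownames(matched) <- NULL
matched
```

Because both sides are indexed from the same `idx`, they always have the same number of rows, regardless of how many matches each college has.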

