In R, how do I find the positions of all dictionary words in a data frame?

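I am analyzing company meetings, and I want to measure when people in a meeting bring up certain topics. Here, "when" means the position of the word within the text.

For example, across three meetings, at what point do people mention "unionizing" and the other words in my dictionary?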

df <- data.frame(text = c("we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.", "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"))

dict <- c("unions", "strike", "unionizing")

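Desired output:

text	count	word
we're meeting here today...	(position of word)	unionizing
hi all, unionizing...	(position of word)	unionizing
hi all, unionizing...	(position of word)	strike
hi all, unionizing...	(position of word)	unionizing
we will discuss unionizing...	(position of word)	unionizing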

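I previously asked a question about finding the first use of a word, and I tried to adapt the code from that question to this case, but without success.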

Reply from a uj5u.com user:

library(tidyverse)
library(tidytext)

df <- tibble(text = c("we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.", "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"))
dict <- tibble(words = c("unions", "strike", "unionizing"))

df %>% 
  unnest_tokens(output = "words",
                input = "text",
                drop = FALSE) %>% 
  group_by(text) %>% 
  mutate(word_count = row_number()) %>% 
  ungroup() %>% 
  inner_join(dict)
#> Joining, by = "words"
#> # A tibble: 5 × 3
#>   text                                                          words word_count
#>   <chr>                                                         <chr>      <int>
#> 1 we're meeting here today to talk about our earnings. we will… unio…         14
#> 2 hi all, unionizing and the on-going strike is at the top of … unio…          3
#> 3 hi all, unionizing and the on-going strike is at the top of … stri…          8
#> 4 hi all, unionizing and the on-going strike is at the top of … unio…         17
#> 5 we will discuss unionizing tomorrow, today the focus is our … unio…          4

Created on 2022-05-30 by the reprex package (v2.0.1)
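One thing to keep in mind when reading the positions above: they depend on how the text is split into tokens. The output places "strike" at position 8 in the second text, whereas counting space-separated chunks would give 7, which is consistent with the tokenizer breaking "on-going" into two tokens. The base R sketch below only imitates the two counting conventions with `strsplit`; the second regular expression is an assumption chosen for illustration and is not the tokenizer tidytext uses internally.

```r
# Second sample text from the question
txt <- "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals."

# Convention 1: split on single spaces, so "on-going" stays one token
tokens_space <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Convention 2 (rough imitation of word-level tokenizing, an assumption):
# split on runs of anything that is not a letter, digit, or apostrophe
tokens_word <- strsplit(tolower(txt), "[^[:alnum:]']+")[[1]]

# Position of "strike" under each convention
pos_space <- which(tokens_space == "strike")
pos_word  <- which(tokens_word == "strike")

print(c(space = pos_space, word = pos_word))
```

Whichever convention is used, it is worth applying the same one to every meeting so that positions stay comparable.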

Reply from a uj5u.com user:

Base R solution:

As a single record per observation:

# Create a regular expression to search with: 
# search_regex => character scalar
search_regex <- paste0(
  dict, 
  collapse = "|"
)

# For each observation, loop through and then flatten result into a 
# data.frame: res => data.frame
res <- do.call(
  rbind, 
  lapply(
    df$text,
    function(x){
      # Create an ordered vector of the words in observation: 
      # vec_of_words => character vector
      vec_of_words <- unlist(
        strsplit(
          x, 
          "\\s+"
        )
      )
      # Compute the index where any of the search are found in the vector:
      # idx => integer vector
      idx <- which(
        grepl(
          search_regex, 
          vec_of_words, 
          ignore.case = TRUE
        )
      )
      # Create a data.frame containing the desired result: 
      # data.frame => env
      data.frame(
        # Assign the observation to the text vector: 
        # text => character vector
        text = x, 
        # Create a string containing the index of matching words: 
        # count => character vector
        count = paste0(
          idx, 
          collapse = ", "
        ), 
        # Create a vector of matched words: words => character vector
        words = paste0(
          vec_of_words[idx], 
          collapse = ", "
        ),
        row.names = NULL,
        stringsAsFactors = FALSE
      )
    }
  )
)

With a new record for each matched word:

# Create a regular expression to search with: 
# search_regex => character scalar
search_regex <- paste0(
  dict, 
  collapse = "|"
)

# For each observation, loop through and then flatten result into a 
# data.frame: res => data.frame
res <- do.call(
  rbind, 
  lapply(
    df$text,
    function(x){
      # Create an ordered vector of the words in observation: 
      # vec_of_words => character vector
      vec_of_words <- unlist(
        strsplit(
          x, 
          "\\s+"
        )
      )
      # Compute the index where any of the search are found in the vector:
      # idx => integer vector
      idx <- which(
        grepl(
          search_regex, 
          vec_of_words, 
          ignore.case = TRUE
        )
      )
      # Create a data.frame containing the desired result: 
      # data.frame => env
      data.frame(
        # Assign the observation to the text vector: 
        # text => character vector
        text = x, 
        # Record the index of each matching word: 
        # count => integer vector
        count = idx, 
        # Create a vector of matched words: words => character vector
        words = vec_of_words[idx],
        row.names = NULL,
        stringsAsFactors = FALSE
      )
    }
  )
)
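Two properties of the approach above may matter on real data: `grepl` with an unanchored pattern also matches substrings of longer tokens, and the per-word variant calls `data.frame()` with a length-one `text` and a zero-length `idx` when a text contains no dictionary word, which raises an error. A minimal sketch of one possible variation follows, assuming exact, case-insensitive matching after stripping leading and trailing punctuation. The helper name `find_positions` and the fourth sample text are hypothetical and were added only for illustration.

```r
# Sample data from the question plus one hypothetical text without matches
texts <- c(
  "we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.",
  "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.",
  "we will discuss unionizing tomorrow, today the focus is our Q3 earnings",
  "no relevant words in this one"
)
dict <- c("unions", "strike", "unionizing")

# Hypothetical helper: one row per dictionary word found in a single text
find_positions <- function(x, dict) {
  # Split on whitespace, so positions count space-separated tokens
  tokens <- strsplit(x, "\\s+")[[1]]
  # Drop leading/trailing punctuation and lowercase before comparing
  cleaned <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens))
  # Exact matches only: no substring hits
  idx <- which(cleaned %in% dict)
  data.frame(
    # rep() keeps the columns the same length, even when idx is empty
    text = rep(x, length(idx)),
    count = idx,
    words = cleaned[idx],
    stringsAsFactors = FALSE
  )
}

res <- do.call(rbind, lapply(texts, find_positions, dict = dict))
print(res[, c("count", "words")])
```

Texts without any dictionary word simply contribute zero rows to `res`.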

Reply from a uj5u.com user:

In base R, the following five lines of code are enough:

pat <- sprintf("\\b(%s)\\b",paste(dict, collapse = '|'))
words <- regmatches(df$text, gregexpr(pat, df$text))
loc <- Map(pmatch, words, strsplit(df$text, " "))
df1 <- stack(setNames(words, seq_along(words)))
transform(df1, location = unlist(loc), text = df$text[ind])

      values ind location                                                                                                                    text
1 unionizing   1       14                           we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.
2 unionizing   2        3 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
3     strike   2        7 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
4 unionizing   2       16 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
5 unionizing   3        4                                                 we will discuss unionizing tomorrow, today the focus is our Q3 earnings
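The repeated "unionizing" in the second text gets two different locations (3 and 16) because `pmatch()` defaults to `duplicates.ok = FALSE`: once a table entry has been matched, it is excluded from later matches. The sketch below isolates that step on the second sample text. The last two lines illustrate a caveat under an assumed input: `pmatch()` falls back to partial matching when there is no exact match, and an ambiguous partial match returns `NA`.

```r
# Tokens of the second sample text, split on single spaces
tokens <- strsplit(
  "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.",
  " ",
  fixed = TRUE
)[[1]]

# Dictionary words in the order they occur in the text
found <- c("unionizing", "strike", "unionizing")

# Each matched token is used only once, so the repeat moves on to the
# next occurrence instead of returning position 3 twice
loc <- pmatch(found, tokens)
print(loc)

# Caveat (hypothetical input): with punctuation attached to both
# occurrences there is no exact match and the partial match is ambiguous
ambiguous <- pmatch("unionizing", c("unionizing,", "and", "unionizing."))
print(ambiguous)
```

If tokens can carry trailing punctuation, stripping it before calling `pmatch()` is one way to avoid the `NA` case.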

Reply from a uj5u.com user:

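Using quanteda:

First tokenize the text and remove the punctuation; otherwise each punctuation mark is counted as a token as well. The advantage of using kwic is that you can easily see the words before and after the word you are looking for.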

library(quanteda)

x <- kwic(tokens(df$text, remove_punct = TRUE), dict)
data.frame(x)

  docname from to                             pre    keyword                        post    pattern
1   text1   14 14   earnings we will also discuss unionizing                     efforts unionizing
2   text2    3  3                          hi all unionizing  and the on-going strike is unionizing
3   text2    7  7 all unionizing and the on-going     strike            is at the top of     strike
4   text2   16 16       top of our agenda because unionizing threatens our revenue goals unionizing
5   text3    4  4                 we will discuss unionizing tomorrow today the focus is unionizing

