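I am analyzing company meetings and want to measure when people in a meeting bring up certain topics. "When" here means the position of the word within the text.
For example, across these three meetings, at what point do people mention "unionizing" and the other words in my dictionary?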
df <- data.frame(text = c("we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.", "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"))
dict <- c("unions", "strike", "unionizing")
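Desired output:
| text | count | word |
|---|---|---|
| we're meeting here today... | (position of the word) | unionizing |
| hi all, unionizing... | (position of the word) | unionizing |
| hi all, unionizing... | (position of the word) | strike |
| hi all, unionizing... | (position of the word) | unionizing |
| we will discuss unionizing... | (position of the word) | unionizing |
I previously asked a question about finding the first use of a word, and I tried to adapt that code here, but without success.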
Answer from a uj5u.com user:
library(tidyverse)
library(tidytext)
df <- tibble(text = c("we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.", "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"))
dict <- tibble(words = c("unions", "strike", "unionizing"))
df %>%
  unnest_tokens(output = "words",
                input = "text",
                drop = FALSE) %>%
  group_by(text) %>%
  mutate(word_count = row_number()) %>%
  ungroup() %>%
  inner_join(dict)
#> Joining, by = "words"
#> # A tibble: 5 × 3
#> text words word_count
#> <chr> <chr> <int>
#> 1 we're meeting here today to talk about our earnings. we will… unio… 14
#> 2 hi all, unionizing and the on-going strike is at the top of … unio… 3
#> 3 hi all, unionizing and the on-going strike is at the top of … stri… 8
#> 4 hi all, unionizing and the on-going strike is at the top of … unio… 17
#> 5 we will discuss unionizing tomorrow, today the focus is our … unio… 4
Created on 2022-05-30 by the reprex package (v2.0.1)
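Note that the reported positions depend on how the text is tokenized. In the output above, "strike" lands at position 8 rather than 7, which is consistent with "on-going" being split into two tokens. The base R sketch below only approximates that behavior and is not tidytext's actual tokenizer; `word_positions()` is a hypothetical helper, and the split pattern is an assumption (lowercase the text, then break on anything that is not a letter, digit, or apostrophe).

```r
# Hypothetical helper: approximate word-level tokenization in base R.
word_positions <- function(text, dict) {
  # Lowercase, then split on runs of characters that are not
  # letters, digits, or apostrophes (so "on-going" -> "on", "going").
  tokens <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop an empty leading token, if any
  idx <- which(tokens %in% dict)
  data.frame(words = tokens[idx], word_count = idx, stringsAsFactors = FALSE)
}

txt <- "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals."
word_positions(txt, c("unions", "strike", "unionizing"))
```

For this sample text the sketch gives positions 3, 8 and 17, the same as the tibble above. Splitting on whitespace alone would keep "on-going" as one token and shift the later positions down by one.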
Answer from a uj5u.com user:
Base R solution:
As a single record per observation:
# Create a regular expression to search with:
# search_regex => character scalar
search_regex <- paste0(
  dict,
  collapse = "|"
)
# For each observation, loop through and then flatten result into a
# data.frame: res => data.frame
res <- do.call(
  rbind,
  lapply(
    df$text,
    function(x){
      # Create an ordered vector of the words in observation,
      # splitting on runs of whitespace:
      # vec_of_words => character vector
      vec_of_words <- unlist(
        strsplit(
          x,
          "\\s+"
        )
      )
      # Compute the index where any of the search terms are found in the vector:
      # idx => integer vector
      idx <- which(
        grepl(
          search_regex,
          vec_of_words,
          ignore.case = TRUE
        )
      )
      # Create a data.frame containing the desired result (one row):
      data.frame(
        # Assign the observation to the text vector:
        # text => character vector
        text = x,
        # Create a string containing the index of matching words:
        # count => character vector
        count = paste0(
          idx,
          collapse = ", "
        ),
        # Create a string of the matched words: words => character vector
        words = paste0(
          vec_of_words[idx],
          collapse = ", "
        ),
        row.names = NULL,
        stringsAsFactors = FALSE
      )
    }
  )
)
With a new record for each matched word:
# Create a regular expression to search with:
# search_regex => character scalar
search_regex <- paste0(
  dict,
  collapse = "|"
)
# For each observation, loop through and then flatten result into a
# data.frame: res => data.frame
res <- do.call(
  rbind,
  lapply(
    df$text,
    function(x){
      # Create an ordered vector of the words in observation,
      # splitting on runs of whitespace:
      # vec_of_words => character vector
      vec_of_words <- unlist(
        strsplit(
          x,
          "\\s+"
        )
      )
      # Compute the index where any of the search terms are found in the vector:
      # idx => integer vector
      idx <- which(
        grepl(
          search_regex,
          vec_of_words,
          ignore.case = TRUE
        )
      )
      # Create a data.frame containing the desired result (one row per match):
      data.frame(
        # Repeat the observation once per match, so that a text with no
        # matches yields zero rows instead of an error:
        # text => character vector
        text = rep(x, length(idx)),
        # Positions of the matching words:
        # count => integer vector
        count = idx,
        # Create a vector of matched words: words => character vector
        words = vec_of_words[idx],
        row.names = NULL,
        stringsAsFactors = FALSE
      )
    }
  )
)
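One thing to watch with the code above: `grepl()` with a plain alternation matches substrings, so a token such as "strikers" or "unionizing," (with its trailing comma) is also reported. Below is a minimal sketch of a stricter variant, assuming whole-word matches are what is wanted. The names `loose_regex`, `strict_regex`, `tokens` and `clean_tokens` are illustrative and not part of the answer.

```r
dict <- c("unions", "strike", "unionizing")
tokens <- c("strikers", "strike", "unionizing,", "Unionizing")

# Plain alternation: matches anywhere inside a token.
loose_regex <- paste0(dict, collapse = "|")
loose_hits <- grepl(loose_regex, tokens, ignore.case = TRUE)

# Anchored alternation: the whole token must be a dictionary word,
# after stripping leading/trailing punctuation.
strict_regex <- paste0("^(", paste0(dict, collapse = "|"), ")$")
clean_tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
strict_hits <- grepl(strict_regex, clean_tokens, ignore.case = TRUE)

data.frame(tokens, loose_hits, strict_hits)
```

With the loose pattern all four tokens match; with the anchored pattern "strikers" is rejected while the other three still match.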
Answer from a uj5u.com user:
In base R we can use the following 5 lines of code:
pat <- sprintf("\\b(%s)\\b",paste(dict, collapse = '|'))
words <- regmatches(df$text, gregexpr(pat, df$text))
loc <- Map(pmatch, words, strsplit(df$text, " "))
df1 <- stack(setNames(words, seq_along(words)))
transform(df1, location = unlist(loc), text = df$text[ind])
values ind location text
1 unionizing 1 14 we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.
2 unionizing 2 3 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
3 strike 2 7 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
4 unionizing 2 16 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
5 unionizing 3 4 we will discuss unionizing tomorrow, today the focus is our Q3 earnings
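The `pmatch()` step copes with repeated words because, with the default `duplicates.ok = FALSE`, each element of the table can be matched only once, so a second "unionizing" is paired with the next unused token. A small sketch of that behavior and of one caveat follows; the toy vectors are made up for illustration.

```r
# Exact matches are consumed in order, so repeats get distinct positions.
repeat_pos <- pmatch(c("a", "a"), c("x", "a", "y", "a"))
repeat_pos

# Caveat: if the word only appears with punctuation attached, pmatch()
# falls back to partial matching, and an ambiguous partial match gives NA.
ambiguous_pos <- pmatch("unionizing", c("unionizing,", "unionizing."))
ambiguous_pos
```

The first call returns 2 and 4. The second returns `NA`, so texts in which a dictionary word is always followed directly by punctuation may need the punctuation stripped before the lookup.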
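Answer from a uj5u.com user:
Using quanteda:
Tokenize first and remove punctuation, otherwise each punctuation mark is counted as a token. The advantage of using kwic is that you can easily see the words before and after the word you are looking for.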
library(quanteda)
x <- kwic(tokens(df$text, remove_punct = TRUE), dict)
data.frame(x)
docname from to pre keyword post pattern
1 text1 14 14 earnings we will also discuss unionizing efforts unionizing
2 text2 3 3 hi all unionizing and the on-going strike is unionizing
3 text2 7 7 all unionizing and the on-going strike is at the top of strike
4 text2 16 16 top of our agenda because unionizing threatens our revenue goals unionizing
5 text3 4 4 we will discuss unionizing tomorrow today the focus is unionizing
