My input:
library(tidyverse)
library(stringi)
tdf<-data.frame("foo"=c('|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|ReviewNG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|ReviewNG-BB.2|NG-BB.3','|TI'),
"bar"=c('|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI|ReviewNG-BB.2','|AI'),
"xyz" = c('|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ICV|NG-AI','|ReviewNG-ICV|TI|BB.2',
'|ReviewNG-ICV|TI|BB.2','|ReviewNG-ICV|TI|BB.2','|ReviewNG-ICV|TI|BB.2','|ICV'),
"gaz" = c('|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI',
'|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|BB.3|ReviewNG-AI|NG-TI','|NG-BB.2|ICV|AI|TI','|NG-BB.2|ICV|AI|TI','|NG-BB.2|ICV|AI|TI',
'|NG-BB.2|ICV|AI|TI','|BB.2'))
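I am trying to count how many times each of my labels occurs in tdf. Every label has four "forms": the total number of occurrences, ReviewNG-label, NG-label, and the "pure" form |label or |label|. For example, the label AI has a total over all matches, plus ReviewNG-AI, NG-AI, and the pure form |AI or |AI|. Here is my code: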
pt_t <- c("AI" )
sum(stringi::stri_count_fixed(tdf, regex(pt_t)))
pt_rng <- c("ReviewNG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_rng)))
pt_ng<-c("NG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_ng)))
pt<-c("|AI","|AI|")
sum(stringi::stri_count_fixed(tdf, regex(pt)))
My output:
Warning in stringi::stri_count_fixed(tdf, regex(pt_t)) :
argument is not an atomic vector; coercing
[1] 30
Warning in stringi::stri_count_fixed(tdf, regex(pt_rng)) :
argument is not an atomic vector; coercing
[1] 7
Warning in stringi::stri_count_fixed(tdf, regex(pt_ng)) :
argument is not an atomic vector; coercing
[1] 14
Warning in stringi::stri_count_fixed(tdf, regex(pt)) :
argument is not an atomic vector; coercing
[1] 15
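First of all, I don't fully understand the warning message. Now let's look at the counts: the total is fine, and ReviewNG-AI is still fine. But then there is a problem: for NG-AI, I understand that it double counts NG plus ReviewNG. And for the last one, the "pure" count of '|AI' or '|AI|', I don't understand at all how it can equal 15 when my manual count is 16.
I also tried stringr and the tidyverse, but the output here is clearly wrong: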
sum(str_count(tdf,pt))
res<-tdf %>%
summarise(across(everything(),
~sum(str_count(.x, paste(pt)))))
rowSums(res)
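Reply from a uj5u.com user:
Maybe this kind of solution helps. Martin has already explained the why and how, so we can take a different strategy: if all labels are separated by |, we can use pivot_longer and count them. Depending on your desired output: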
library(dplyr)
library(tidyr)
tdf %>%
  pivot_longer(
    everything()
  ) %>%
  mutate(value = sub('\\|', '', value)) %>%
  separate_rows(value, sep = "\\|") %>%
  group_by(name, value) %>%
  summarise(Labels = n())
   name  value         Labels
   <chr> <chr>          <int>
 1 bar   AI                12
 2 bar   BB.2               7
 3 bar   ReviewNG-BB.2      4
 4 foo   NG-BB.3            6
 5 foo   ReviewNG-BB.2     11
 6 foo   ReviewNG-BB.3      5
 7 foo   TI                 1
 8 gaz   AI                 4
 9 gaz   BB.2               1
10 gaz   BB.3               7
11 gaz   ICV                4
12 gaz   NG-BB.2            4
13 gaz   NG-TI              7
14 gaz   ReviewNG-AI        7
15 gaz   TI                 4
16 xyz   BB.2               4
17 xyz   ICV                8
18 xyz   NG-AI              7
19 xyz   ReviewNG-ICV       4
20 xyz   TI                 4
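If totals over the whole data frame are wanted rather than per-column counts, the same split-and-count idea can be sketched in base R. This is only an illustration: the small `df` below is hypothetical stand-in data with the same `|label|label` layout, not the original `tdf`, and the variable names are my own.

```r
# Hypothetical stand-in for tdf, using the same '|label|label' layout
df <- data.frame(
  foo = c("|ReviewNG-AI|NG-AI", "|AI"),
  bar = c("|AI|BB.2", "|AI"),
  stringsAsFactors = FALSE
)

# Flatten all columns into one character vector, then split on the literal pipe
labels <- unlist(strsplit(unlist(df, use.names = FALSE), "|", fixed = TRUE))

# Each string starts with '|', so every split yields a leading empty piece; drop it
labels <- labels[nzchar(labels)]

# Frequency of every distinct label across the whole data frame
counts <- table(labels)
print(counts)
```

Because the labels are compared as whole strings after splitting, `AI`, `NG-AI` and `ReviewNG-AI` end up in separate buckets and cannot be double counted.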
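Reply from a uj5u.com user:
Your problem is the use of special characters in the regular expression: in RegEx, | is reserved for "or". If we want to search for a literal |, we need to use \\|. So, for example: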
library(dplyr)
library(stringr)
pt <- c("\\|AI", "\\|AI\\|")
Now we want to count every occurrence of |AI and of |AI|, so the search pattern looks like this:
paste(pt, collapse = "|")
#> [1] "\\|AI|\\|AI\\|"
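As an alternative to escaping, stringr's `fixed()` modifier makes the pattern a literal string, so the pipe loses its special meaning. The sketch below uses three sample strings taken from the question's data; the names `x`, `n_fixed` and `n_regex` are just for illustration.

```r
library(stringr)

# Three sample values from the question's data
x <- c("|AI|BB.2", "|ICV|NG-AI", "|NG-BB.2|ICV|AI|TI")

# fixed() compares the pattern literally, so '|' needs no escaping
n_fixed <- str_count(x, fixed("|AI"))

# The same count with an escaped regular expression
n_regex <- str_count(x, "\\|AI")

print(n_fixed)
print(n_regex)
```

Note that every `|AI|` also contains `|AI`, so searching for `|AI` alone already covers both pure forms; the second alternative in the combined pattern does not add matches.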
So, putting it all together:
tdf %>%
  summarise(across(everything(),
                   ~sum(str_count(.x, paste(pt, collapse = "|")))))
returns
foo bar xyz gaz
1 0 12 0 4
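These per-column counts sum to 16, the asker's manual count. The warning and the 15 in the question both look consistent with passing the whole data frame to stringi: a data frame is a list, so it gets coerced to one string per column, and the two-element pattern vector is then recycled across those four strings instead of being applied to every value. A minimal sketch of the four AI counts on a flattened vector follows, assuming every label is preceded by `|` and no other label starts with `AI`. The helper `count_fixed` is my own name, and `tdf` is rebuilt compactly with `rep()`.

```r
library(stringi)

# Same data as in the question, written compactly with rep()
tdf <- data.frame(
  foo = c(rep("|ReviewNG-BB.2|ReviewNG-BB.3", 5), rep("|ReviewNG-BB.2|NG-BB.3", 6), "|TI"),
  bar = c(rep("|AI|BB.2", 7), rep("|AI|ReviewNG-BB.2", 4), "|AI"),
  xyz = c(rep("|ICV|NG-AI", 7), rep("|ReviewNG-ICV|TI|BB.2", 4), "|ICV"),
  gaz = c(rep("|BB.3|ReviewNG-AI|NG-TI", 7), rep("|NG-BB.2|ICV|AI|TI", 4), "|BB.2"),
  stringsAsFactors = FALSE
)

# Flatten to an atomic character vector so stringi has nothing to coerce
x <- unlist(tdf, use.names = FALSE)

# Hypothetical helper: total literal matches of one pattern over all cells
count_fixed <- function(pattern) sum(stri_count_fixed(x, pattern))

total     <- count_fixed("AI")            # 30: every occurrence, any form
review_ng <- count_fixed("|ReviewNG-AI")  # 7
ng        <- count_fixed("|NG-AI")        # 7: leading '|' excludes ReviewNG-AI
pure      <- count_fixed("|AI")           # 16: covers both |AI and |AI|
```

Anchoring each pattern on the leading `|` is what avoids the double count between NG-AI and ReviewNG-AI, and the three specific forms add up to the total: 7 + 7 + 16 = 30.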
