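I have a vector of text data (news articles). I am trying to scan the text for any monetary amount, together with the text surrounding it. I managed this for the first element of the vector, but I am struggling to repeat the procedure over all the data with loops and lists. I use str_extract_currencies from the strex package, which does a good job of detecting the numbers. A regular expression might also work, but I don't know how to write one.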
library(dplyr)
library(strex)
library(stringr)

textdata <- data.frame(document = c(1, 2),
                       txt = c("Outplay today announced its $7.3M series A fundraise from Sequoa Capital India. ..., which is poised to be a $5.59B market by 2023, is a huge opportunity for Outplay.", "India's leading digital care ecosystem for chronic condition management – has raised USD 5.7 million in funding led by US-based venture capital firm, W Health Ventures. The funding also saw participation from e-pharmacy Unicorn PharmEasy (a Threpsi Solutions Pvt Ltd brand), Merisis VP and existing investors Orios VP, Leo Capital, and others. With around 463 million people with diabetes and $1.13 billion with hypertension across the world"))

# Extract the dollar amounts found in the first document only
numbers <- str_extract_currencies(textdata$txt[1]) %>%
  filter(curr_sym == "$")

# Print each amount with up to 20 characters of context on either side
for (i in seq_len(nrow(numbers))) {
  print(stringr::str_extract(textdata$txt[1], paste0(".{0,20}", numbers$amount[i], ".{0,20}")))
}
finaldata <- data.frame(document = c(1, 1, 2),
                        money_related = c("oday announced its $7.3M series A fundraise",
                                          " is poised to be a $5.59B market by 2023, is",
                                          "with diabetes and $1.13 billion with hyper"))
A document may contain zero or more monetary amounts. I would like to store the results in a data.frame like this:
> finaldata
  document                                money_related
1        1  oday announced its $7.3M series A fundraise
2        1  is poised to be a $5.59B market by 2023, is
3        2   with diabetes and $1.13 billion with hyper
Thank you very much.
Answer from a uj5u.com user:
Here is a tidyverse solution that does not use the {strex} package. You will probably need to run it against your real data and add a few other possible cases:
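library(tidyverse)

textdata %>%
  rowwise(document) %>%
  # 1 to 20 characters of context, "$" or "USD", the number, an optional unit letter, then 1 to 20 more characters
  summarise(txt = str_extract_all(txt, ".{1,20}(\\${1}|USD)[0-9.]+\\s?[A-z]?.{1,20}")) %>%
  unnest_longer(txt)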
#> `summarise()` has grouped output by 'document'. You can override using the `.groups` argument.
#> # A tibble: 3 x 2
#> # Groups: document [2]
#> document txt
#> <dbl> <chr>
#> 1 1 "today announced its $7.3M series A fundraise "
#> 2 1 "h is poised to be a $5.59B market by 2023, is "
#> 3 2 "e with diabetes and $1.13 billion with hypert"
Created on 2022-01-21 by the reprex package (v2.0.1)
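To cover the "other possible cases" mentioned above, such as `USD 5.7 million` (a space after the currency code) or two amounts that sit within 20 characters of each other, one option is to locate each amount first and then cut a window around it. The sketch below is only an illustration: the helper name `extract_money_context`, the pattern `amount_pattern`, and the shortened sample texts are assumptions, not part of the original answer.

```r
library(stringr)

# Assumed pattern: "$" or "USD" (optionally followed by a space), then a number
# such as 7.3, 5.59 or 1,200.50. Unit suffixes (M, B, million) land in the context.
amount_pattern <- "(?:\\$|USD\\s?)[0-9]+(?:[.,][0-9]+)*"

# Hypothetical helper: one row per amount, with up to `width` characters on each side.
extract_money_context <- function(document, txt, width = 20) {
  locs <- str_locate_all(txt, amount_pattern)  # one start/end matrix per text
  pieces <- lapply(seq_along(txt), function(i) {
    loc <- locs[[i]]
    if (nrow(loc) == 0) return(character(0))   # document without any amount
    str_sub(txt[i],
            pmax(loc[, "start"] - width, 1),   # clamp at the start of the text
            loc[, "end"] + width)              # str_sub() truncates at the end
  })
  data.frame(document = rep(document, lengths(pieces)),
             money_related = as.character(unlist(pieces)))
}

# Shortened, made-up sample texts (not the data from the question)
sample_txt <- c("Raised USD 5.7 million, then $1.13 billion more.",
                "No amounts in this one.")
result <- extract_money_context(document = c(1, 2), txt = sample_txt)
print(result)
```

Because each window is cut independently, two amounts that are close together each get their own row. A single `str_extract_all` pattern with surrounding context cannot do that, since its matches never overlap.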
Answer from a uj5u.com user:
You don't need an explicit loop here: str_extract_currencies() accepts the whole character vector, and its string_num column records which element each amount came from:
library(dplyr)
library(strex)
library(stringr)

textdata <- data.frame(document = c(1, 2),
                       txt = c("Outplay today announced its $7.3M series A fundraise from Sequoa Capital India. ..., which is poised to be a $5.59B market by 2023, is a huge opportunity for Outplay.", "India's leading digital care ecosystem for chronic condition management – has raised USD 5.7 million in funding led by US-based venture capital firm, W Health Ventures. The funding also saw participation from e-pharmacy Unicorn PharmEasy (a Threpsi Solutions Pvt Ltd brand), Merisis VP and existing investors Orios VP, Leo Capital, and others. With around 463 million people with diabetes and $1.13 billion with hypertension across the world"))

# One call over the whole txt column; keep only the dollar amounts
numbers <- str_extract_currencies(textdata$txt) %>%
  filter(curr_sym == "$")

# Replace each full text with the amount plus up to 20 characters of context on either side
numbers$string <- stringr::str_extract(numbers$string, paste0(".{0,20}", numbers$amount, ".{0,20}"))
> numbers
  string_num                                       string curr_sym amount
1          1  oday announced its $7.3M series A fundraise        $   7.30
2          1  is poised to be a $5.59B market by 2023, is        $   5.59
3          2   with diabetes and $1.13 billion with hyper        $   1.13
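If the result should look like the `finaldata` frame from the question, `string_num` can serve as an index into the document ids. A minimal sketch, assuming `numbers` has the columns printed above; it is rebuilt by hand here so the snippet runs without strex, and `document_ids` is a stand-in name for `textdata$document`:

```r
# Stand-in for the `numbers` data frame printed above (values copied from that output)
numbers <- data.frame(
  string_num = c(1, 1, 2),
  string = c("oday announced its $7.3M series A fundraise",
             " is poised to be a $5.59B market by 2023, is",
             "with diabetes and $1.13 billion with hyper"),
  curr_sym = "$",
  amount = c(7.30, 5.59, 1.13)
)

# Stand-in for textdata$document
document_ids <- c(1, 2)

# string_num is the position of the source text, so it indexes the id vector
finaldata <- data.frame(
  document = document_ids[numbers$string_num],
  money_related = numbers$string
)
print(finaldata)
```

Documents without any amount produce no rows in `numbers`, so they are simply absent from `finaldata`.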
