Finding the characters before and after dollar amounts in a vector of text data in R

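I have a vector of text data (news articles). I am trying to scan the text for any monetary amounts, together with the text surrounding each amount. I managed to do this for the first element of the vector, but I am struggling to repeat the procedure for all of the data using loops and lists. I use str_extract_currencies from the strex package, which does a good job of detecting the numbers. It may also be possible with regular expressions, but I don't know how.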

library(dplyr)   # %>% and filter()
library(strex)   # str_extract_currencies()
library(stringr) # str_extract()

textdata <- data.frame(document = c(1,2),
                       txt = c("Outplay today announced its $7.3M series A fundraise from Sequoa Capital India. ..., which is poised to be a $5.59B market by 2023, is a huge opportunity for Outplay.", "India's leading digital care ecosystem for chronic condition management – has raised USD 5.7 million in funding led by US-based venture capital firm, W Health Ventures. The funding also saw participation from e-pharmacy Unicorn PharmEasy (a Threpsi Solutions Pvt Ltd brand), Merisis VP and existing investors Orios VP, Leo Capital, and others. With around 463 million people with diabetes and $1.13  billion with hypertension across the world"))

numbers <- str_extract_currencies(textdata$txt[1]) %>% 
  filter(curr_sym == '$')

# seq_len() also behaves correctly when no amounts were found (zero rows)
for (i in seq_len(nrow(numbers))) {
  print(stringr::str_extract(textdata$txt[1], paste0(".{0,20}", numbers$amount[i], ".{0,20}")))
}

finaldata <- data.frame(document = c(1,1,2),
                        money_related = c("oday announced its $7.3M series A fundraise",
                                          " is poised to be a $5.59B market by 2023, is",
                                          "with diabetes and $1.13  billion with hyper"))

A document may contain zero or more monetary amounts. I would like to store the results in a data.frame like this:

> finaldata
  document                                money_related
1        1  oday announced its $7.3M series A fundraise
2        1  is poised to be a $5.59B market by 2023, is
3        2  with diabetes and $1.13  billion with hyper

Thank you very much.

Answer from a uj5u.com user:

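Here is a tidyverse solution that does not need the {strex} package. You will probably have to run it against your real data and add a few more cases that may occur: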

library(tidyverse)

textdata %>% 
  rowwise(document) %>% 
  summarise(txt = str_extract_all(txt, ".{1,20}(\\${1}|USD)[0-9.]+\\s?[A-Za-z]?.{1,20}")) %>%
  unnest_longer(txt)

#> `summarise()` has grouped output by 'document'. You can override using the `.groups` argument.
#> # A tibble: 3 x 2
#> # Groups:   document [2]
#>   document txt                                             
#>      <dbl> <chr>                                           
#> 1        1 "today announced its $7.3M series A fundraise " 
#> 2        1 "h is poised to be a $5.59B market by 2023, is "
#> 3        2 "e with diabetes and $1.13  billion with hypert"

Created on 2022-01-21 by the reprex package (v2.0.1)
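
The pattern in this answer expects the digits to follow "$" or "USD" directly, which is why "USD 5.7 million" does not appear in the output. As one possible way to cover that case, here is a minimal base R sketch. The helper name extract_money_context, its arguments, and the shortened sample sentences are assumptions made for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answers): return every money
# mention together with up to `width` characters of context on each side.
extract_money_context <- function(txt, document = seq_along(txt), width = 20) {
  # "$" or "USD", an optional space, a number, optional whitespace,
  # and an optional word such as "M", "B", "million" or "billion".
  money <- "(\\$|USD)\\s?[0-9][0-9.,]*\\s*[A-Za-z]*"
  pattern <- paste0(".{0,", width, "}", money, ".{0,", width, "}")

  # One character vector of matches per document (possibly empty).
  hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))

  data.frame(
    document = rep(document, lengths(hits)),      # repeat each id once per match
    money_related = as.character(unlist(hits)),   # character(0) if nothing matched
    stringsAsFactors = FALSE
  )
}

# Shortened, ASCII-only sample sentences (assumed for illustration).
txt <- c(
  "Outplay today announced its $7.3M series A fundraise from investors.",
  "No amounts are mentioned here.",
  "The firm has raised USD 5.7 million in funding led by a venture capital firm."
)
result <- extract_money_context(txt)
print(result)
```

Documents without any amount simply contribute no rows. One limitation of this sketch: matches do not overlap, so a second amount that falls inside the trailing context of the first one is not reported separately.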

Answer from a uj5u.com user:

Just wrap your function in an lapply:

library(dplyr)
library(strex)
library(stringr)

textdata <- data.frame(document = c(1,2),
                    txt = c("Outplay today announced its $7.3M series A fundraise from Sequoa Capital India. ..., which is poised to be a $5.59B market by 2023, is a huge opportunity for Outplay.", "India's leading digital care ecosystem for chronic condition management – has raised USD 5.7 million in funding led by US-based venture capital firm, W Health Ventures. The funding also saw participation from e-pharmacy Unicorn PharmEasy (a Threpsi Solutions Pvt Ltd brand), Merisis VP and existing investors Orios VP, Leo Capital, and others. With around 463 million people with diabetes and $1.13  billion with hypertension across the world"))


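# Extract the "$" amounts from each document separately, then stack the results.
# seq_len() visits every row; lapply(nrow(textdata), ...) would only visit the single value 2.
numbers <- as.data.frame(bind_rows(lapply(seq_len(nrow(textdata)), function(x) {
  str_extract_currencies(textdata$txt[x]) %>%
    filter(curr_sym == "$") %>%
    mutate(string_num = x)  # record which document the amount came from
})))
# Keep up to 20 characters of context on each side of every amount.
numbers$string <- stringr::str_extract(numbers$string, paste0(".{0,20}", numbers$amount, ".{0,20}"))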

> numbers
  string_num                                       string curr_sym amount
1          1  oday announced its $7.3M series A fundraise        $   7.30
2          1  is poised to be a $5.59B market by 2023, is        $   5.59
3          2  with diabetes and $1.13  billion with hyper        $   1.13
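
Note that this answer pastes the numeric amount into a regular expression, so the "." in "7.3" matches any character. If that matters for your data, one alternative is to locate the amount literally and cut the window with substr(). The following is only a sketch in base R. The helper name context_window and the sample sentence are assumptions for illustration, and the amount still has to appear in the text exactly as it is written in `needle`.

```r
# Hypothetical helper (not from the original answers): return `needle` with
# up to `width` characters of context on each side. `needle` is matched
# literally (fixed = TRUE), so "." is not treated as a wildcard.
context_window <- function(text, needle, width = 20) {
  pos <- as.integer(regexpr(needle, text, fixed = TRUE))
  if (pos == -1L) return(NA_character_)  # needle does not occur in text

  start <- max(1L, pos - width)
  end <- min(nchar(text), pos + nchar(needle) - 1L + width)
  substr(text, start, end)
}

# Shortened sample sentence (assumed for illustration).
sentence <- "Outplay today announced its $7.3M series A fundraise from investors."
print(context_window(sentence, "7.3"))
```

The helper works on one string and one amount at a time, so it could be called row by row, for example with mapply() over the string and amount columns.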
