Finding the largest number inside the strings of an R data frame column

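For each cell in a particular column of a data frame (simply called df here), I want to find the largest and smallest numbers embedded in that cell's string. Any commas in the cell have no special meaning. The numbers must not be percentages, so if 50% appears, for example, the 50 is excluded from consideration. The relevant column of the data frame looks like this: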

| particular_col_name | 
| ------------------- | 
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15|

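So two new columns should be created, titled "maximum_number" and "minimum number"; for the first row they should hold 20 and 5 respectively. Note that 70 is excluded because it is followed by a % sign. Likewise, the second row should put 40 and 4 in the new columns.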

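I have tried several approaches inside dplyr's mutate (for example str_extract_all, regmatches, strsplit), but they either throw an error (particularly about the input column particular_col_name) or do not return the data in a form from which the maximum and minimum can easily be identified.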

Any help with this would be much appreciated.

A uj5u.com user replied:

library(tidyverse)

tibble(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15"
  )
) %>%
  mutate(
    numbers = particular_col_name %>% map(~ {
      # Drop percentages such as "70%" first, then extract the remaining digit runs
      .x %>% str_remove_all("[0-9]+%") %>% str_extract_all("[0-9]+") %>% simplify() %>% as.numeric()
      }),
    min = numbers %>% map_dbl(min),
    max = numbers %>% map_dbl(max)
  ) %>%
  select(-numbers)
#> # A tibble: 2 x 3
#>   particular_col_name                                                  min   max
#>   <chr>                                                              <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e…     5    20
#> 2 Second_Row_50%, number40. Number 4. number_15                          4    40

Created on 2022-02-22 by the reprex package (v2.0.0)

A uj5u.com user replied:

We can combine str_extract_all with sapply:

library(stringr)

# "[0-9]+" matches every run of digits, including those followed by "%"
df$min <- sapply(str_extract_all(df$particular_col_name, "[0-9]+"), function(x) min(as.integer(x)))
df$max <- sapply(str_extract_all(df$particular_col_name, "[0-9]+"), function(x) max(as.integer(x)))

  particular_col_name                                                          min   max
  <chr>                                                                      <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70%     5    70
2 Second_Row_50%, number40. Number 4. number_15                                  4    50
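
The output above still reports 70 and 50, which are the percentages the question asks to skip. One possible way to honor that requirement in base R is a negative lookahead, which needs perl = TRUE. This is only a minimal sketch: the helper name extract_non_percent is hypothetical, the column names follow the question's wording with an underscore added, and it assumes the % sign directly follows the digits with no space in between.

```r
# Hypothetical helper: returns, for each string, the integers that are
# not immediately followed by "%". The lookahead "(?![0-9]*%)" also blocks
# partial matches such as the "7" in "70%".
extract_non_percent <- function(x) {
  matches <- regmatches(x, gregexpr("[0-9]+(?![0-9]*%)", x, perl = TRUE))
  lapply(matches, as.integer)
}

df <- data.frame(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15"
  )
)

nums <- extract_non_percent(df$particular_col_name)

# Rows without any usable number get NA instead of the Inf/-Inf (and warning)
# that min()/max() produce on empty input.
df$minimum_number <- vapply(nums, function(v) if (length(v)) min(v) else NA_integer_, integer(1))
df$maximum_number <- vapply(nums, function(v) if (length(v)) max(v) else NA_integer_, integer(1))
```

With the two example rows this gives 5 and 20 for the first row and 4 and 40 for the second, matching what the question expects.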

