Filtering a data frame based on row conditions

I came up with the following example to illustrate my problem.

Suppose there are 5 balls:

Red
Blue
Green
Yellow
Orange


In general there are 5! = 120 ways to arrange these balls (n!). I can enumerate all of these arrangements below:

library(combinat)
library(dplyr)

my_list = c("Red", "Blue", "Green", "Yellow", "Orange")

d = permn(my_list)

all_combinations  = as.data.frame(matrix(unlist(d), ncol = 120)) %>%
  setNames(paste0("col", 1:120))


all_combinations[,1:5]

    col1   col2   col3   col4   col5
1    Red    Red    Red    Red Orange
2   Blue   Blue   Blue Orange    Red
3  Green  Green Orange   Blue   Blue
4 Yellow Orange  Green  Green  Green
5 Orange Yellow Yellow Yellow Yellow

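My question:

Suppose I want to filter this list by the following conditions:

The "Red" ball may be in the first or second position (from left to right)
There must be at least 2 positions between the "Blue" ball and the "Green" ball
The "Yellow" ball cannot be in the last position

I then tried to filter the data above according to these 3 conditions: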

# attempt to write first condition
    cond_1 <- all_combinations[which(all_combinations[1,]== "Red" || all_combinations[2,] == "Red"), ]

#not sure how to write the second condition
    
 # attempt to write the third condition   
    cond_3 <- data_frame_version[which(data_frame_version[5,] !== "Yellow" ), ]

# if everything worked, an "anti join" style statement could be written to remove "cond_1, cond_2, cond_3" from the original data?

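But this does not work at all: the first and third conditions each return a data frame with only 4 rows in every column.

Can someone show me how to correctly filter "all_combinations" using the 3 filters above?

Note: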

The following code can transpose the original data:

 library(data.table)

    tpose = transpose(all_combinations)

    df = tpose
    
#group every 5 rows by the same id to identify unique combinations

    bloc_len <- 5
    
    df$bloc <- 
        rep(seq(1, 1 + nrow(df) %/% bloc_len), each = bloc_len, length.out = nrow(df))
    
   
 head(df)

      V1     V2     V3     V4     V5 bloc
1    Red   Blue  Green Yellow Orange    1
2    Red   Blue  Green Orange Yellow    1
3    Red   Blue Orange  Green Yellow    1
4    Red Orange   Blue  Green Yellow    1
5 Orange    Red   Blue  Green Yellow    1
6 Orange    Red   Blue Yellow  Green    2

A uj5u.com user replied:

You can do:

library(tidyverse)
tpose %>%
  mutate(blue_delete = case_when(V1 == "Blue" & V2 == "Green" ~ TRUE,
                                 V1 == "Blue" & V3 == "Green" ~ TRUE,
                                 V2 == "Blue" & V3 == "Green" ~ TRUE,
                                 V3 == "Blue" & V4 == "Green" ~ TRUE,
                                 V4 == "Blue" & V5 == "Green" ~ TRUE,
                                 TRUE ~ FALSE)) %>%
  filter(V3 != "Red" & V4 != "Red" & V5 != "Red",
         V5 != "Yellow",
         blue_delete == FALSE) %>%
  select(-blue_delete)
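
Listing column pairs by hand gets long quickly, and both orders (Blue before Green and Green before Blue) have to be covered. As a rough sketch of an alternative, the position of each color can be computed per row and the three conditions written as arithmetic on those positions. This assumes that "at least 2 positions between" means a position difference of at least 3. The names all_perms and pos_of are hypothetical helpers introduced here; all_perms stands in for combinat::permn, and its ordering may differ.

```r
# Hypothetical helper: all permutations of a vector, one per list element
# (a base R stand-in for combinat::permn; the ordering may differ).
all_perms <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (p in all_perms(x[-i])) out[[length(out) + 1]] <- c(x[i], p)
  }
  out
}

balls <- c("Red", "Blue", "Green", "Yellow", "Orange")

# One permutation per row; columns are named V1..V5 by default
perm_df <- as.data.frame(do.call(rbind, all_perms(balls)))

# Hypothetical helper: column index of a color in every row
pos_of <- function(df, colour) apply(df == colour, 1, which)

keep <- pos_of(perm_df, "Red") <= 2 &                                   # condition 1
  abs(pos_of(perm_df, "Blue") - pos_of(perm_df, "Green")) >= 3 &        # condition 2 (assumed reading)
  pos_of(perm_df, "Yellow") != ncol(perm_df)                            # condition 3

filtered <- perm_df[keep, ]
nrow(filtered)  # 10
```

If the rule is instead read as a position difference of at least 2, changing `>= 3` to `>= 2` keeps 22 rows.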

A uj5u.com user replied:

Here is a scalable tidyverse solution.

First, let's turn the data into a tibble with 120 rows, one for each arrangement of the balls.

library(tidyverse)
library(combinat)
data = my_list %>% 
  permn() %>%
  map(~ set_names(.x, paste0("ball", 1:5))) %>%
  do.call(bind_rows, args = .) %>%
  mutate(id = row_number())

Our data:

# A tibble: 120 x 6
   ball1  ball2  ball3  ball4  ball5     id
   <chr>  <chr>  <chr>  <chr>  <chr>  <int>
 1 Red    Blue   Green  Yellow Orange     1
 2 Red    Blue   Green  Orange Yellow     2
 3 Red    Blue   Orange Green  Yellow     3
 4 Red    Orange Blue   Green  Yellow     4
 5 Orange Red    Blue   Green  Yellow     5
 6 Orange Red    Blue   Yellow Green      6
 7 Red    Orange Blue   Yellow Green      7
 8 Red    Blue   Orange Yellow Green      8
 9 Red    Blue   Yellow Orange Green      9
10 Red    Blue   Yellow Green  Orange    10
# ... with 110 more rows

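The key idea of this solution is to convert the data to long format, which makes checking each condition trivial. Afterwards, we can pivot it back to wide format.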

data %>%
  pivot_longer(-id) %>%
  mutate(ball_number = as.numeric(str_extract(name, "[1-5]"))) %>%
  group_by(id) %>%
  filter(
    # Condition 1
    ball_number[value == "Red"] %in% c(1, 2),
    # Condition 2
    abs(ball_number[value == "Blue"] - ball_number[value == "Green"]) >= 3,
    # Condition 3
    ball_number[value == "Yellow"] != 5
  ) %>%
  select(-ball_number) %>% 
  pivot_wider(values_from = "value", names_from = "name")

The output shows that 10 permutations remain:

# A tibble: 10 x 6
# Groups:   id [10]
      id ball1 ball2 ball3  ball4  ball5 
   <int> <chr> <chr> <chr>  <chr>  <chr> 
 1     8 Red   Blue  Orange Yellow Green 
 2     9 Red   Blue  Yellow Orange Green 
 3    32 Red   Green Yellow Orange Blue  
 4    33 Red   Green Orange Yellow Blue  
 5    48 Green Red   Orange Yellow Blue  
 6    49 Green Red   Yellow Orange Blue  
 7    50 Green Red   Yellow Blue   Orange
 8   111 Blue  Red   Yellow Green  Orange
 9   112 Blue  Red   Yellow Orange Green 
10   113 Blue  Red   Orange Yellow Green

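The improvement this solution offers is that, thanks to our ball_number variable, every condition you want to check becomes very simple. With more balls, you could easily extend this solution to more complex conditions, for example that the first 5 balls are red, or that the positions of the blue and green balls add up to 7.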
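
To illustrate how a position-based condition such as "blue plus green equals 7" might look, here is a minimal base R sketch. It does not use the tidyverse pipeline above; instead it assumes a table (called positions here, a name introduced for illustration) with one column per color holding that color's position, built with expand.grid.

```r
balls <- c("Red", "Blue", "Green", "Yellow", "Orange")

# One column per color holding its position (1..5): 5^5 candidate rows
grid <- expand.grid(rep(list(seq_along(balls)), length(balls)))
names(grid) <- balls

# Keep only rows where the five positions are all different,
# i.e. rows that describe a valid arrangement of the balls
is_perm <- apply(grid, 1, function(r) !anyDuplicated(r))
positions <- grid[is_perm, ]
nrow(positions)  # 120

# "Blue plus Green equals 7" becomes plain arithmetic on the columns
sum_is_7 <- subset(positions, Blue + Green == 7)
nrow(sum_is_7)  # 24

# The three original conditions, reading condition 2 as a difference >= 3
original <- subset(positions, Red <= 2 & abs(Blue - Green) >= 3 & Yellow != 5)
nrow(original)  # 10
```

Because each color is its own numeric column, any new rule is just another logical expression inside subset().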

A uj5u.com user replied:

Here is what you can do. I know it is not the prettiest or most optimized solution you will find. But it works!

# byrow = TRUE keeps each permutation on its own row
all_combinations  = as.data.frame(matrix(unlist(d), ncol = 5, byrow = TRUE)) %>%
  setNames(paste0("col", 1:5))

# Condition 1: Red in the first or second position
cond_1 <- all_combinations %>%
  filter(col1 == "Red" | col2 == "Red")

# Condition 2: Blue and Green at least 3 positions apart
# (2 balls between them), in either order
cond_2 <- cond_1 %>%
  mutate(cond = (col1 == 'Blue' & col4 == 'Green') |
           (col1 == 'Blue' & col5 == 'Green') |
           (col2 == 'Blue' & col5 == 'Green') |
           (col1 == 'Green' & col4 == 'Blue') |
           (col1 == 'Green' & col5 == 'Blue') |
           (col2 == 'Green' & col5 == 'Blue')) %>%
  filter(cond)

# Condition 3: Yellow not in the last position
cond_3 <- cond_2 %>%
  filter(col5 != "Yellow")

Output:

    col1  col2   col3   col4   col5 cond
1    Red  Blue Orange Yellow  Green TRUE
2    Red  Blue Yellow Orange  Green TRUE
3    Red Green Yellow Orange   Blue TRUE
4    Red Green Orange Yellow   Blue TRUE
5  Green   Red Orange Yellow   Blue TRUE
6  Green   Red Yellow Orange   Blue TRUE
7  Green   Red Yellow   Blue Orange TRUE
8   Blue   Red Yellow  Green Orange TRUE
9   Blue   Red Yellow Orange  Green TRUE
10  Blue   Red Orange Yellow  Green TRUE

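A uj5u.com user replied:

If you don't care much about the data.frame structure, my preferred approach is to keep each outcome as a member of the list (i.e. your d variable) and use sapply() with a function that checks whether that outcome satisfies all the conditions.

Observe: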

library(combinat)

my_list <- c("Red", "Blue", "Green", "Yellow", "Orange")
my_list_perm <- combinat::permn(my_list) 

# This function examines one particular outcome of the trial, e.g. outcome = ["Blue", "Orange", "Red", "Green", "Yellow"]
test_conditions <- function(outcome) {
  
  # Condition 1
  condition_1 <- "Red" %in% outcome[c(1,2)]
  
  # Condition 2
  condition_2 <- base::abs(base::which(outcome == "Blue") - base::which(outcome == "Green")) >= 2
  
  # Condition 3
  condition_3 <- base::which(outcome == "Yellow") != base::length(outcome)
  
  all <- condition_1 && condition_2 && condition_3
  
  return(all)
}

my_list_matches <- base::which(base::sapply(my_list_perm, test_conditions)) # applies the function to each list element (which itself is an outcome)

print(my_list_matches) # displays which trials / outcomes satisfied all conditions

#>  [1]   6   7   8   9  10  12  19  22  29  31  32  33  34  35  41  48  49  50 111 112 113 120

Created on 2022-01-04 by the reprex package (v1.0.0)

You can then use the matching indices to filter the original list.
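
As a small sketch of that last step, the indices returned by which(sapply(...)) can be used to subset the list directly, and the result can be bound into a data frame if a table is preferred. The three outcomes below are a hand-picked toy list rather than the full set of 120 permutations, and the names outcomes, matched and matched_df are assumptions made for this example.

```r
# Same three conditions as above, written as a single predicate
test_conditions <- function(outcome) {
  "Red" %in% outcome[1:2] &&
    abs(which(outcome == "Blue") - which(outcome == "Green")) >= 2 &&
    which(outcome == "Yellow") != length(outcome)
}

# Toy list of outcomes (stand-in for my_list_perm)
outcomes <- list(
  c("Red", "Blue", "Orange", "Yellow", "Green"),
  c("Blue", "Green", "Red", "Yellow", "Orange"),
  c("Green", "Red", "Yellow", "Blue", "Orange")
)

# Indices of the outcomes that satisfy all conditions
matches <- which(sapply(outcomes, test_conditions))
matches  # 1 3

# Filter the list by index
matched <- outcomes[matches]

# Optionally bind the matching outcomes into a data frame, one row each
matched_df <- as.data.frame(do.call(rbind, matched))
names(matched_df) <- paste0("col", 1:5)
matched_df
```

With the full list, the same two lines (subset by index, then rbind) apply unchanged to my_list_perm and my_list_matches.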

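A uj5u.com user replied:

Maybe I misread the question, but as far as I can see, none of the answers shows a solution in which there are 2 columns between the colors in step 2 of the question.

I took the liberty of testing the data and found that only when you use "Yellow" and "Orange" can you find a filter that meets your requirements (as far as I know).

This is not a general answer, and it is not actually correct, because "Yellow" ends up in the last row, which breaks the rule, but:

With the last row already accounted for, a distance of 2 between the colors reduces the problem to a 4-column problem. A distance of 2 can therefore only be achieved between columns 1 and 4. This leads to 4 assumptions:

Column 1 needs to be "Green" or "Blue"
Column 2 needs to be "Red"
Column 3 should not be "Green" or "Blue"
Column 4 should again be "Green" or "Blue", but not the same as column 1

Here is the code I came up with. It is not pretty and, as explained, "Green" and "Blue" are swapped for "Yellow" and "Orange", but I think it works.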

library(combinat)
library(tidyverse)

my_list = c("Red", "Blue", "Green", "Yellow", "Orange")

d = permn(my_list)

all_combinations  = as.data.frame(matrix(unlist(d), ncol = 5)) %>%
  setNames(paste0("col", 1:5))

`%!in%` <- Negate(`%in%`)

combis <- all_combinations %>% 
  filter(col1 %in% c("Yellow", "Orange"), 
         col2 == "Red", 
         !col3 %in% c("Yellow", "Orange"), 
         col5 == "Yellow") 

results <- vector()
for(i in seq_along(combis[,1])){
  
  if(combis[i,][1] %!in% c(combis[i,][4], "Red", "Green", "Blue")){
    results <- combis[i,] 
  }
}

results

    col1 col2  col3   col4   col5
3 Yellow  Red Green Orange Yellow

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/403210.html
