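I spent the weekend searching for a solution to this problem but could not find one. I did eventually write a script, but I think it is much longer than it needs to be, and there must be a quicker approach using a loop or a function.
I want to create a function that loops over each row of data frame 1 (which is derived from a frequency table). The function would use filter() and sample_n() to select records from data frame 2, so data frame 1 serves as the filtering and sampling criteria for data frame 2.
Please see the code below, which does not return the records I am looking for.
The correct result would return 1 record from group A (1910), 3 records from group A (1930), 2 from group B (1930), 1 from group C (1940), 1 from group D (1940), and 2 from group C (1930), each drawn at random.
Cheers,
Daniel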
require(dplyr)
FilterA <- c("A","B","A","B","C","D","C")
FilterB <- c(1910,1920,1930,1930,1940,1940,1930)
Frequency <- c(1,0,3,2,1,1,2)
df1 <- data.frame(FilterA, FilterB, Frequency)
df1$num <- paste(FilterA, FilterB, sep=" ")
ID <- c("A", "A", "A", "A", "A", "A", "A", "A",
        "B", "B", "B", "B",
        "C", "C", "C", "C", "C", "C",
        "D", "D", "D")
Year <- c(1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
          1930, 1920, 1930, 1930,
          1910, 1940, 1910, 1910, 1930, 1930,
          1930, 1940, 1940)
df2 <- data.frame(ID, Year)
case.control <- function(datF1, datF2, na.rm=TRUE, ...){
  ID_list <- unique(datF1$num)
  for (i in seq_along(ID_list)){
    func <- filter(datF2, Year == datF1$FilterB & ID == datF1$FilterA) %>% sample_n(datF1$Frequency)
    func
  }
}
x <- case.control(df1, df2)
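uj5u.com user reply:
Thanks for providing a reproducible example. If you do want to go row by row:
First, I tidied up some of your code; nothing really changes here: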
# Your code----
library(dplyr) # needed for %>% and filter() below
id <- c("A", "B", "A", "B", "C", "D", "C")
year <- c(1910, 1920, 1930, 1930, 1940, 1940, 1930)
frequency <- c(1, 0, 3, 2, 1, 1, 2)
df_1 <- data.frame(id,
  year,
  frequency,
  row.names = NULL
)
df_1$num <- paste(id, year)
# Drop criteria rows that request zero records
df_1 <- df_1 %>%
  filter(frequency != 0)
id <- c(
  "A", "A", "A", "A", "A", "A", "A", "A",
  "B", "B", "B", "B",
  "C", "C", "C", "C", "C", "C",
  "D", "D", "D"
)
year <- c(
  1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
  1930, 1920, 1930, 1930,
  1910, 1940, 1910, 1910, 1930, 1930,
  1930, 1940, 1940
)
df_2 <- data.frame(id, year)
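Now, on to actually drawing the random samples, using lapply(). You could use apply() to iterate over the rows of a data frame, but I personally find apply() confusing, so I turned df_1 into a list in which each row is its own element.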
library(dplyr)
list_1 <- split(df_1, seq(nrow(df_1)))
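To make the split() step concrete, here is a minimal sketch on a hypothetical two-row data frame (df_small and rows are illustrative names, not part of the code above). It only shows the shape of the result: a named list holding one single-row data frame per original row, which is what lets lapply() hand each row to a function in turn.

```r
# Hypothetical miniature of df_1, used only to show what split() returns
df_small <- data.frame(
  id = c("A", "B"),
  year = c(1910, 1930),
  frequency = c(1, 2)
)

# seq(nrow(df_small)) gives one grouping value per row,
# so every row ends up in its own list element
rows <- split(df_small, seq(nrow(df_small)))

length(rows)           # one element per row of df_small
names(rows)            # "1" "2"
rows[["2"]]$frequency  # the frequency stored in the second row
```

Each element is still a data frame, so x$year, x$id and x$frequency inside the anonymous function refer to single values.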
Then I used lapply() with the following function to iterate over each row:
# Option 1: lapply()----
random_records <- lapply(list_1, function(x) {
  df_records <- df_2 %>%
    # Matching up the years and id in df_2
    filter(year == x$year & id == x$id) %>%
    # Using the frequency with slice_sample(), sample_n() is also fine
    slice_sample(n = x$frequency)
})
# Then bind the list back together again into a dataframe
random_records <- bind_rows(random_records)
Alternatively, the option I personally prefer is purrr's map_df(), because it returns a data frame right away.
# Option 2: purrr's map_df()
# I think this option is the neatest, because it returns a df immediately
library(purrr)
random_records <- map_df(list_1, function(x) {
  df_records <- df_2 %>%
    filter(year == x$year & id == x$id) %>%
    slice_sample(n = x$frequency)
})
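For comparison, the loop from the question can probably be repaired along the same lines. The sketch below is one reading of what went wrong, not a confirmed diagnosis: the original filter() compares against the whole FilterB and FilterA columns rather than row i, and the for loop never collects or returns its results. The name case_control and its arguments criteria and records are my own choices; the data are the question's df1 and df2.

```r
library(dplyr)

# Data from the question
df1 <- data.frame(
  FilterA = c("A", "B", "A", "B", "C", "D", "C"),
  FilterB = c(1910, 1920, 1930, 1930, 1940, 1940, 1930),
  Frequency = c(1, 0, 3, 2, 1, 1, 2)
)
df2 <- data.frame(
  ID = c(rep("A", 8), rep("B", 4), rep("C", 6), rep("D", 3)),
  Year = c(1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
           1930, 1920, 1930, 1930,
           1910, 1940, 1910, 1910, 1930, 1930,
           1930, 1940, 1940)
)

# Hypothetical repaired version of the question's loop
case_control <- function(criteria, records) {
  # Skip criteria rows that request zero records
  criteria <- criteria[criteria$Frequency > 0, ]
  picks <- vector("list", nrow(criteria))
  for (i in seq_len(nrow(criteria))) {
    crit <- criteria[i, ]  # the single criteria row for this iteration
    picks[[i]] <- records %>%
      filter(Year == crit$FilterB, ID == crit$FilterA) %>%
      slice_sample(n = crit$Frequency)
  }
  # Return the collected samples as one data frame
  bind_rows(picks)
}

set.seed(1)  # only to make the random draw repeatable
x <- case_control(df1, df2)
nrow(x)  # 10, the sum of the requested frequencies
```

Which rows are drawn depends on the seed, but the number of rows per group is fixed by Frequency as long as df2 has enough matching records.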
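uj5u.com user reply:
Instead of handling this row by row, you can join the two data frames and randomly select rows based on the Frequency column for each unique combination of FilterA and FilterB.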
library(dplyr)
df1 %>%
  filter(Frequency > 0) %>%
  left_join(df2, by = c('FilterA' = 'ID', 'FilterB' = 'Year')) %>%
  group_by(FilterA, FilterB) %>%
  sample_n(first(Frequency))
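Because df2 in the example holds only the two join keys, the joined rows within a group look identical, which makes it hard to see which records were drawn. The sketch below adds a hypothetical record column (a row identifier that is not in the original data) before running the same join-and-sample pipeline, then compares the number of rows drawn per group with the requested Frequency. The names sampled, counts, drawn and wanted are illustrative.

```r
library(dplyr)

# Data from the question
df1 <- data.frame(
  FilterA = c("A", "B", "A", "B", "C", "D", "C"),
  FilterB = c(1910, 1920, 1930, 1930, 1940, 1940, 1930),
  Frequency = c(1, 0, 3, 2, 1, 1, 2)
)
df2 <- data.frame(
  ID = c(rep("A", 8), rep("B", 4), rep("C", 6), rep("D", 3)),
  Year = c(1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
           1930, 1920, 1930, 1930,
           1910, 1940, 1910, 1910, 1930, 1930,
           1930, 1940, 1940)
)

# Hypothetical row identifier, added only so drawn records can be told apart
df2$record <- seq_len(nrow(df2))

set.seed(42)  # only to make the random draw repeatable
sampled <- df1 %>%
  filter(Frequency > 0) %>%
  left_join(df2, by = c("FilterA" = "ID", "FilterB" = "Year")) %>%
  group_by(FilterA, FilterB) %>%
  sample_n(first(Frequency))

# Rows drawn per group next to the number requested
counts <- sampled %>%
  summarise(drawn = n(), wanted = first(Frequency), .groups = "drop")

all(counts$drawn == counts$wanted)  # TRUE when every group is fully served
```

Note that sample_n() stops with an error if a group has fewer matching rows than its Frequency, so this check is mainly useful on data where that cannot happen.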
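If df1 itself already holds all the information you need in the final output, you can use uncount directly to expand the df1 dataset.
This works for the shared example because df2 holds only ID and Year, both of which are already in df1.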
tidyr::uncount(df1, Frequency)
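As a rough illustration of what uncount() does here, the sketch below compares it with a base R equivalent that repeats each row index Frequency times. The names expanded and expanded_base are illustrative only.

```r
library(tidyr)

# Criteria data from the question
df1 <- data.frame(
  FilterA = c("A", "B", "A", "B", "C", "D", "C"),
  FilterB = c(1910, 1920, 1930, 1930, 1940, 1940, 1930),
  Frequency = c(1, 0, 3, 2, 1, 1, 2)
)

# uncount() repeats each row Frequency times and drops the Frequency column
expanded <- uncount(df1, Frequency)

# Base R equivalent: repeat each row index Frequency times
expanded_base <- df1[rep(seq_len(nrow(df1)), df1$Frequency), ]

nrow(expanded)  # 10, the sum of Frequency; the B 1920 row (Frequency 0) disappears
identical(expanded$FilterA, expanded_base$FilterA)  # same rows in the same order
```

No random draw is involved in this route, so it only fits cases where the sampled records carry nothing beyond the grouping columns.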
