I have a dataset made up of average values over specific date ranges. See the example dataset below:
structure(list(startdate = structure(c(14951, 14958, 14965, 14978,
14985, 14992), class = "Date"), enddate = structure(c(14957,
14964, 14971, 14985, 14992, 14999), class = "Date"), Conc = c(5.873,
14.591, 8.854, NA, 20.228, 74.57)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
From the dataset above, I want to generate a new dataset of daily data. This is the new dataset I am after:
structure(list(Date = structure(c(14951, 14952, 14953, 14954,
14955, 14956, 14957, 14958, 14959, 14960, 14961, 14962, 14963,
14964, 14965, 14966, 14967, 14968, 14969, 14970, 14971, 14972,
14973, 14974, 14975, 14976, 14977, 14978, 14979, 14980, 14981,
14982, 14983, 14984, 14985, 14986, 14987, 14988, 14989, 14990,
14991, 14992, 14993, 14994, 14995, 14996, 14997, 14998), class = "Date"),
Conc = c(5.873, 5.873, 5.873, 5.873, 5.873, 5.873, 5.873,
14.591, 14.591, 14.591, 14.591, 14.591, 14.591, 14.591, 8.854,
8.854, 8.854, 8.854, 8.854, 8.854, 8.854, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.228, 20.228, 20.228,
20.228, 20.228, 20.228, 20.228, 74.57, 74.57, 74.57, 74.57,
74.57, 74.57, 74.57)), row.names = c(NA, -48L), class = c("tbl_df",
"tbl", "data.frame"))
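I can do the reverse (compute averages over specific date ranges), but I don't know how to go in this direction. I have a large dataset, so being able to do this would be a lifesaver. Can anyone help? Thanks in advance.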
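Answer:
We first create a data.frame containing every day between the first startdate and the last enddate. We then join the data.frame of averages onto it and group the rows into runs, where each run begins at a row with a non-missing Conc. Within each group, we check whether the date is on or before the group's first enddate; if it is, the row takes the group's first Conc value, otherwise it becomes NA.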
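library(dplyr)
library(tidyr)
# df1 is the data frame of averages from the question.
# Build one row per day from the first startdate to the last enddate.
data.frame(startdate = seq(min(df1$startdate), max(df1$enddate), by = 1)) %>%
  # Attach enddate and Conc to the days on which an interval starts.
  left_join(df1, by = "startdate") %>%
  # Start a new group at every non-missing Conc.
  group_by(cumsum(!is.na(Conc))) %>%
  # Keep the group's value up to its enddate; later days become NA.
  mutate(Conc = ifelse(startdate <= first(enddate), first(Conc), NA)) %>%
  ungroup() %>%
  select(Date = startdate, Conc)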
This returns:
Date Conc
1 2010-12-08 5.873
2 2010-12-09 5.873
3 2010-12-10 5.873
4 2010-12-11 5.873
5 2010-12-12 5.873
6 2010-12-13 5.873
7 2010-12-14 5.873
8 2010-12-15 14.591
9 2010-12-16 14.591
10 2010-12-17 14.591
11 2010-12-18 14.591
12 2010-12-19 14.591
13 2010-12-20 14.591
14 2010-12-21 14.591
15 2010-12-22 8.854
16 2010-12-23 8.854
17 2010-12-24 8.854
18 2010-12-25 8.854
19 2010-12-26 8.854
20 2010-12-27 8.854
21 2010-12-28 8.854
22 2010-12-29 NA
23 2010-12-30 NA
24 2010-12-31 NA
25 2011-01-01 NA
26 2011-01-02 NA
27 2011-01-03 NA
28 2011-01-04 NA
29 2011-01-05 NA
30 2011-01-06 NA
31 2011-01-07 NA
32 2011-01-08 NA
33 2011-01-09 NA
34 2011-01-10 NA
35 2011-01-11 20.228
36 2011-01-12 20.228
37 2011-01-13 20.228
38 2011-01-14 20.228
39 2011-01-15 20.228
40 2011-01-16 20.228
41 2011-01-17 20.228
42 2011-01-18 74.570
43 2011-01-19 74.570
44 2011-01-20 74.570
45 2011-01-21 74.570
46 2011-01-22 74.570
47 2011-01-23 74.570
48 2011-01-24 74.570
49 2011-01-25 74.570
Note: your expected output appears to be missing the last value, for 2011-01-25.
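The grouping step in this answer relies on cumsum(!is.na(Conc)) to label runs of rows. As a minimal base R sketch of that idea, the toy vector below (made up for illustration, not taken from the question's data) mimics what Conc looks like after the join: a value on each interval's first day and NA on the days that follow. The names conc, group_id, and filled are illustrative only.

```r
# Toy stand-in for Conc after the join: a value on each interval's
# first day, NA on the remaining days. (Illustrative data only.)
conc <- c(5.873, NA, NA, 14.591, NA, NA, NA, 8.854, NA)

# !is.na(conc) is TRUE exactly where a new run starts, so the running
# sum gives every row in the same run the same label.
group_id <- cumsum(!is.na(conc))
print(group_id)
# 1 1 1 2 2 2 2 3 3

# Within each run, repeat the first element (the interval's value).
filled <- ave(conc, group_id, FUN = function(x) x[1])
print(filled)
```

A row whose own Conc is NA does not start a new run, which is why the answer also compares each date against first(enddate) before filling in the value.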
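Answer:
Here is a tidyverse solution that uses tidyr::separate_rows() together with string manipulation, and that needs no joining or grouping. However, for very long time windows, where startdate and enddate are far apart, cramming each date sequence into a single string may prove computationally inefficient, and could even run into memory limits.
Solution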
library(tidyverse)
# ...
# Code to generate your first dataframe 'dataset_1'.
# ...
results <- dataset_1 %>%
  # Create a 'Date' column, where each value is a string enumerating each day (as an
  # integer) between 'startdate' and 'enddate':
  #   "14951 14952 14953 14954 14955 14956 14957"
  #   "14958 14959 14960 14961 14962 14963 14964"
  #   ...
  rowwise() %>% mutate(Date = paste(startdate:enddate, collapse = " ")) %>%
  # Use 'tidyr::separate_rows()' to pivot each day into its own row.
  separate_rows(Date, sep = " ", convert = TRUE) %>%
  # Format each day as a 'Date' object.
  mutate(Date = structure(Date, class = "Date")) %>%
  # Format the dataset as desired.
  select(Date, Conc)
# View results.
print(results, n = Inf)
Results
Given a dataset_1 like the one you posted
dataset_1 <- structure(
  list(
    startdate = structure(
      c(14951, 14958, 14965, 14978, 14985, 14992),
      class = "Date"
    ),
    enddate = structure(
      c(14957, 14964, 14971, 14985, 14992, 14999),
      class = "Date"
    ),
    Conc = c(5.873, 14.591, 8.854, NA, 20.228, 74.57)
  ),
  row.names = c(NA, -6L),
  class = c("tbl_df", "tbl", "data.frame")
)
this solution should produce the following results:
# A tibble: 45 x 2
Date Conc
<date> <dbl>
1 2010-12-08 5.87
2 2010-12-09 5.87
3 2010-12-10 5.87
4 2010-12-11 5.87
5 2010-12-12 5.87
6 2010-12-13 5.87
7 2010-12-14 5.87
8 2010-12-15 14.6
9 2010-12-16 14.6
10 2010-12-17 14.6
11 2010-12-18 14.6
12 2010-12-19 14.6
13 2010-12-20 14.6
14 2010-12-21 14.6
15 2010-12-22 8.85
16 2010-12-23 8.85
17 2010-12-24 8.85
18 2010-12-25 8.85
19 2010-12-26 8.85
20 2010-12-27 8.85
21 2010-12-28 8.85
22 2011-01-04 NA
23 2011-01-05 NA
24 2011-01-06 NA
25 2011-01-07 NA
26 2011-01-08 NA
27 2011-01-09 NA
28 2011-01-10 NA
29 2011-01-11 NA
30 2011-01-11 20.2
31 2011-01-12 20.2
32 2011-01-13 20.2
33 2011-01-14 20.2
34 2011-01-15 20.2
35 2011-01-16 20.2
36 2011-01-17 20.2
37 2011-01-18 20.2
38 2011-01-18 74.6
39 2011-01-19 74.6
40 2011-01-20 74.6
41 2011-01-21 74.6
42 2011-01-22 74.6
43 2011-01-23 74.6
44 2011-01-24 74.6
45 2011-01-25 74.6
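For long windows, the string round trip mentioned above can be avoided by keeping each row's day sequence as a list of integers. The following is a minimal base R sketch, not part of the original answer. It also shows one way to reconcile the result with the desired output in the question: in the results above, 2011-01-11 and 2011-01-18 each appear twice because adjacent intervals share a boundary date, and the uncovered days from 2010-12-29 to 2011-01-03 are absent. The rule "keep the later interval's value on a shared date" is an assumption chosen to match the desired output; the names day_list, days, conc, keep, all_days, and daily are illustrative.

```r
# Same values as dataset_1 above, rebuilt as a plain data.frame.
dataset_1 <- data.frame(
  startdate = as.Date(c(14951, 14958, 14965, 14978, 14985, 14992),
                      origin = "1970-01-01"),
  enddate   = as.Date(c(14957, 14964, 14971, 14985, 14992, 14999),
                      origin = "1970-01-01"),
  Conc      = c(5.873, 14.591, 8.854, NA, 20.228, 74.57)
)

# One integer day sequence per row, held in a list rather than a string.
day_list <- Map(seq.int,
                as.integer(dataset_1$startdate),
                as.integer(dataset_1$enddate))

# Flatten the days and repeat each Conc once per day of its interval.
days <- unlist(day_list)
conc <- rep(dataset_1$Conc, times = lengths(day_list))
length(days)  # 45, one entry per row of the tibble shown above

# Assumption: on a date shared by two intervals, keep the later interval.
keep <- !duplicated(days, fromLast = TRUE)

# Look up every calendar day in the full range; days that no interval
# covers have no match and therefore get NA.
all_days <- seq.int(min(days), max(days))
daily <- data.frame(
  Date = as.Date(all_days, origin = "1970-01-01"),
  Conc = conc[keep][match(all_days, days[keep])]
)
nrow(daily)  # 49, from 2010-12-08 through 2011-01-25
```

Working on the integer day numbers and converting to Date only at the end keeps the sketch free of class-handling surprises. If the earlier interval should win on a shared date instead, dropping fromLast = TRUE would give that behavior.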
