假設我有兩個資料框df1和df2:
df1 = structure(list(surname = c("Duisenberg", "Trichet", "Draghi"),
`start term` = structure(c(896659200, 1067644800, 1320105600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), `end term` = structure(c(1067558400,
1320019200, 1572480000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(1L,
9L, 15L), class = "data.frame") %>% data.frame(stringsAsFactors = F)
surname start.term end.term
1 Duisenberg 1998-06-01 2003-10-31
9 Trichet 2003-11-01 2011-10-31
15 Draghi 2011-11-01 2019-10-31
df2= data.frame(Date = c("2010-01-01","1997-01-01","2020-01-01","2004-01-01","2012-01-01","1999-01-01","2000-01-01","2020-01-01","2022-01-01","1996-01-01"), speaker = c("Mario Draghi","W.L. Duisenberg","Ciao","Jean-Claude Trichet","M. Draghi","W.L. Duisenberg","Jean-Claude Trichet","Bye","Ciao","Mario Draghi"), stringsAsFactors = F)
Date speaker
1 2010-01-01 Mario Draghi
2 1997-01-01 W.L. Duisenberg
3 2020-01-01 Ciao
4 2004-01-01 Jean-Claude Trichet
5 2012-01-01 M. Draghi
6 1999-01-01 W.L. Duisenberg
7 2000-01-01 Jean-Claude Trichet
8 2020-01-01 Bye
9 2022-01-01 Ciao
10 1996-01-01 Mario Draghi
我可以很容易地找到其中的名稱何時df1出現df2:
which(grepl(paste0(df1$surname, collapse = "|"), df2$speaker, ignore.case = TRUE))
[1] 1 2 4 5 6 7 10
相反,更棘手的是:只有當日期在(和)的邊界之外時,df1才會出現 in的名稱。df2df2df1start.termend.term
結果應該是:
[1] 1 2 10
我該怎么做?有人可以幫我弄這個嗎?
謝謝!
uj5u.com熱心網友回復:
我認為基本上你想在這里對匹配的名稱進行連接操作。所以第一步是找出那些是什么:
library(dplyr)
surnames_regex <- paste0(df1$surname, collapse = "|")
df2$matching_name <- strsplit(df2$speaker, split = "\\s") |>
lapply(
\(name) {
matching_name <- grep(surnames_regex, name, v = T)
matching_name <- ifelse(
length(matching_name) > 0,
matching_name[1],
NA_character_
)
matching_name
}
) |>
unlist()
df2
# Date speaker matching_name
# 1 2010-01-01 Mario Draghi Draghi
# 2 1997-01-01 W.L. Duisenberg Duisenberg
# 3 2020-01-01 Ciao <NA>
# 4 2004-01-01 Jean-Claude Trichet Trichet
# 5 2012-01-01 M. Draghi Draghi
# 6 1999-01-01 W.L. Duisenberg Duisenberg
# 7 2000-01-01 Jean-Claude Trichet Trichet
# 8 2020-01-01 Bye <NA>
# 9 2022-01-01 Ciao <NA>
# 10 1996-01-01 Mario Draghi Draghi
然后,這只是加入這些名稱并根據您定義的條件進行過濾的情況:
df2 |>
inner_join(
df1,
by = c("matching_name" = "surname")
) |>
filter(
Date < start.term |
Date > end.term
)
# Date speaker matching_name start.term end.term
# 1 2010-01-01 Mario Draghi Draghi 2011-11-01 2019-10-31
# 2 1997-01-01 W.L. Duisenberg Duisenberg 1998-06-01 2003-10-31
# 3 2000-01-01 Jean-Claude Trichet Trichet 2003-11-01 2011-10-31
# 4 1996-01-01 Mario Draghi Draghi 2011-11-01 2019-10-31
uj5u.com熱心網友回復:
確保所有日期的格式一致,即所有日期都屬于 Date 類。然后這可以在單個 SQL 陳述句中完成。
library(sqldf)
df1a <- transform(df1, start.term = as.Date(start.term),
end.term = as.Date(end.term))
df2a <- transform(df2, Date = as.Date(Date))
sqldf("select distinct b.rowid, *
from df1a a
join df2a b on
(b.Date < a.[start.term] or b.Date > a.[end.term]) and
b.speaker like '%' || a.surname || '%'")
給予:
rowid surname start.term end.term Date speaker
1 1 Draghi 2011-11-01 2019-10-31 2010-01-01 Mario Draghi
2 2 Duisenberg 1998-06-01 2003-10-31 1997-01-01 W.L. Duisenberg
3 7 Trichet 2003-11-01 2011-10-31 2000-01-01 Jean-Claude Trichet
4 10 Draghi 2011-11-01 2019-10-31 1996-01-01 Mario Draghi
轉載請註明出處,本文鏈接:https://www.uj5u.com/net/512688.html
標籤:r数据框for循环if 语句
上一篇:將資料發送到服務陣列蠻力
