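My data contains text strings that each end with three important pieces of information: an ID number separated by ":", a start date, and an end date. I need to split these three values into three separate columns. I have tried different approaches, from unnest_tokens and grepl/grep to separate, but I can't get it right. I can sometimes pull out one date, but I can't get the values in the right order or into a data frame.
Input data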
input<- data.frame(
id=c(1,2,3),
value=c("a long title containing all sorts - off `characters` 2022:03 29.10.2021
21.02.2022",
"but the strings always end with the same - document id, start date: and end date 2022:02
30.04.2020 18.02.2022",
"so I need to split document id, start and end dates into separate columns 2000:01
07.10.2000 15.02.2021")
)
Expected output
output <-data.frame(
id=c(1,2,3),
value=c("a long title containing all sorts - off `characters`",
"but the strings always end with the same - document id, start date: and end date",
"so I need to split document id, start and end dates into separate columns"),
docid=c("2022:03", "2022:02", "2000:01"),
start=c("29.10.2021", "30.04.2020", "07.10.2000"),
end=c("21.02.2022", "18.02.2022", "15.02.2021")
)
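Reply from a uj5u.com user:
This is most conveniently done with extract: in its regex argument, we describe the whole string we want to split as a single, exhaustive pattern, and wrap each part that should become its own column in a capture group (...):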
library(tidyr)
input %>%
extract(value,
into = c("value", "docid", "start", "end"),
regex = "(.*)\\s(\\d{4}:\\d{2})\\s{1,}(.*)\\s{1,}(.*)")
  id                                                                            value   docid      start        end
1  1                             a long title containing all sorts - off `characters` 2022:03 29.10.2021 21.02.2022
2  2 but the strings always end with the same - document id, start date: and end date 2022:02 30.04.2020 18.02.2022
3  3        so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000 15.02.2021
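The same capture-group idea also works without tidyr. Below is a minimal base R sketch using utils::strcapture(), with a stricter pattern than the one above. It assumes that the document id always looks like YYYY:NN and that both dates are always DD.MM.YYYY; those formats are inferred from the three sample rows only. The names date_rx, pattern, proto and parts are just illustrative, and the sample strings are written on one line each for readability.

```r
# Sample strings in the same shape as the question's data:
# free text, then document id, start date and end date.
input <- data.frame(
  id = c(1, 2, 3),
  value = c(
    "a long title containing all sorts - off `characters` 2022:03 29.10.2021 21.02.2022",
    "but the strings always end with the same - document id, start date: and end date 2022:02 30.04.2020 18.02.2022",
    "so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000 15.02.2021"
  ),
  stringsAsFactors = FALSE
)

# Assumed formats (not confirmed beyond the samples):
# docid is YYYY:NN, dates are DD.MM.YYYY.
date_rx <- "\\d{2}\\.\\d{2}\\.\\d{4}"
pattern <- paste0(
  "^(.*?)\\s+",           # group 1: free text (lazy, so no trailing whitespace)
  "(\\d{4}:\\d{2})\\s+",  # group 2: document id
  "(", date_rx, ")\\s+",  # group 3: start date
  "(", date_rx, ")\\s*$"  # group 4: end date, anchored at the end of the string
)

# The prototype gives the names and types of the new columns,
# one per capture group, in order.
proto <- data.frame(
  value = character(), docid = character(),
  start = character(), end = character(),
  stringsAsFactors = FALSE
)

parts <- strcapture(pattern, input$value, proto, perl = TRUE)
output <- data.frame(id = input$id, parts, stringsAsFactors = FALSE)

# Optional: parse the date strings into Date vectors for sorting or arithmetic.
start_date <- as.Date(output$start, format = "%d.%m.%Y")
end_date   <- as.Date(output$end,   format = "%d.%m.%Y")
```

Because the pattern is anchored and the date groups are spelled out, a row that does not follow the assumed layout should come back as NA in all four columns instead of being split in the wrong place, which makes malformed rows easier to spot.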
Please credit the source when reposting. Link to this post: https://www.uj5u.com/net/430720.html
