Splitting numbers and dates into separate columns

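My data contains text strings with three important features: an ID number whose parts are separated by ":", a start date, and an end date. I need to get these three values into three separate columns. I have tried different approaches, from unnest_tokens and grepl/grep to separate, but I can't get any of them to work. I can sometimes pull out one of the dates, but I can't get the pieces in the right order or get them into the data frame.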

Input data

input <- data.frame(
  id = c(1, 2, 3),
  value = c(
    "a long title containing all sorts - off `characters` 2022:03 29.10.2021 21.02.2022",
    "but the strings always end with the same - document id, start date: and end date  2022:02 30.04.2020 18.02.2022",
    "so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000 15.02.2021"
  )
)

Expected output

output <- data.frame(
  id = c(1, 2, 3),
  value = c(
    "a long title containing all sorts - off `characters`",
    "but the strings always end with the same - document id, start date: and end date",
    "so I need to split document id, start and end dates into separate columns"
  ),
  docid = c("2022:03", "2022:02", "2000:01"),
  start = c("29.10.2021", "30.04.2020", "07.10.2000"),
  end = c("21.02.2022", "18.02.2022", "15.02.2021")
)

Answer from a uj5u.com user:

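This is most conveniently done with extract. In its regex argument we describe the whole string we want to split as one pattern, and wrap each part that should go into its own column in a capture group (...):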

library(tidyr)  # provides extract() and re-exports the %>% pipe

input %>%
  extract(value,
          # one name per capture group, in order
          into = c("value", "docid", "start", "end"),
          # title, then id (dddd:dd), then the two whitespace-separated dates
          regex = "(.*)\\s(\\d{4}:\\d{2})\\s{1,}(.*)\\s{1,}(.*)")

Output:

  id                                                                             value   docid      start        end
1  1                              a long title containing all sorts - off `characters` 2022:03 29.10.2021 21.02.2022
2  2 but the strings always end with the same - document id, start date: and end date  2022:02 30.04.2020 18.02.2022
3  3         so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000 15.02.2021
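
Note that row 2 of the output above keeps a trailing space in value, because that input string has two spaces before the ID and the pattern consumes only one of them. As a minimal sketch of one way to tighten this, the same capture-group idea can be written in base R with strcapture (available from R 3.4.0). The stricter anchored pattern, the names pattern, parts and result, and the optional conversion to Date are assumptions made for illustration; they are not part of the original answer.

```r
# Same sample data as in the question, one string per element.
input <- data.frame(
  id = c(1, 2, 3),
  value = c(
    "a long title containing all sorts - off `characters` 2022:03 29.10.2021 21.02.2022",
    "but the strings always end with the same - document id, start date: and end date  2022:02 30.04.2020 18.02.2022",
    "so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000 15.02.2021"
  ),
  stringsAsFactors = FALSE
)

# Assumed layout: <title> <dddd:dd> <dd.mm.yyyy> <dd.mm.yyyy> at the end of the string.
# The lazy (.*?) followed by \\s+ keeps any run of spaces out of the title.
pattern <- paste0(
  "^(.*?)\\s+",                       # group 1: title
  "(\\d{4}:\\d{2})\\s+",              # group 2: document id
  "(\\d{2}\\.\\d{2}\\.\\d{4})\\s+",   # group 3: start date
  "(\\d{2}\\.\\d{2}\\.\\d{4})\\s*$"   # group 4: end date
)

# strcapture() returns one column per capture group, typed like the prototype.
parts <- strcapture(
  pattern, input$value,
  proto = data.frame(value = character(), docid = character(),
                     start = character(), end = character(),
                     stringsAsFactors = FALSE),
  perl = TRUE
)

result <- data.frame(id = input$id, parts, stringsAsFactors = FALSE)

# Optional: turn the date strings into Date objects so they sort and compare correctly.
result$start <- as.Date(result$start, format = "%d.%m.%Y")
result$end   <- as.Date(result$end,   format = "%d.%m.%Y")

print(result)
```

Rows that do not match the pattern come back as NA in every captured column, which makes malformed strings easy to spot. If the dates should stay as text, as in the expected output of the question, the two as.Date lines can simply be dropped.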


Tags: r, string, grep
