I have interview transcripts whose format is partly irregular:
tst <- c("In: ja COOL; #00:04:24-6# ",
" in den vier, FüNF wochen, #00:04:57-8# ",
"In: jah, #00:02:07-8# ",
"In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
" also jA:h; #00:03:16-6# (1.1)",
"Bz: [E::hm; ] #00:03:51-4# (3.0) ",
"Bz: [mhmh, ]",
" in den bilLIE da war;")
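What I need to do is structure the data by extracting its key elements into the columns of a data frame. There are four such key elements:
Role: the speaker's role in the interview, interviewee or interviewer
Utterance: the interview partner's speech
Timestamp: delimited by # on both ends
Gap: a decimal number in parentheses
The problem is that Timestamp and Gap are not provided consistently. I can make the last capture group, Gap, optional, but strings that have neither a Timestamp nor a Gap are not rendered correctly:
I am using extract from tidyr for the extraction: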
library(tidyr)
data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\\w{2}:\\s|\\s+)([\\S\\s]+?)\\s*#([^#]+)?#\\s*(\\([0-9.]+\\))?\\s*")
  Role                 Utterance  Timestamp   Gap
1  In:                  ja COOL; 00:04:24-6
2      in den vier, FüNF wochen, 00:04:57-8
3  In:                      jah, 00:02:07-8
4  In:                    [ja; ] 00:03:25-5
5                     also jA:h; 00:03:16-6 (1.1)
6  Bz:                 [E::hm; ] 00:03:51-4 (3.0)
7 <NA>                      <NA>       <NA>  <NA>
8 <NA>                      <NA>       <NA>  <NA>
How can I improve the regular expression so that I get the desired output:
  Role                 Utterance  Timestamp   Gap
1  In:                  ja COOL; 00:04:24-6
2      in den vier, FüNF wochen, 00:04:57-8
3  In:                      jah, 00:02:07-8
4  In:                    [ja; ] 00:03:25-5
5                     also jA:h; 00:03:16-6 (1.1)
6  Bz:                 [E::hm; ] 00:03:51-4 (3.0)
7  Bz:                  [mhmh, ]
8          in den bilLIE da war;
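uj5u.com user reply:
An alternative to one complex regular expression is to run several extractions, each with a simpler regular expression. Then convert any NA to "" and trim the unwanted whitespace.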
library(dplyr)
library(tidyr)
data.frame(tst) %>%
  extract(tst, "Gap", "(\\(.*?\\))", remove = FALSE) %>%
  extract(tst, "Timestamp", "(#.*?#)", remove = FALSE) %>%
  extract(tst, c("Role", "Utterance"), "^(\\S+:|)([^#]*)") %>%
  # replace NA with "" in every column, then trim the utterance
  mutate(across(everything(), ~ coalesce(.x, "")), Utterance = trimws(Utterance))
giving:
  Role                 Utterance    Timestamp   Gap
1  In:                  ja COOL; #00:04:24-6#
2      in den vier, FüNF wochen, #00:04:57-8#
3  In:                      jah, #00:02:07-8#
4  In:                    [ja; ] #00:03:25-5#
5                     also jA:h; #00:03:16-6# (1.1)
6  Bz:                 [E::hm; ] #00:03:51-4# (3.0)
7  Bz:                  [mhmh, ]
8          in den bilLIE da war;
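For comparison, the same step-by-step idea can be sketched in base R without tidyr or dplyr. This is only an illustration under stated assumptions: `first_match` is a hypothetical helper (not part of any package), `samples` is a subset of the question's `tst`, and the patterns are the simple ones from the answer above.

```r
# Subset of the question's input, used here only for illustration
samples <- c("In: ja COOL; #00:04:24-6# ",
             "In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
             " also jA:h; #00:03:16-6# (1.1)",
             "Bz: [mhmh, ]",
             " in den bilLIE da war;")

# Hypothetical helper: first match of `pattern` in each element, "" if absent
first_match <- function(x, pattern) {
  out <- rep("", length(x))
  m <- regexpr(pattern, x, perl = TRUE)
  out[m != -1] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

# Utterance: drop everything from the first '#', then the leading role label
utterance <- sub("^\\S+:", "", sub("#.*$", "", samples))

res <- data.frame(
  Role      = first_match(samples, "^\\S+:"),
  Utterance = trimws(utterance),
  Timestamp = first_match(samples, "#.*?#"),
  Gap       = first_match(samples, "\\(.*?\\)"),
  stringsAsFactors = FALSE
)
print(res)
```

As in the tidyr version, each element is looked up independently, so a missing Timestamp or Gap simply yields an empty string instead of making the whole row fail.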
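uj5u.com user reply:
You can update your pattern to keep your 4 capture groups and make the last part optional, by matching groups 3 and 4 optionally and asserting the end of the string: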
library(tidyr)
tst <- c("In: ja COOL; #00:04:24-6# ",
" in den vier, FüNF wochen, #00:04:57-8# ",
"In: jah, #00:02:07-8# ",
"In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
" also jA:h; #00:03:16-6# (1.1)",
"Bz: [E::hm; ] #00:03:51-4# (3.0) ",
"Bz: [mhmh, ]",
" in den bilLIE da war;")
data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\\w{2}:\\s|\\s+)([\\s\\S]*?)(?:\\s*#([^#]+)(?:#\\s*(\\([0-9.]+\\))?\\s*)?)?$")
Output
  Role                  Utterance  Timestamp   Gap
1  In:                   ja COOL; 00:04:24-6
2       in den vier, FüNF wochen, 00:04:57-8
3  In:                       jah, 00:02:07-8
4  In: [ja; ] #00:03:25-5# [ja; ] 00:03:26-1
5                      also jA:h; 00:03:16-6 (1.1)
6  Bz:                  [E::hm; ] 00:03:51-4 (3.0)
7  Bz:                   [mhmh, ]
8           in den bilLIE da war;
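Note that row 4 contains two timestamps. Because the pattern is anchored at the end of the string, the Timestamp column receives the last one and the first stays inside Utterance. If every timestamp on a line is needed, one possible approach (an assumption on my part, not something the question asked for) is to collect them separately with base R's `gregexpr` and `regmatches`. A minimal sketch, where `row4` is simply that input line copied from `tst`:

```r
# Row 4 of tst, which carries two timestamps
row4 <- "In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#"

# gregexpr() locates every '#...#' span; regmatches() returns them as a list
stamps <- regmatches(row4, gregexpr("#[^#]+#", row4))[[1]]

# Strip the '#' delimiters
stamps <- gsub("#", "", stamps, fixed = TRUE)
print(stamps)
# [1] "00:03:25-5" "00:03:26-1"
```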
