I have a file (.docx), available at the link below, and I used the officer package to extract its contents.

I extracted the contents of the file with the code below.
library(officer)
doc <- read_docx("test.docx")
content <- docx_summary(doc)
head(content)
# Keep only the paragraph rows and the three columns of interest
par_data <- subset(content, content_type %in% "paragraph")
par_data <- par_data[, c("doc_index", "style_name", "text")]
# Truncate each paragraph to its first 30 characters
# (substr() already stops at the end of shorter strings)
par_data$text <- substr(par_data$text, start = 1, stop = 30)
par_data
The data frame can be reproduced with the following code.
par_data <- data.frame(doc_index = 1:21,
style_name = c("heading 1", "heading 2", "heading 3",NA ,NA,NA, "heading 2", "heading 3", NA,NA,NA, NA,"heading 2", "heading 3", NA, NA, "heading 1", "heading 2","heading 3", NA,NA ),
text = c(' Cardiovascular drugs ', ' ACE inhibitors. ', ' Valsartan ', ' Valsartan is used to treat hig ', ' Side effects ', ' high potassium; headache, dizz ', ' Beta blockers. ', ' propranolol ', ' Propranolol is prescribed for ', ' Side effects ', ' slow or uneven heartbeats', ' wheezing or trouble breathing ', ' Calcium channel blockers. ', ' Nifedipine ', ' Side effects ', ' Bloating or swelling of the fa ', ' Neurological drugs ', ' Anticonvulsants ', ' Phenytoin ', ' Side effects ', ' Decreased coordination, mental '))
What I need is to reshape this data frame into something like the following:

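In fact, I need headings 1 and 2 as columns, where each drug (all of which are heading 3) gets the text of the most recent heading in those columns. I also need two additional columns. Some drugs have a description followed by side effects, while others have only side effects; these sit in the rows before the next heading 1, 2, or 3. Is there a straightforward way to do this? Any help is appreciated.
Answer from a uj5u.com user:
This is more than just reshaping: it needs some inference based on the previous text and style_name values, plus "last observation carried forward" (LOCF). The data also has whitespace at the start and end of the strings, so I'll use trimws.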
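The three ingredients named above (trimming, looking at the previous row, and LOCF) can be sketched in base R before reaching for any package. This is only an illustration on made-up toy vectors, not the asker's data; the names prev_style, prev_text, is_side_effect and h1_filled are hypothetical.

```r
# Toy vectors that mimic the layout of par_data (values are illustrative only)
style <- c("heading 1", "heading 2", NA, NA, "heading 1", "heading 2", NA, NA)
text  <- c(" Cardio ", " ACE ", " Side effects ", " headache ",
           " Neuro ", " Anti ", " Side effects ", " dizziness ")

# 1. Strip leading/trailing whitespace
text <- trimws(text)

# 2. Shift by one row so each row can see the previous one
#    (a base-R stand-in for dplyr::lag / data.table::shift)
prev_style <- c(NA, head(style, -1))
prev_text  <- c(NA, head(text, -1))

# 3. A new group starts where a heading follows a non-heading row
grp <- cumsum(is.na(prev_style) & !is.na(style))

# 4. An unstyled row right after "Side effects" holds the side-effect text
is_side_effect <- is.na(style) & !is.na(prev_text) & prev_text == "Side effects"

# 5. LOCF: carry the last non-NA "heading 1" text forward
#    (this index trick assumes the first element is not NA)
h1 <- ifelse(style %in% "heading 1", text, NA)
h1_filled <- h1[cummax(seq_along(h1) * !is.na(h1))]
```

Here grp numbers the two blocks, text[is_side_effect] picks out the side-effect lines, and h1_filled repeats each top-level heading down its block.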
dplyr
I think this does what you want:
library(dplyr)
library(tidyr) # pivot_wider(), fill()
par_data %>%
  # strip leading/trailing whitespace from every character column
  mutate(across(where(is.character), trimws)) %>%
  mutate(
    # a new group starts where a heading follows a non-heading row
    grp = cumsum(is.na(lag(style_name)) & !is.na(style_name)),
    style_name = case_when(
      is.na(style_name) & lag(text) == "Side effects" ~ "sideeffects",
      is.na(style_name) & lag(style_name) == "heading 3" &
        !text %in% "Side effects" ~ "description",
      TRUE ~ style_name)
  ) %>%
  # drop rows that were not classified (e.g. the "Side effects" marker rows)
  filter(!is.na(style_name)) %>%
  tidyr::pivot_wider(id_cols = grp, names_from = "style_name", values_from = "text") %>%
  tidyr::fill(`heading 1`)
# # A tibble: 4 x 6
# grp `heading 1` `heading 2` `heading 3` description sideeffects
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 Cardiovascular drugs ACE inhibitors. Valsartan Valsartan is used to treat hig high potas~
# 2 2 Cardiovascular drugs Beta blockers. propranolol Propranolol is prescribed for slow or un~
# 3 3 Cardiovascular drugs Calcium channel blockers. Nifedipine NA Bloating o~
# 4 4 Neurological drugs Anticonvulsants Phenytoin NA Decreased ~
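One thing to be aware of: the rule above only labels the single row directly after "Side effects", so a second side-effect line (such as "wheezing or trouble breathing" in the sample data) is left unclassified and filtered out. If all lines should be kept, a variant along the following lines might work. It is a sketch on a small stand-in data frame, with hypothetical helper columns marker and role, and it assumes a tidyr version in which values_fn accepts a single function.

```r
library(dplyr)
library(tidyr)

# Small stand-in for one drug block of par_data (already trimmed)
toy <- data.frame(
  style_name = c("heading 1", "heading 2", "heading 3", NA, NA, NA, NA),
  text = c("Cardiovascular drugs", "Beta blockers.", "propranolol",
           "Propranolol is prescribed for", "Side effects",
           "slow or uneven heartbeats", "wheezing or trouble breathing")
)

res <- toy %>%
  mutate(
    grp = cumsum(is.na(lag(style_name)) & !is.na(style_name)),
    # TRUE on the unstyled "Side effects" marker rows
    marker = is.na(style_name) & text == "Side effects"
  ) %>%
  group_by(grp) %>%
  mutate(
    role = case_when(
      !is.na(style_name)  ~ style_name,     # headings keep their style
      marker              ~ NA_character_,  # the marker row itself is dropped
      cumsum(marker) > 0  ~ "sideeffects",  # every row after the marker
      TRUE                ~ "description"   # unstyled rows before the marker
    )
  ) %>%
  ungroup() %>%
  filter(!is.na(role)) %>%
  pivot_wider(
    id_cols = grp, names_from = role, values_from = text,
    # collapse several lines of the same role into one string
    values_fn = function(x) paste(x, collapse = "; ")
  )
```

The separator "; " is an arbitrary choice; a list column would be another option if the individual lines need to stay separate.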
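This can be done outside the tidyverse, though it would still benefit from an external package function (reshape2::dcast); stats::reshape can be a bit cumbersome to use.
data.table
If you are already using (or considering) data.table, this is roughly equivalent to the above: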
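For reference, the long-to-wide step on its own could look roughly like this with stats::reshape. This is a minimal sketch on a hypothetical, already-classified long table (the name long and its values are made up for illustration), not a full replacement for the pipeline above.

```r
# Hypothetical long table: one row per (group, role) pair
long <- data.frame(
  grp = c(1, 1, 1, 2, 2),
  style_name = c("heading 2", "heading 3", "sideeffects",
                 "heading 2", "heading 3"),
  text = c("ACE inhibitors.", "Valsartan", "headache",
           "Beta blockers.", "propranolol"),
  stringsAsFactors = FALSE
)

# One row per grp, one column per style_name value
wide <- reshape(long, idvar = "grp", timevar = "style_name",
                direction = "wide")

# reshape() prefixes the new columns with "text."; strip that prefix
names(wide) <- sub("^text\\.", "", names(wide))
```

Groups that lack a given role (here, group 2 has no side effects) get NA in that column.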
library(data.table)
chrs <- which(sapply(par_data, is.character))
as.data.table(par_data)[, c(chrs) := lapply(.SD, trimws), .SDcols = chrs
][, grp := cumsum(is.na(shift(style_name)) & !is.na(style_name))
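][, style_name := fcase(
    is.na(style_name) & shift(text) == "Side effects", "sideeffects",
    # use data.table::shift here; without dplyr loaded, lag() is stats::lag and does not shift a plain vector
    is.na(style_name) & shift(style_name) == "heading 3" &
      !text %in% "Side effects", "description",
    rep(TRUE, .N), style_name)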
][!is.na(style_name),
][, dcast(grp ~ style_name, value.var = "text", data = .SD)
][, `heading 1` := zoo::na.locf(`heading 1`)
][, .(`heading 1`, `heading 2`, `heading 3`, description, sideeffects) ]
# heading 1 heading 2 heading 3 description sideeffects
# 1: Cardiovascular drugs ACE inhibitors. Valsartan Valsartan is used to treat hig high potassium; headache, dizz
# 2: Cardiovascular drugs Beta blockers. propranolol Propranolol is prescribed for slow or uneven heartbeats
# 3: Cardiovascular drugs Calcium channel blockers. Nifedipine <NA> Bloating or swelling of the fa
# 4: Neurological drugs Anticonvulsants Phenytoin <NA> Decreased coordination, mental
