我的最終目標是從 US HealthData 網站讀取一系??列每日 .xlsx 檔案(名為“Community_Profile_Report_YYYYMMDD_Public.xlsx”):

我的問題是,如何提取關聯的 .xlsx 檔案的所有assetIDs 和 thefilenames并將它們放入串列或資料框中?
這是一個示例檔案的 URL(我需要的多個日期中的一個),以及我在讀入它后打算如何處理它。目錄路徑中“/files/”后面的字母數字短語對應于assetID.
url <- "https://healthdata.gov/api/views/gqxm-d9w9/files/3bf3ff9c-67d3-47af-aad3-799e21de6e3d?download=true&filename=Community Profile Report 20210903.xlsx"
df2 <- read.xlsx(url, sheet=6) %>% select(1,2,6,8,16,76,77) %>% .[-1, ]
colnames(df2) <- c("county","FIPS","state","pop","cases_last7","vaccinated","vacc_prop")
cols.name <- c("FIPS","pop","cases_last7","vaccinated","vacc_prop")
df2[cols.name] <- sapply(df2[cols.name], as.numeric)
df2 <- df2 %>% filter(complete.cases(.)) %>% mutate(cases_last7_100k = cases_last7/pop * 100000) %>%
mutate(vacc_prop = vacc_prop*100) %>% filter(state != "PR" & vacc_prop < 100)
謝謝,大衛
uj5u.com熱心網友回復:
一種方式是通過函式parse_json 從jsonlite包。
第一個元素的快速而丑陋的示例:
require(jsonlite)
require(magrittr)
dt <- read_json(path = 'https://healthdata.gov/resource/6hii-ae4f.json')
# as an example, consider only first item from the list
dt2 <- dt[[1]]$metadata_published %>% parse_json
# str(dt2) to see the parsed object structure.
# we want to loop through the `attachments`
dt2$attachments %>% lapply(.,function(attachment){
c(filename = attachment$filename, assetId = attachment$assetId)
}) %>% do.call(rbind,.) -> extracted_list
# filter for `xlsx` entries
subset(extracted_list, grepl('*.xlsx', extracted_list[,'filename']))
對于一般情況,我們可以將其包裝為一個函式
'extract_attachments' <- function(x) {
require(jsonlite)
require(magrittr)
x$metadata_published %>% parse_json %>% {
.$attachments
} %>%
lapply(., function(attachment) {
c(filename = attachment$filename,
assetId = attachment$assetId)
}) %>% do.call(rbind, .) %>%
subset(., grepl('*.xlsx', .[, 'filename']))
}
并遍歷整個json:
lapply(dt,extract_attachments) %>% do.call(rbind,.)
轉載請註明出處,本文鏈接:https://www.uj5u.com/qiye/336890.html
上一篇:更改一組堆疊條形圖中的軸標簽
