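I have a good understanding of R, but I'm not familiar with the JSON file type or with best practices for parsing it. I'm having trouble building a data frame from a raw JSON file. The JSON file (data below) consists of repeated-measures data, with multiple observations per user.
When the raw file is read into R
jdata <- read_json("./raw.json")
it comes in as a "List of 1", and that list is user_id. Within each user_id there are further lists, like this -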
jdata$user_id$`sjohnson`$date$`2020-09-25`$city
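The last position actually splits into two options - $city or $zip. At the top level, there are about 89 users in the full file.
My goal is to end up with a rectangular data frame, or several data frames that I can merge together, like this - I don't actually need the zip code.
Example table
I've tried jsonlite and tidyverse, and the farthest I seem to get is a data frame with a single variable at the lowest level, with city and zip code in alternating rows, using this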
df <- as.data.frame(matrix(unlist(jdata), nrow=length(unlist(jdata["users"]))))
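Any help or suggestions for getting closer to the table above would be greatly appreciated. I have a feeling that I'm failing to loop back through the different levels.
Here is a sample of the structure of the raw JSON file: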
{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}
uj5u.com user reply:
Another (straightforward) solution, in which rrapply() from the rrapply package does the heavy lifting:
library(rrapply)
library(dplyr)
rrapply(jdata, how = "melt") %>%
  filter(L5 == "city") %>%
  select(user_id = L2, date = L4, city = value)
#>    user_id       date         city
#> 1 sjohnson 2020-09-25       Denver
#> 2 sjohnson 2020-10-01      Atlanta
#> 3 sjohnson 2020-11-04 Jacksonville
#> 4   asmith 2020-10-16    Cleavland
#> 5   asmith 2020-11-10     Elmhurst
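To make the melt step less of a black box, here is a rough base-R sketch of the idea: walk the nested list and record the path of names leading to each leaf. This is not rrapply's implementation; melt_paths() is a hypothetical helper, and it assumes every list is named and every leaf sits at the same depth, as in the sample data.

```r
# Hypothetical helper: one row per leaf, with the path stored in L1, L2, ...
melt_paths <- function(x, path = character()) {
  if (is.list(x)) {
    # Recurse into each named child, extending the path with the child's name.
    rows <- Map(function(child, nm) melt_paths(child, c(path, nm)), x, names(x))
    return(do.call(rbind, unname(rows)))
  }
  # Leaf: turn the path into columns L1..Ln and keep the value as text.
  out <- as.data.frame(
    setNames(as.list(path), paste0("L", seq_along(path))),
    stringsAsFactors = FALSE
  )
  out$value <- if (is.null(x)) NA_character_ else as.character(x)
  out
}

# A cut-down version of the sample data, built by hand.
jdata <- list(user_id = list(
  sjohnson = list(date = list(
    "2020-09-25" = list(city = "Denver", zip = "80014")
  )),
  asmith = list(date = list(
    "2020-10-16" = list(city = "Cleavland", zip = "34321")
  ))
))

melted <- melt_paths(jdata)

# Keep only the city leaves and give the path columns meaningful names.
cities <- melted[melted$L5 == "city", c("L2", "L4", "value")]
names(cities) <- c("user_id", "date", "city")
cities
```

Once the data is in this long shape, picking out the city rows is an ordinary filter, which is all the pipeline above does after the melt.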
Data
jdata <- jsonlite::fromJSON('{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}')
uj5u.com user reply:
We can build the structure we want step by step:
library(jsonlite)
library(tidyverse)
df <- fromJSON('{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}')
df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(city = unlist(value)) %>%
  filter(var == 'city') %>%
  select(-var, -value)
This gives:
# A tibble: 5 x 3
  user_id  date       city
  <chr>    <chr>      <chr>
1 sjohnson 2020-09-25 Denver
2 sjohnson 2020-10-01 Atlanta
3 sjohnson 2020-11-04 Jacksonville
4 asmith   2020-10-16 Cleavland
5 asmith   2020-11-10 Elmhurst
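The key move in that pipeline is unnest_longer() with indices_to, which turns the names of each nested list into a column of their own. A minimal sketch on a made-up two-user tibble (the nested tibble here is an illustration, not the output of the earlier steps):

```r
library(tibble)
library(tidyr)

# One row per user; 'value' holds a named list keyed by date.
nested <- tibble(
  user_id = c("sjohnson", "asmith"),
  value = list(
    list("2020-09-25" = "Denver", "2020-10-01" = "Atlanta"),
    list("2020-10-16" = "Cleavland")
  )
)

# Each list element becomes its own row; the element names land in 'date'.
long <- unnest_longer(nested, value, indices_to = "date")
long
```

Applying the same call a second time peels off the next level (city, zip and so on), which is why it appears twice in the pipeline.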
An alternative solution inspired by @Greg, in which we change the last three steps:
df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(value = unlist(value)) %>%
  pivot_wider(names_from = "var") %>%
  select(user_id, date, city)
This gives almost the same result, except for one extra row where city is NA:
# A tibble: 6 x 3
  user_id  date                city
  <chr>    <chr>               <chr>
1 sjohnson 2020-09-25          Denver
2 sjohnson 2020-10-01          Atlanta
3 sjohnson 2020-11-04          Jacksonville
4 asmith   2020-10-16          Cleavland
5 asmith   2020-11-10          Elmhurst
6 asmith   2020-11-10 08:49:36 NA
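The extra row appears because filter() discards any date that has no city entry, whereas pivot_wider() keeps one row per user and date and fills the missing city with NA. A small sketch on hand-written long data (the long tibble is made up to mirror the asmith records):

```r
library(tibble)
library(dplyr)
library(tidyr)

# Long format: one row per (user, date, variable).
long <- tribble(
  ~user_id, ~date,                 ~var,        ~value,
  "asmith", "2020-11-10",          "city",      "Elmhurst",
  "asmith", "2020-11-10",          "zip",       "00013",
  "asmith", "2020-11-10 08:49:36", "timestamp", "1605016176013"
)

# Variant 1: keep only city rows; the timestamp-only date disappears.
filtered <- long %>%
  filter(var == "city") %>%
  select(user_id, date, city = value)

# Variant 2: spread variables into columns; the timestamp-only date stays, with NA city.
widened <- long %>%
  pivot_wider(names_from = "var") %>%
  select(user_id, date, city)
```

Which variant is preferable depends on whether a date without a city should still count as an observation.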
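uj5u.com user reply:
Here is a solution in the tidyverse: a custom function, unnestable(), designed to recursively unnest a list of the kind you describe into a table. See Details for the specifics of such lists and their tabular format.
Solution
First, make sure the necessary libraries are loaded: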
library(jsonlite)
library(tidyverse)
Then define the unnestable() function as follows:
unnestable <- function(v) {
  # If we've reached the bottommost list, simply treat it as a table...
  if(all(sapply(
    X = v,
    # Check that each element is a single value (or NULL).
    FUN = function(x) {
      is.null(x) || purrr::is_scalar_atomic(x)
    },
    simplify = TRUE
  ))) {
    v %>%
      # Replace any NULLs with NAs to preserve blank fields...
      sapply(
        FUN = function(x) {
          if(is.null(x))
            NA
          else
            x
        },
        simplify = FALSE
      ) %>%
      # ...and convert this bottommost list into a table.
      tidyr::as_tibble()
  }
  # ...but if this list contains another nested list, then recursively unnest its
  # contents and combine their tabular results.
  else if(purrr::is_scalar_list(v)) {
    # Take the contents within the nested list...
    v[[1]] %>%
      # ...apply this 'unnestable()' function to them recursively...
      sapply(
        FUN = unnestable,
        simplify = FALSE,
        USE.NAMES = TRUE
      ) %>%
      # ...and stack their results.
      dplyr::bind_rows(.id = names(v)[1])
  }
  # Otherwise, the format is unrecognized and yields no results.
  else {
    NULL
  }
}
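The base case of the function is worth seeing in isolation. A minimal sketch, using a record shaped like the timestamped entry in the sample data, of swapping NULL for NA before converting to a one-row tibble, so that the blank field survives as a column:

```r
library(tibble)

# A bottom-level record in which one field is NULL (JSON null).
leaf <- list(location = NULL, timestamp = 1605016176013)

# Swap each NULL for NA so the field still has exactly one value.
filled <- lapply(leaf, function(x) if (is.null(x)) NA else x)

# Every element is now a length-one vector, so this yields a one-row tibble.
row <- as_tibble(filled)
row
```

Here lapply() stands in for the sapply(..., simplify = FALSE) call used in the function; the two return the same list.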
Finally, process the JSON data as follows:
# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")
# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)
# View the raw table.
flat_data
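If there is no raw.json on disk to experiment with, the same kind of nested list can be built from a string. The sketch below uses jsonlite::parse_json(), which parses JSON text without simplifying it into vectors or data frames; the JSON string is a cut-down copy of the sample data.

```r
library(jsonlite)

json_text <- '{
  "user_id": {
    "asmith": {
      "date": {
        "2020-11-10": {"city": "Elmhurst", "zip": "00013"},
        "2020-11-10 08:49:36": {"location": null, "timestamp": 1605016176013}
      }
    }
  }
}'

# Parse into plain nested lists.
jdata <- parse_json(json_text)

# Walk down the levels: user -> date -> field.
dates <- jdata$user_id$asmith$date
city <- dates[["2020-11-10"]]$city
stamped <- dates[["2020-11-10 08:49:36"]]

city               # "Elmhurst"
stamped$location   # NULL, because the JSON value was null
```

Using [[ ]] with the full date string avoids the partial matching that $ performs on list names, which matters here because one date key is a prefix of another.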
Of course, you can reformat this table as needed:
library(lubridate)
flat_data <- flat_data %>%
  dplyr::transmute(
    user_id = as.character(user_id),
    date = lubridate::as_datetime(date),
    city = as.character(city)
  ) %>%
  dplyr::distinct()
# View the reformatted table.
flat_data
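The date keys in the sample mix plain dates with full timestamps, so it helps to check how as_datetime() treats both. A minimal sketch, assuming the keys are always in year-month-day order and that the default UTC time zone is acceptable:

```r
library(lubridate)

# One plain date and one full timestamp, as found among the JSON keys.
stamps <- c("2020-09-25", "2020-11-10 08:49:36")

# Plain dates are read as midnight; timestamps keep their time of day.
parsed <- as_datetime(stamps)

format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

If the real file contains keys in other formats, they would come back as NA, so counting the NAs after conversion is a cheap sanity check.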
Result
Given a raw.json file like the one sampled here
{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}
unnestable() will then yield a tibble like this
# A tibble: 6 x 6
  user_id  date                city         zip   location     timestamp
  <chr>    <chr>               <chr>        <chr> <lgl>            <dbl>
1 sjohnson 2020-09-25          Denver       80014 NA                  NA
2 sjohnson 2020-10-01          Atlanta      30301 NA                  NA
3 sjohnson 2020-11-04          Jacksonville 14001 NA                  NA
4 asmith   2020-10-16          Cleavland    34321 NA                  NA
5 asmith   2020-11-10          Elmhurst     00013 NA                  NA
6 asmith   2020-11-10 08:49:36 NA           NA    NA       1605016176013
which the dplyr step reformats into the following result:
# A tibble: 6 x 3
  user_id  date                city
  <chr>    <dttm>              <chr>
1 sjohnson 2020-09-25 00:00:00 Denver
2 sjohnson 2020-10-01 00:00:00 Atlanta
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith   2020-10-16 00:00:00 Cleavland
5 asmith   2020-11-10 00:00:00 Elmhurst
6 asmith   2020-11-10 08:49:36 NA
Details
List format
To be precise, the list represents nested groupings by the fields {group_1, group_2, ..., group_n}, and it must be of the following form:
list(
  group_1 = list(
    "value_1" = list(
      group_2 = list(
        "value_1.1" = list(
          # .
          # .
          # .
          group_n = list(
            "value_1.1.….n.1" = list(
              field_a = 1,
              field_b = TRUE
            ),
            "value_1.1.….n.2" = list(
              field_a = 2,
              field_c = "2"
            )
            # ...
          )
        ),
        "value_1.2" = list(
          # .
          # .
          # .
        )
        # ...
      )
    ),
    "value_2" = list(
      group_2 = list(
        "value_2.1" = list(
          # .
          # .
          # .
          group_n = list(
            "value_2.1.….n.1" = list(
              field_a = 3,
              field_d = 3.0
            )
            # ...
          )
        ),
        "value_2.2" = list(
          # .
          # .
          # .
        )
        # ...
      )
    )
    # ...
  )
)
Table format
Given a list of this form, unnestable() flattens it into a table of the following form:
# A tibble: … x …
  group_1 group_2   ... group_n         field_a field_b field_c field_d
  <chr>   <chr>     ... <chr>             <dbl> <lgl>   <chr>     <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1       1 TRUE    NA           NA
2 value_1 value_1.1 ... value_1.1.….n.2       2 NA      2            NA
3 value_1 value_1.2 ... value_1.2.….n.1     ... ...     ...         ...
⋮ ⋮       ⋮             ⋮                     ⋮ ⋮       ⋮             ⋮
j value_2 value_2.1 ... value_2.1.….n.1       3 NA      NA            3
⋮ ⋮       ⋮             ⋮                     ⋮ ⋮       ⋮             ⋮
k value_2 value_2.2 ... value_2.2.….n.1     ... ...     ...         ...
⋮ ⋮       ⋮             ⋮                     ⋮ ⋮       ⋮             ⋮
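The NA cells in that table come from the stacking step: dplyr::bind_rows() aligns columns by name and fills in NA wherever a record lacks a field, while its .id argument turns the list names into the grouping column. A small sketch using two made-up bottom-level records:

```r
library(tibble)
library(dplyr)

# Two bottom-level tables that share only some of their fields.
per_value <- list(
  "value_1" = tibble(field_a = 1, field_b = TRUE),
  "value_2" = tibble(field_a = 2, field_c = "2")
)

# Stack them; the list names become the 'group_n' column, gaps become NA.
stacked <- bind_rows(per_value, .id = "group_n")
stacked
```

Because the recursion applies this at every level, each grouping field ends up as one leading column in the final table.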
Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/342529.html
