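I have a good understanding of R, but I am not familiar with the JSON file type or with best practices for parsing it. I am having trouble building a data frame from a raw JSON file. The JSON file (data below) consists of repeated-measures data, with multiple observations per user.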

When the raw file is read into R

jdata <- read_json("./raw.json")

it comes in as a "List of 1", and that list is user_id. Within each user_id there are further lists, like this -

jdata$user_id$`sjohnson`$date$`2020-09-25`$city

The last position actually splits into two options - $city or $zip. At the top level, there are about 89 users in the full file.

My goal is to end up with a rectangular data frame, or several data frames that I can merge together, like this - I don't actually need the zip code.

Example table

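I have tried jsonlite and tidyverse, and the furthest I seem to get is a data frame with a single variable at the lowest level, where city and zip code alternate from row to row, using this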

df  <-  as.data.frame(matrix(unlist(jdata), nrow=length(unlist(jdata["users"]))))

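Any help or suggestions for getting closer to the table above would be greatly appreciated. I have a feeling that I am failing to loop back through the different levels.

Here is an example of the raw JSON file structure: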

{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}

Answer from a uj5u.com user:

Another (direct) solution, with rrapply() from the rrapply package doing the heavy lifting:

library(rrapply)
library(dplyr)

rrapply(jdata, how = "melt") %>%
  filter(L5 == "city") %>%
  select(user_id = L2, date = L4, city = value)

#>    user_id       date         city
#> 1 sjohnson 2020-09-25       Denver
#> 2 sjohnson 2020-10-01      Atlanta
#> 3 sjohnson 2020-11-04 Jacksonville
#> 4   asmith 2020-10-16    Cleavland
#> 5   asmith 2020-11-10     Elmhurst

Data

jdata <- jsonlite::fromJSON('{
   "user_id": {
    "sjohnson": {
       "date": {
        "2020-09-25": {
           "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
         },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
       }
    },
    "asmith": {
       "date": {
         "2020-10-16": {
           "city": "Cleavland",
           "zip": "34321"
         },
        "2020-11-10": {
           "city": "Elmhurst",
           "zip": "00013"
         },
         "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
       }
     }
   }
}')
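
To make the "melt" idea more concrete: rrapply(how = "melt") returns one row per leaf value, with the path to that leaf spread over the columns L1, L2, ... that the filter() and select() calls above rely on. A rough base R approximation is sketched below. It is not how rrapply is implemented, it assumes that no key contains a "." (unlist() joins the path with dots), and the small jdata list is a hypothetical stand-in for the parsed file.

```r
# Hypothetical miniature of the parsed JSON, with the same nesting as raw.json.
jdata <- list(user_id = list(
  sjohnson = list(date = list(
    "2020-09-25" = list(city = "Denver", zip = "80014"),
    "2020-10-01" = list(city = "Atlanta", zip = "30301")
  )),
  asmith = list(date = list(
    "2020-10-16" = list(city = "Cleavland", zip = "34321"),
    "2020-11-10 08:49:36" = list(location = NULL, timestamp = 1605016176013)
  ))
))

# unlist() flattens the list and keeps each leaf's path in its name,
# e.g. "user_id.sjohnson.date.2020-09-25.city". NULL leaves are dropped.
flat <- unlist(jdata)

# Split each path into its five levels (assumes keys contain no ".").
paths <- do.call(rbind, strsplit(names(flat), ".", fixed = TRUE))

# One row per leaf: level columns L1..L5 plus the leaf value.
melted <- data.frame(paths, value = unname(flat), stringsAsFactors = FALSE)
names(melted)[1:5] <- paste0("L", 1:5)

# Keep only the city leaves and name the columns as in the answer above.
cities <- melted[melted$L5 == "city", c("L2", "L4", "value")]
names(cities) <- c("user_id", "date", "city")
rownames(cities) <- NULL
cities
```

Because unlist() coerces every leaf to a single type, all values end up as character here. That is harmless for city, but it is one reason a dedicated tool such as rrapply is the safer choice for the full file.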

Answer from a uj5u.com user:

We can build the structure we want step by step:

library(jsonlite)
library(tidyverse)

df <- fromJSON('{
   "user_id": {
    "sjohnson": {
       "date": {
        "2020-09-25": {
           "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
         },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
       }
    },
    "asmith": {
       "date": {
         "2020-10-16": {
           "city": "Cleavland",
           "zip": "34321"
         },
        "2020-11-10": {
           "city": "Elmhurst",
           "zip": "00013"
         },
         "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
       }
     }
   }
}')

df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(city = unlist(value)) %>%
  filter(var == 'city') %>%
  select(-var, -value)

This gives:

# A tibble: 5 x 3
  user_id  date       city        
  <chr>    <chr>      <chr>       
1 sjohnson 2020-09-25 Denver      
2 sjohnson 2020-10-01 Atlanta     
3 sjohnson 2020-11-04 Jacksonville
4 asmith   2020-10-16 Cleavland   
5 asmith   2020-11-10 Elmhurst

An alternative solution inspired by @Greg, in which we change the final steps of the pipeline:

df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(value = unlist(value)) %>%
  pivot_wider(names_from = "var") %>%
  select(user_id, date, city)

This gives almost the same result, except for one additional case where city is NA:

# A tibble: 6 x 3
  user_id  date                city        
  <chr>    <chr>               <chr>       
1 sjohnson 2020-09-25          Denver      
2 sjohnson 2020-10-01          Atlanta     
3 sjohnson 2020-11-04          Jacksonville
4 asmith   2020-10-16          Cleavland   
5 asmith   2020-11-10          Elmhurst    
6 asmith   2020-11-10 08:49:36 NA
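
Since the question mentions trouble looping back through the levels, the same two-level walk (users, then dates) can also be sketched with explicit loops in base R. This is only an illustration under assumptions, not the pipeline above: the small jdata list is a hypothetical stand-in for the parsed file, and an observation without a city is kept with NA, as in the pivot_wider() variant.

```r
# Hypothetical miniature of the parsed JSON, with the same nesting as raw.json.
jdata <- list(user_id = list(
  sjohnson = list(date = list(
    "2020-09-25" = list(city = "Denver", zip = "80014"),
    "2020-10-01" = list(city = "Atlanta", zip = "30301")
  )),
  asmith = list(date = list(
    "2020-10-16" = list(city = "Cleavland", zip = "34321"),
    "2020-11-10 08:49:36" = list(location = NULL, timestamp = 1605016176013)
  ))
))

rows <- list()
# Outer level: one entry per user.
for (user in names(jdata[["user_id"]])) {
  observations <- jdata[["user_id"]][[user]][["date"]]
  # Inner level: one entry per date key.
  for (date in names(observations)) {
    # [["city"]] matches the name exactly and returns NULL when it is absent.
    city <- observations[[date]][["city"]]
    rows[[length(rows) + 1]] <- data.frame(
      user_id = user,
      date = date,
      city = if (is.null(city)) NA_character_ else city,
      stringsAsFactors = FALSE
    )
  }
}

# Stack the one-row data frames into the final rectangular table.
result <- do.call(rbind, rows)
result
```

Growing a list inside a loop is fine for roughly 89 users; for much larger files the vectorised pipelines above are likely to scale better.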

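Answer from a uj5u.com user:

Here is a tidyverse solution: a custom function, unnestable(), designed to recursively unnest a list of the kind you describe into a table. See Details for the exact format of such lists and of the resulting table.

Solution

First make sure the necessary libraries are loaded: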

library(jsonlite)
library(tidyverse)

Then define the unnestable() function as follows:

unnestable <- function(v) {
  # If we've reached the bottommost list, simply treat it as a table...
  if(all(sapply(
    X = v,
    # Check that each element is a single value (or NULL).
    FUN = function(x) {
      is.null(x) || purrr::is_scalar_atomic(x)
    },
    simplify = TRUE
  ))) {
    v %>%
      # Replace any NULLs with NAs to preserve blank fields...
      sapply(
        FUN = function(x) {
          if(is.null(x))
            NA
          else
            x
        },
        simplify = FALSE
      ) %>%
      # ...and convert this bottommost list into a table.
      tidyr::as_tibble()
  }
  # ...but if this list contains another nested list, then recursively unnest its
  # contents and combine their tabular results.
  else if(purrr::is_scalar_list(v)) {
    # Take the contents within the nested list...
    v[[1]] %>%
      # ...apply this 'unnestable()' function to them recursively...
      sapply(
        FUN = unnestable,
        simplify = FALSE,
        USE.NAMES = TRUE
      ) %>%
      # ...and stack their results.
      dplyr::bind_rows(.id = names(v)[1])
  }
  # Otherwise, the format is unrecognized and yields no results.
  else {
    NULL
  }
}

Finally, process the JSON data as follows:

# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")


# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)


# View the raw table.
flat_data

Of course, you can reformat this table as needed:

library(lubridate)

flat_data <- flat_data %>%
  dplyr::transmute(
    user_id = as.character(user_id),
    date = lubridate::as_datetime(date),
    city = as.character(city)
  ) %>%
  dplyr::distinct()


# View the reformatted table.
flat_data

Results

Given a raw.json file like the one sampled here

{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}

unnestable() will then produce a tibble like this

# A tibble: 6 x 6
  user_id  date                city         zip   location     timestamp
  <chr>    <chr>               <chr>        <chr> <lgl>            <dbl>
1 sjohnson 2020-09-25          Denver       80014 NA                  NA
2 sjohnson 2020-10-01          Atlanta      30301 NA                  NA
3 sjohnson 2020-11-04          Jacksonville 14001 NA                  NA
4 asmith   2020-10-16          Cleavland    34321 NA                  NA
5 asmith   2020-11-10          Elmhurst     00013 NA                  NA
6 asmith   2020-11-10 08:49:36 NA           NA    NA       1605016176013

which the dplyr step then reformats into the following result:

# A tibble: 6 x 3
  user_id  date                city        
  <chr>    <dttm>              <chr>       
1 sjohnson 2020-09-25 00:00:00 Denver      
2 sjohnson 2020-10-01 00:00:00 Atlanta     
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith   2020-10-16 00:00:00 Cleavland   
5 asmith   2020-11-10 00:00:00 Elmhurst    
6 asmith   2020-11-10 08:49:36 NA
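
If the observation without a city is not wanted in the final table, it can be dropped afterwards, and the date-time can be reduced to a calendar date. The sketch below uses base R on a hand-built stand-in for a few rows of the reformatted table; the with_city name is hypothetical, and whether such rows should be discarded at all is an assumption about the goal rather than something the question states.

```r
# Hand-built stand-in for a few rows of the reformatted table shown above.
flat_data <- data.frame(
  user_id = c("sjohnson", "asmith", "asmith"),
  date = as.POSIXct(
    c("2020-09-25 00:00:00", "2020-11-10 00:00:00", "2020-11-10 08:49:36"),
    tz = "UTC"
  ),
  city = c("Denver", "Elmhurst", NA),
  stringsAsFactors = FALSE
)

# Keep only the observations that actually carry a city.
with_city <- flat_data[!is.na(flat_data$city), ]

# Reduce the date-time to a plain calendar date.
with_city$date <- as.Date(with_city$date)
with_city
```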

Details

List format

To be precise, the list represents a nested grouping by the fields {group_1, group_2, ..., group_n}, and it must be of the following form:

list(
  group_1 = list(
    "value_1" = list(
      group_2 = list(
        "value_1.1" = list(
          # .
          #  .
          #   .
               group_n = list(
                 "value_1.1.….n.1" = list(
                   field_a =    1,
                   field_b = TRUE
                 ),
                 "value_1.1.….n.2" = list(
                   field_a =   2,
                   field_c = "2"
                 )
                 # ...
               )
        ),
        "value_1.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    ),
    "value_2" = list(
      group_2 = list(
        "value_2.1" = list(
          # .
          #  .
          #   .
               group_n = list(
                 "value_2.1.….n.1" = list(
                   field_a =   3,
                   field_d = 3.0
                 )
                 # ...
               )
        ),
        "value_2.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    )
    # ...
  )
)

Table format

Given a list of this form, unnestable() flattens it into a table of the following form:

# A tibble: … x …
  group_1 group_2   ... group_n         field_a field_b field_c field_d
  <chr>   <chr>     ... <chr>             <dbl> <lgl>   <chr>     <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1       1 TRUE    NA           NA
2 value_1 value_1.1 ... value_1.1.….n.2       2 NA      2            NA
3 value_1 value_1.2 ... value_1.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
j value_2 value_2.1 ... value_2.1.….n.1       3 NA      NA            3
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
k value_2 value_2.2 ... value_2.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
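
As a concrete instance of this format, the sketch below takes the smallest case (a single grouping level, n = 1) with the field values from the template above and flattens it by hand in base R. It only illustrates the shape of the result, in particular how a field missing from a record becomes NA; it is not the unnestable() function, and the names nested, records and fields are hypothetical.

```r
# Smallest instance of the list format: one grouping level, two records.
nested <- list(
  group_1 = list(
    "value_1" = list(field_a = 1, field_b = TRUE),
    "value_2" = list(field_a = 2, field_c = "2")
  )
)

records <- nested[["group_1"]]

# Union of all field names that occur in any record.
fields <- unique(unlist(lapply(records, names)))

# Build one single-row data frame per record, filling absent fields with NA.
rows <- lapply(names(records), function(key) {
  rec <- records[[key]]
  row <- lapply(fields, function(f) if (is.null(rec[[f]])) NA else rec[[f]])
  names(row) <- fields
  data.frame(group_1 = key, row, stringsAsFactors = FALSE)
})

# Stack the rows: the group key becomes a column, the fields become columns.
flat <- do.call(rbind, rows)
flat
```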
