How do I access nested data in a pandas DataFrame?

Here is a sample of the data I am working with:

  values                                            variable.variableName    timeZone
0 [{'value': [],                                    turbidity                PST
  'qualifier': [], 
  'qualityControlLevel': [], 
  'method': [{
      'methodDescription': '[TS087: YSI 6136]', 
      'methodID': 15009}], 
  'source': [], 
  'offset': [], 
  'sample': [], 
  'censorCode': []}, 
 {'value': [{
      'value': '17.2', 
      'qualifiers': ['P'], 
      'dateTime': '2022-01-05T12:30:00.000-08:00'},
     {'value': '17.5', 
      'qualifiers': ['P'], 
      'dateTime': '2022-01-05T14:00:00.000-08:00'}
  }]
1 [{'value':                                        degC                     PST
     [{'value': '9.3', 
       'qualifiers': ['P'], 
       'dateTime': '2022-01-05T12:30:00.000-08:00'}, 
      {'value': '9.4', 
       'qualifiers': ['P'], 
       'dateTime': '2022-01-05T12:45:00.000-08:00'},
  }]

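I am trying to split each variable in the data into its own DataFrame. What I have so far works, except when a variable has more than one value set (turbidity, for example): it only pulls in the first set, which is sometimes empty. How can I extract all of the value sets? Here is what I have so far: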

import requests
import pandas as pd

url = ('https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json')
response = requests.get(url)
result = response.json()

json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)

new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']

# print turbidity
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light, '
                 '780-900 nm, detection angle 90 &#177;2.5&#176;, formazin nephelometric units (FNU)'])

This outputs:

turbidity df
Empty DataFrame
Columns: []
Index: []

degC df
     value        qualifiers       dateTime
0    9.3          P                2022-01-05T12:30:00.000-08:00    
1    9.4          P                2022-01-05T12:45:00.000-08:00

Whereas I would like the output to look like this:

turbidity df
     value        qualifiers       dateTime
0    17.2         P                2022-01-05T12:30:00.000-08:00    
1    17.5         P                2022-01-05T14:00:00.000-08:00


degC df
     value        qualifiers       dateTime
0    9.3          P                2022-01-05T12:30:00.000-08:00    
1    9.4          P                2022-01-05T12:45:00.000-08:00

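Unfortunately, it only grabs the first value set, which is empty in the case of turbidity. How can I grab all of them, or check whether the DataFrame is empty and grab the next one?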

Answer:

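I believe the missing link here is DataFrame.explode(): it lets you split a single row that holds a list of values (your "values" column) into multiple rows.

You can then use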

new_df = df.explode("values")

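This splits the "turbidity" row into two rows, one per value set.

You can then filter out the rows whose "value" entry is empty and apply .explode() again.

After that, you can use pd.json_normalize again to expand the value dictionaries into multiple columns, or look at Series.str.get() to extract individual elements from a dictionary or list.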
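
The steps above can be sketched end to end on a small hand-built sample. The `records` list below is invented to mimic the structure shown in the question and is not real USGS output; the column name `readings` is also just an illustrative choice.

```python
import pandas as pd

# Hypothetical sample mimicking the nested structure from the question.
records = [
    {
        "variable": {"variableName": "turbidity"},
        "values": [
            {"value": [], "method": [{"methodID": 15009}]},
            {"value": [
                {"value": "17.2", "qualifiers": ["P"],
                 "dateTime": "2022-01-05T12:30:00.000-08:00"},
                {"value": "17.5", "qualifiers": ["P"],
                 "dateTime": "2022-01-05T14:00:00.000-08:00"},
            ]},
        ],
    },
    {
        "variable": {"variableName": "degC"},
        "values": [
            {"value": [
                {"value": "9.3", "qualifiers": ["P"],
                 "dateTime": "2022-01-05T12:30:00.000-08:00"},
                {"value": "9.4", "qualifiers": ["P"],
                 "dateTime": "2022-01-05T12:45:00.000-08:00"},
            ]},
        ],
    },
]

df = pd.json_normalize(records)

# One row per value set: the "turbidity" row becomes two rows.
sets = df.explode("values").reset_index(drop=True)

# Pull the inner "value" list out of each value-set dictionary.
sets["readings"] = sets["values"].str.get("value")

# Drop value sets whose list is empty, then explode again: one row per reading.
sets = sets[sets["readings"].str.len() > 0]
readings = sets.explode("readings").reset_index(drop=True)

# Expand each reading dictionary into columns and attach the variable name.
flat = pd.json_normalize(readings["readings"].tolist())
flat.insert(0, "variableName", readings["variable.variableName"])
print(flat)
```

The empty first value set for turbidity is filtered out before the second explode, so only the populated readings remain.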

Answer:

This JSON is deeply nested, so I think it takes a few steps to turn it into what you want.

# First, use json_normalize on the timeSeries list (json_list = result['value']['timeSeries']
# from the question) to extract values and variableName.
df = pd.json_normalize(json_list, record_path=['values'], meta=[['variable', 'variableName']])

# Then explode the value to flatten the array and filter out any empty array
df = df.explode('value').dropna(subset=['value'])

# Another json_normalize on the exploded value to extract the value and qualifier and dateTime, concat with variableName.
# explode('qualifiers') is to take out wrapping array.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True), 
                pd.json_normalize(df.value).explode('qualifiers')], axis=1)

The resulting DataFrame should look like this.

    variable.variableName      value qualifiers               dateTime
0   Temperature, water, &#176;C 10.7          P 2022-01-06T12:15:00.000-08:00
1   Temperature, water, &#176;C 10.7          P 2022-01-06T12:30:00.000-08:00
2   Temperature, water, &#176;C 10.7          P 2022-01-06T12:45:00.000-08:00
3   Temperature, water, &#176;C 10.8          P 2022-01-06T13:00:00.000-08:00

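If you are going to do further processing, it is best to keep everything in one DataFrame. If you really do need separate DataFrames, pull them out by filtering.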

df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]
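
If one DataFrame per variable is needed, an alternative to filtering each name by hand is to build a dictionary keyed by variable name with groupby. This is a minimal sketch, assuming a flattened frame shaped like the result above; the `flat` data and the short variable names are invented for illustration.

```python
import pandas as pd

# Hypothetical flattened frame, shaped like the result of the steps above.
flat = pd.DataFrame({
    "variable.variableName": ["turbidity", "turbidity", "degC", "degC"],
    "value": ["17.2", "17.5", "9.3", "9.4"],
    "qualifiers": ["P", "P", "P", "P"],
    "dateTime": [
        "2022-01-05T12:30:00.000-08:00",
        "2022-01-05T14:00:00.000-08:00",
        "2022-01-05T12:30:00.000-08:00",
        "2022-01-05T12:45:00.000-08:00",
    ],
})

# One DataFrame per variable, each with a fresh 0..n-1 index.
frames = {
    name: group.drop(columns="variable.variableName").reset_index(drop=True)
    for name, group in flat.groupby("variable.variableName")
}

print(frames["turbidity"])
```

Looking up `frames["degC"]` then returns the temperature readings without having to repeat the long variable name in a filter expression.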
