Opening a large JSON file and converting it to CSV

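I am trying to convert a large JSON file (4.35 GB) to CSV.

My initial approach was to import it, convert it to a data frame (I only need what is in `features`), and then write that to CSV.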

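import json

import pandas as pd
from pandas import json_normalize

with open('Risk_of_Flooding_from_Rivers_and_Sea.json') as data_file:
    d = json.load(data_file)  # reads the entire file into memory

# Grabbing the data in 'features'.
json_df = json_normalize(d, 'features')
df = pd.DataFrame(json_df)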

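I managed this with a small sample of the full dataset, but I cannot import the whole file in one go, even after leaving it running for 9 hours. Although my PC has 16 GB of RAM, I suspect this is a memory problem, even though no error is raised.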

Here is a small sample of the JSON data I am working with:

{
    "type": "FeatureCollection",
    "crs": {
        "type": "name",
        "properties": {
            "name": "EPSG:27700"
        }
    },
    "features": [
        {
            "type": "Feature",
            "id": 1,
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            289344.50009999985,
                            60397.26009999961
                        ],
                        [
                            289347.2400000002,
                            60400
                        ]
                    ]
                ]
            },
            "properties": {
                "OBJECTID": 1,
                "prob_4band": "Low",
                "suitability": "National to County",
                "pub_date": 1522195200000,
                "shape_Length": 112.16436096255808,
                "shape_Area": 353.4856092588217
            }
        },
        {
            "type": "Feature",
            "id": 2,
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            289250,
                            60550
                        ],
                        [
                            289200,
                            60550
                        ]
                    ]
                ]
            },
            "properties": {
                "OBJECTID": 2,
                "prob_4band": "Very Low",
                "suitability": "National to County",
                "pub_date": 1522195200000,
                "shape_Length": 985.6295076665662,
                "shape_Area": 18755.1377842949
            }
        },

I have considered splitting the JSON file into smaller chunks, but my attempts were unsuccessful. With the code below, I get the error

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1).

import json
import os

with open(os.path.join('E:/Jupyter', 'Risk_of_Flooding_from_Rivers_and_Sea.json'), 'r',
          encoding='utf-8') as f1:
    # Parses every line as a separate JSON value; this is the call that raises the error.
    ll = [json.loads(line.strip()) for line in f1.readlines()]

    print(len(ll))

    size_of_the_split = 10000
    total = len(ll) // size_of_the_split

    print(total + 1)

    for i in range(total + 1):
        json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split], open(
            "E:/Jupyter/split" + str(i + 1) + ".json", 'w',
            encoding='utf-8'), ensure_ascii=False, indent=True)
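A likely reason for the error, assuming the file is laid out like the indented sample above: the whole file is a single JSON document spread over many lines, so its first line is just `{`, which is not valid JSON on its own. The small sketch below reproduces this with a tiny stand-in document; `first_line_error` is a hypothetical helper written for this illustration only.

```python
import json


def first_line_error(text):
    """Parse each line separately and return the first error message, or None."""
    for line in text.splitlines():
        try:
            json.loads(line.strip())
        except json.JSONDecodeError as exc:
            return str(exc)
    return None


# A tiny pretty-printed document, shaped like the sample above.
pretty = json.dumps({"type": "FeatureCollection", "features": []}, indent=4)

# Line-by-line parsing fails on the very first line, which is only "{".
print(first_line_error(pretty))

# Parsing the document as a whole works fine.
print(json.loads(pretty)["type"])
```

Line-by-line parsing only works for files that hold one complete JSON value per line (often called JSON Lines), which does not appear to be the case here.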

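I just want to know what my options are. Is the way I am doing this the best way? If so, what can I change?

I obtained the smaller samples from this source, but they cannot be very large.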

A uj5u.com user replied:

To split the data, you can use a streaming parser such as ijson, for example:

import ijson
import itertools
import json

chunk_size = 10_000

filename = 'Risk_of_Flooding_from_Rivers_and_Sea.json'

with open(filename, mode='rb') as file_in:
    features = ijson.items(file_in, 'features.item', use_float=True)
    chunk = list(itertools.islice(features, chunk_size))
    count = 1
    while chunk:
        with open(f'features-split-{count}.json', mode='w') as file_out:
            json.dump(chunk, file_out, ensure_ascii=False, indent=4)
        chunk = list(itertools.islice(features, chunk_size))
        count += 1
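If the intermediate split files are not needed, the same streaming idea can write the CSV directly. This is only a sketch under assumptions: the column list is taken from the `properties` object in the sample data, the nested coordinates are stored as a JSON string in a single cell, and the names `PROPERTY_FIELDS`, `feature_to_row`, `write_features_csv` and `convert` are made up for this illustration.

```python
import csv
import json

# Assumed columns, copied from the "properties" object in the sample above.
PROPERTY_FIELDS = ["OBJECTID", "prob_4band", "suitability",
                   "pub_date", "shape_Length", "shape_Area"]


def feature_to_row(feature):
    """Flatten one feature into a flat dict suitable for csv.DictWriter."""
    geometry = feature.get("geometry") or {}
    properties = feature.get("properties") or {}
    row = {
        "id": feature.get("id"),
        "geometry_type": geometry.get("type"),
        # Nested coordinate lists are kept as a JSON string in one cell.
        "coordinates": json.dumps(geometry.get("coordinates")),
    }
    for name in PROPERTY_FIELDS:
        row[name] = properties.get(name)
    return row


def write_features_csv(features, csv_path):
    """Write any iterable of feature dicts to csv_path, one row at a time."""
    fieldnames = ["id", "geometry_type", "coordinates"] + PROPERTY_FIELDS
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as file_out:
        writer = csv.DictWriter(file_out, fieldnames=fieldnames)
        writer.writeheader()
        for feature in features:
            writer.writerow(feature_to_row(feature))
            count += 1
    return count


def convert(json_path, csv_path):
    """Stream the features out of json_path with ijson and write them as CSV."""
    import ijson  # third-party (pip install ijson); imported only when needed

    with open(json_path, mode="rb") as file_in:
        features = ijson.items(file_in, "features.item", use_float=True)
        return write_features_csv(features, csv_path)
```

Calling `convert('Risk_of_Flooding_from_Rivers_and_Sea.json', 'features.csv')` would then hold only one feature in memory at a time, at the cost of fixing the column list up front.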

A uj5u.com user replied:

Following an online resource, pandas `df.to_csv` did the trick relatively quickly on a 181 MB JSON file. I assume it would do the same for larger files.

import wizzi_utils as wu  # pip install wizzi_utils
import pandas


def func():
    """
    source https://princekfrancis.medium.com/convert-large-json-file-into-csv-using-python-769d413b8afd
    json file from https://github.com/zemirco/sf-city-lots-json/blob/master/citylots.json
    :return:
    """
    json_path = './citylots.json'
    print('file {}: {}'.format(json_path, wu.file_or_folder_size(json_path)))

    timer_begin = wu.get_timer()
    j = wu.jt.load_json(json_path, ack=False)
    timer_end = wu.get_timer_delta(s_timer=timer_begin, with_ms=False)
    print('json loading time {}'.format(timer_end))

    timer_begin = wu.get_timer()
    df = pandas.json_normalize(j['features'])  # load json into data frame
    timer_end = wu.get_timer_delta(s_timer=timer_begin, with_ms=False)
    print('json_normalize time {}'.format(timer_end))

    csv_output_path = './citylots.csv'
    timer_begin = wu.get_timer()
    df.to_csv(csv_output_path, sep=',', encoding='utf-8')  # save as csv
    timer_end = wu.get_timer_delta(s_timer=timer_begin, with_ms=False)
    print('csv creation time {}'.format(timer_end))
    print('file {}: {}'.format(csv_output_path, wu.file_or_folder_size(csv_output_path)))
    return


def main():
    func()
    return


if __name__ == '__main__':
    # main()
    wu.main_wrapper(
        main_function=main,
        seed=42,
        ipv4=False,
        cuda_off=False,
        torch_v=False,
        tf_v=False,
        cv2_v=True,
        with_pip_list=False,
        with_profiler=False
    )

And the output:

file ./citylots.json: 180.99 MB
json loading time 0:00:05
json_normalize time 0:00:02
csv creation time 0:00:09
file ./citylots.csv: 132.63 MB
Total run time 0:00:18
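The run above appears to load the whole JSON file before normalizing it, which may not scale to the 4.35 GB file in the question. A possible middle ground, sketched here under assumptions, is to normalize the features in chunks and append each chunk to one CSV. The helper names `chunked` and `append_chunks_to_csv` are hypothetical, and the sketch assumes every chunk produces the same columns after `json_normalize`; if some features lack keys, the columns of later chunks could be misaligned.

```python
import itertools


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def append_chunks_to_csv(features, csv_path, chunk_size=10_000):
    """Normalize features chunk by chunk and append them to a single CSV."""
    import pandas as pd  # third-party; imported only when needed

    first = True
    for chunk in chunked(features, chunk_size):
        df = pd.json_normalize(chunk)
        # The first chunk creates the file and writes the header;
        # later chunks are appended without repeating the header.
        df.to_csv(csv_path, mode="w" if first else "a", header=first,
                  index=False, encoding="utf-8")
        first = False
```

Here `features` can be any iterable of feature dictionaries, for example a streaming iterator such as the one returned by `ijson.items(file_in, 'features.item')`, so that only one chunk is held in memory at a time.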


Tags: Python, json, pandas, CSV, big data
