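I am trying to convert a large JSON file (4.35 GB) to CSV.
My initial approach was to import it and convert it to a data frame (I only need what is in features) before writing the CSV.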
import json
import pandas as pd
from pandas import json_normalize

with open('Risk_of_Flooding_from_Rivers_and_Sea.json') as data_file:
    d = json.load(data_file)
# Grabbing the data in 'features'.
json_df = json_normalize(d, 'features')
df = pd.DataFrame(json_df)
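I managed this with a small sample of the full data set, but I cannot import the whole thing in one go, even after leaving it running for 9 hours. Although my PC has 16 GB of RAM, I assume this is a memory problem, even though no error is raised.
Here is a small sample of the JSON data I am working with: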
{
    "type": "FeatureCollection",
    "crs": {
        "type": "name",
        "properties": {
            "name": "EPSG:27700"
        }
    },
    "features": [
        {
            "type": "Feature",
            "id": 1,
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            289344.50009999985,
                            60397.26009999961
                        ],
                        [
                            289347.2400000002,
                            60400
                        ]
                    ]
                ]
            },
            "properties": {
                "OBJECTID": 1,
                "prob_4band": "Low",
                "suitability": "National to County",
                "pub_date": 1522195200000,
                "shape_Length": 112.16436096255808,
                "shape_Area": 353.4856092588217
            }
        },
        {
            "type": "Feature",
            "id": 2,
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            289250,
                            60550
                        ],
                        [
                            289200,
                            60550
                        ]
                    ]
                ]
            },
            "properties": {
                "OBJECTID": 2,
                "prob_4band": "Very Low",
                "suitability": "National to County",
                "pub_date": 1522195200000,
                "shape_Length": 985.6295076665662,
                "shape_Area": 18755.1377842949
            }
        },
I have considered splitting the JSON file into smaller chunks, but my attempts have not worked. With the code below, I get the error
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1).
import json
import os

with open(os.path.join('E:/Jupyter', 'Risk_of_Flooding_from_Rivers_and_Sea.json'), 'r',
          encoding='utf-8') as f1:
    ll = [json.loads(line.strip()) for line in f1.readlines()]
    print(len(ll))
    size_of_the_split = 10000
    total = len(ll) // size_of_the_split
    print(total + 1)
    for i in range(total + 1):
        json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split], open(
            "E:/Jupyter/split" + str(i + 1) + ".json", 'w',
            encoding='utf-8'), ensure_ascii=False, indent=True)
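The error is consistent with the file being one pretty-printed JSON document rather than one JSON object per line: parsing line by line hands json.loads a lone opening brace. A minimal sketch of that failure, assuming the first line of the file is just "{" as in the sample above:

```python
import json

# Assumption: the file is a single pretty-printed JSON document,
# so its first line contains nothing but the opening brace.
first_line = "{\n"

try:
    json.loads(first_line.strip())
    message = None
except json.JSONDecodeError as exc:
    # An opening brace with no key after it is not a complete JSON value.
    message = str(exc)

print(message)
```

The reported position (line 1 column 2, char 1) points just past that brace, which suggests the loop fails on the very first line of the file.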
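I am just wondering what my options are. Is the way I am doing this the best way? If so, what could I change?
I obtained the smaller samples from this source, but they cannot be very large.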
A uj5u.com user replied:
To split the data, you can use a streaming parser such as ijson, for example:
import ijson
import itertools
import json

chunk_size = 10_000
filename = 'Risk_of_Flooding_from_Rivers_and_Sea.json'
with open(filename, mode='rb') as file_in:
    # Stream the elements of the top-level 'features' array one at a time.
    features = ijson.items(file_in, 'features.item', use_float=True)
    chunk = list(itertools.islice(features, chunk_size))
    count = 1
    while chunk:
        with open(f'features-split-{count}.json', mode='w', encoding='utf-8') as file_out:
            json.dump(chunk, file_out, ensure_ascii=False, indent=4)
        chunk = list(itertools.islice(features, chunk_size))
        count += 1
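If the goal is a single CSV rather than split JSON files, the same stream of features could be written out row by row so that only one feature is held in memory at a time. The following is only a sketch under stated assumptions: flatten_feature and write_features_csv are hypothetical helper names, the dotted column names merely imitate the json_normalize style, every feature is assumed to have the same keys as the first one, and coordinates are stored as a JSON string in a single cell. It runs here on two in-memory features shaped like the sample in the question.

```python
import csv
import io
import json


def flatten_feature(feature):
    """Turn one GeoJSON-style feature into a flat dict usable as a CSV row."""
    row = {"id": feature.get("id")}
    # Give each property its own column, using dotted names.
    for key, value in feature.get("properties", {}).items():
        row[f"properties.{key}"] = value
    geometry = feature.get("geometry") or {}
    row["geometry.type"] = geometry.get("type")
    # Coordinates are nested lists, so keep them as one JSON string per cell.
    row["geometry.coordinates"] = json.dumps(geometry.get("coordinates"))
    return row


def write_features_csv(features, file_out):
    """Write any iterable of features to file_out, one row at a time."""
    writer = None
    count = 0
    for feature in features:
        row = flatten_feature(feature)
        if writer is None:
            # Assumption: the first feature defines the columns;
            # keys that appear only in later features are ignored.
            writer = csv.DictWriter(file_out, fieldnames=list(row),
                                    extrasaction="ignore")
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


# Demonstration with two features shaped like the sample in the question.
sample = [
    {"type": "Feature", "id": 1,
     "geometry": {"type": "Polygon",
                  "coordinates": [[[289344.5, 60397.26], [289347.24, 60400]]]},
     "properties": {"OBJECTID": 1, "prob_4band": "Low"}},
    {"type": "Feature", "id": 2,
     "geometry": {"type": "Polygon",
                  "coordinates": [[[289250, 60550], [289200, 60550]]]},
     "properties": {"OBJECTID": 2, "prob_4band": "Very Low"}},
]
buffer = io.StringIO()
rows_written = write_features_csv(sample, buffer)
print(rows_written)
print(buffer.getvalue())
```

With the real file, the features argument would be the ijson.items(file_in, 'features.item', use_float=True) iterator from the code above, and file_out would be a file opened in text mode with newline='' as the csv module recommends.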
A uj5u.com user replied:
Following an online resource, pandas df.to_csv did the trick relatively quickly with a 181 MB JSON file. I assume it would do the same for a larger file.
import wizzi_utils as wu  # pip install wizzi_utils
import pandas


def func():
    """
    source https://princekfrancis.medium.com/convert-large-json-file-into-csv-using-python-769d413b8afd
    json file from https://github.com/zemirco/sf-city-lots-json/blob/master/citylots.json
    :return:
    """
    json_path = './citylots.json'
    print('file {}: {}'.format(json_path, wu.file_or_folder_size(json_path)))

    timer_begin = wu.get_timer()
    j = wu.jt.load_json(json_path, ack=False)
    timer_end = wu.get_timer_delta(s_timer=timer_begin, with_ms=False)
    print('json loading time {}'.format(timer_end))

    timer_begin = wu.get_timer()
    df = pandas.json_normalize(j['features'])  # load json into data frame
    timer_end = wu.get_timer_delta(s_timer=timer_begin, with_ms=False)
    print('json_normalize time {}'.format(timer_end))

    csv_output_path = './citylots.csv'
    timer_begin = wu.get_timer()
    df.to_csv(csv_output_path, sep=',', encoding='utf-8')  # save as csv
    timer_end = wu.get_timer_delta(s_timer=timer_begin, with_ms=False)
    print('csv creation time {}'.format(timer_end))
    print('file {}: {}'.format(csv_output_path, wu.file_or_folder_size(csv_output_path)))
    return


def main():
    func()
    return


if __name__ == '__main__':
    # main()
    wu.main_wrapper(
        main_function=main,
        seed=42,
        ipv4=False,
        cuda_off=False,
        torch_v=False,
        tf_v=False,
        cv2_v=True,
        with_pip_list=False,
        with_profiler=False
    )
And the output:
file ./citylots.json: 180.99 MB
json loading time 0:00:05
json_normalize time 0:00:02
csv creation time 0:00:09
file ./citylots.csv: 132.63 MB
Total run time 0:00:18
