AWS SageMaker output: how to read a file containing multiple JSON objects spread over multiple lines

I have a bunch of JSON files that look like this:

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], "word": "blah blah blah"}
{"vector": [0.01027186680585146, 0.04181386157870293, -0.07363887131214142, ], "word": "blah blah blah"}
{"vector": [0.011699287220835686, 0.04741542786359787, -0.07899319380521774, ], "word": "blah blah blah"}

which I can read with:

import json

data = []
with open(file_name) as f:
    for line in f:
        # parse each line as its own JSON object
        data.append(json.loads(line))

But I have another file whose output looks like this:

{
    "predictions": [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]
    ]
}
{
    "predictions": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
    ]
}
{
    "predictions": [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]
    ]
}

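That is, each JSON object is formatted across multiple lines, so I can't simply read the JSON line by line. Is there a simple way to parse this? Or do I have to write something that stitches each JSON object together line by line and then calls json.loads?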

A uj5u.com user replied:

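Hmm, as far as I know there is unfortunately no way to load data in this JSONL-like format with a single json.loads call. One option, though, is to write a helper function that converts it into a valid JSON string, as shown below: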

import json

string = """
{
    "predictions": [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]
    ]
}
{
    "predictions": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
    ]
}
{
    "predictions": [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]
    ]
}
"""


def json_lines_to_json(s: str) -> str:
    # replace the first occurrence of '{'
    s = s.replace('{', '[{', 1)

    # replace the last occurrence of '}'
    s = s.rsplit('}', 1)[0] + '}]'

    # now go in and replace all occurrences of '}' immediately followed
    # by newline with a '},'
    s = s.replace('}\n', '},\n')

    return s


print(json.loads(json_lines_to_json(string)))

Prints:

[{'predictions': [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]]}, {'predictions': [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]}, {'predictions': [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]]}]
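
The helper above assumes that '}' followed by a newline only ever appears at the end of a top-level object, which would not hold if the objects themselves contained pretty-printed nested objects. As a possible alternative, here is a minimal sketch that uses the standard library's json.JSONDecoder.raw_decode to pull one value at a time out of the text. The function name iter_json_objects and the sample string are made up for illustration:

```python
import json


def iter_json_objects(text: str):
    """Yield each top-level JSON value found in `text`, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample in the same shape as the multi-line file above.
sample = '{\n  "predictions": [[0.0, 1.0]\n  ]\n}\n{\n  "predictions": [[0.5, 0.5]\n  ]\n}\n'
objects = list(iter_json_objects(sample))
print(objects)
```

For a real file you would pass the result of f.read() instead of the sample string; this reads the whole file into memory, which is presumably fine for prediction outputs of modest size.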

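Note: your first example doesn't actually look like valid JSON (or at least not JSON Lines as I understand it). In particular, this part appears to be invalid because of the comma after the last array element: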

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], ...}

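To make sure it is valid after calling the helper function, you would also need to remove the trailing comma, so that each line ends up in the following format: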

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664 ], ...},
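
One way to remove those trailing commas, sketched here under the assumption that no string value in the data contains a comma directly before a closing bracket (the regex would alter such a string too). The names TRAILING_COMMA and strip_trailing_commas, and the shortened sample values, are made up for illustration:

```python
import json
import re

# Matches a comma followed only by whitespace and then a closing ] or }.
TRAILING_COMMA = re.compile(r',\s*([\]}])')


def strip_trailing_commas(line: str) -> str:
    # Keep the closing bracket (group 1), drop the comma and whitespace.
    return TRAILING_COMMA.sub(r'\1', line)


# Hypothetical line with the same trailing-comma problem as the first example.
raw = '{"vector": [0.0179, 0.0520, -0.1460, ], "word": "blah blah blah"}'
record = json.loads(strip_trailing_commas(raw))
print(record)
```

Applying this to each line before parsing should let json.loads accept the first file as well.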

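There also seems to be a similar question where they suggest splitting on newlines and calling json.loads on each line. In practice, calling json.loads once per object (rather than once on the whole list) should be (slightly) slower, as shown below.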

from timeit import timeit
import json


string = """\
{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664 ], "word": "blah blah blah"}
{"vector": [0.01027186680585146, 0.04181386157870293, -0.07363887131214142 ], "word": "blah blah blah"}
{"vector": [0.011699287220835686, 0.04741542786359787, -0.07899319380521774 ], "word": "blah blah blah"}\
"""


def json_lines_to_json(s: str) -> str:

    # Strip newlines from end, then replace all occurrences of '}' followed
    # by a newline, by a '},' followed by a newline.
    s = s.rstrip('\n').replace('}\n', '},\n')

    # return string value wrapped in brackets (list)
    return f'[{s}]'


n = 10_000

print('string replace:        ', timeit(r'json.loads(json_lines_to_json(string))', number=n, globals=globals()))
print('json.loads each line:  ', timeit(r'[json.loads(line) for line in string.split("\n")]', number=n, globals=globals()))

Results:

string replace:         0.07599360000000001
json.loads each line:   0.1078384


Tags: Python, json, amazon-sagemaker
