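I have a very large jsonl file (several million lines).
I want to sort this file by a given value, but I don't want to load it entirely into RAM.
Do you have any suggested solutions?
I looked at jq, which has a sort_by option, but I don't think it processes the file in a streaming fashion.
Additional notes:
- The order between groups does not matter.
- If the method requires splitting the file, ending up with as many output files as there are usernames would also work for me.
Example:
Here is a dummy sample of my input file: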
{"username": "user1", "email": "email1", "value": "10"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user3", "email": "email1", "value": "40"}
This is the output I want:
{"username": "user1", "email": "email1", "value": "10"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user3", "email": "email1", "value": "40"}
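Answer from a uj5u.com user:
One approach is to convert the file into lines that can be sorted by a tool that works within limited memory, such as the Unix sort command-line utility.
Each line is prefixed with the username and a NUL separator, the lines are sorted, and the prefix is then stripped off again. You can use the following: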
jq -r '"\( .username )\u0000\( tojson )"' a.json |
sort |
jq -Rc '. / "\u0000" | .[-1] | fromjson'
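For the provided input, the above produces the output below. Note that sort compares whole lines, so records within each username group appear in lexicographic order of their JSON text rather than in input order: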
{"username":"user1","email":"email1","value":"10"}
{"username":"user1","email":"email1","value":"40"}
{"username":"user1","email":"email1","value":"5"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user3","email":"email1","value":"40"}
{"username":"user3","email":"email3","value":"15"}
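To illustrate the idea of sorting within limited memory, here is a minimal Python sketch of an external merge sort: it sorts fixed-size chunks in memory, spills each one to a temporary file, then merges the sorted chunks. It is not how the sort utility is implemented, only a sketch in the same spirit. The function name sort_jsonl_by_key and the chunk_size parameter are hypothetical, and the code assumes every record has a string "username" field. Unlike the pipeline above, it breaks ties by input position, so the order within each group is preserved.

```python
import heapq
import io
import json
import tempfile
from itertools import islice


def _spill(sorted_rows):
    """Write one sorted chunk to a temporary file and rewind it."""
    tmp = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    # json.dumps escapes newlines, so each row occupies exactly one line.
    tmp.writelines(json.dumps(row) + "\n" for row in sorted_rows)
    tmp.seek(0)
    return tmp


def _read_back(tmp):
    """Yield (key, seq, line) tuples from a spilled chunk, in sorted order."""
    for row in tmp:
        key, seq, line = json.loads(row)
        yield key, seq, line


def sort_jsonl_by_key(src, dst, key="username", chunk_size=100_000):
    """Group the JSON lines of `src` by `key` and write them to `dst`.

    `src` and `dst` are text file objects. Only about `chunk_size` lines
    are held in memory at any one time (assumption: keys are strings).
    """
    numbered = enumerate(line.rstrip("\n") for line in src if line.strip())
    chunks = []
    while True:
        rows = [(json.loads(line)[key], seq, line)
                for seq, line in islice(numbered, chunk_size)]
        if not rows:
            break
        rows.sort()  # seq is unique, so ties on key keep input order
        chunks.append(_spill(rows))
    try:
        # heapq.merge keeps only one pending row per chunk in memory.
        for _key, _seq, line in heapq.merge(*map(_read_back, chunks)):
            dst.write(line + "\n")
    finally:
        for tmp in chunks:
            tmp.close()


if __name__ == "__main__":
    demo = io.StringIO('{"username": "b"}\n{"username": "a"}\n')
    out = io.StringIO()
    sort_jsonl_by_key(demo, out, chunk_size=1)
    print(out.getvalue(), end="")
```

With real files you would pass open file objects, for example open("a.json", encoding="utf-8") as the source, and pick a chunk_size that fits comfortably in the memory you have available.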
Alternatively, you could generate a TSV (for example with jq -r '"\( .username )\t\( tojson )"') that can be loaded into a database. A simple SQL query would then extract the sorted JSON lines.
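The database route can be sketched with Python's built-in sqlite3 module instead of a TSV import. This is only an illustration under some assumptions: the function name sort_jsonl_with_sqlite, the table name records, and its column names are hypothetical, and every record is assumed to have a "username" field. The default ":memory:" database is for demonstration only; for a large file, pass the path of a new database file on disk so the rows are not kept in RAM.

```python
import io
import json
import sqlite3


def sort_jsonl_with_sqlite(src, dst, db_path=":memory:", key="username"):
    """Load JSON lines from `src` into SQLite and write them to `dst`
    grouped by `key`. `src` and `dst` are text file objects.

    Assumption: `db_path` names a fresh database (no `records` table yet).
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE records (sort_key TEXT, raw TEXT)")
        # executemany consumes the generator row by row, so the input
        # file is never materialized as a list in memory.
        conn.executemany(
            "INSERT INTO records (sort_key, raw) VALUES (?, ?)",
            ((json.loads(line)[key], line.rstrip("\n"))
             for line in src if line.strip()),
        )
        conn.commit()
        # rowid reflects insertion order, so ties keep the input order.
        query = "SELECT raw FROM records ORDER BY sort_key, rowid"
        for (raw,) in conn.execute(query):
            dst.write(raw + "\n")
    finally:
        conn.close()


if __name__ == "__main__":
    demo = io.StringIO('{"username": "b"}\n{"username": "a"}\n')
    out = io.StringIO()
    sort_jsonl_with_sqlite(demo, out)
    print(out.getvalue(), end="")
```

Once the rows are in the database, the same table can also serve the per-username split mentioned in the question, by running one SELECT per distinct sort_key value.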
