Python: streaming gzip files from S3

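I have files in S3 that are chunks of a single gzip stream, so I have to read the data sequentially and cannot read an arbitrary chunk on its own. I always have to start with the first file.

For example, say I have 3 gzip files in S3: f1.gz, f2.gz, and f3.gz. If I download them all locally, I can run cat * | gzip -d. If I run cat f2.gz | gzip -d instead, it fails with gzip: stdin: not in gzip format.

How can I stream this data from S3 using Python? I found smart-open, which is able to decompress gz files: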

from smart_open import smart_open, open

with open(path, compression='.gz') as f:
    for line in f:
        print(line.strip())

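Here path is f1.gz. This works until it reaches the end of the file, at which point it aborts. The same thing happens locally: if I run cat f1.gz | gzip -d, it errors out at the end with gzip: stdin: unexpected end of file.

Is there a way to make it stream the files continuously using Python?

This version does not abort, and it can iterate through f1.gz, f2.gz, and f3.gz: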

with open(path, 'rb', compression='disable') as f:
    for line in f:
        print(line.strip(), end="")

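But the output is just bytes. I thought it would work to run python test.py | gzip -d with the code above, but I get the error gzip: stdin: not in gzip format. Is there a way to have Python, using smart-open, print output that gzip can read?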
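One likely reason the pipe into gzip -d fails: print() on a bytes object writes its repr (b'...') rather than the raw bytes, and strip() removes whitespace bytes that are part of the compressed stream. A minimal sketch of passing the bytes through unchanged follows. copy_raw is a hypothetical helper name, and the io.BytesIO source is only a stand-in for the binary handle opened with compression='disable' above.

```python
import io
import sys


def copy_raw(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst unchanged, in fixed-size chunks."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Stand-in for the handle opened with compression disabled (assumption).
    source = io.BytesIO(b"\x1f\x8b placeholder bytes, not a real gzip stream\n")
    # sys.stdout.buffer is the binary layer underneath sys.stdout.
    copy_raw(source, sys.stdout.buffer)
    sys.stdout.buffer.flush()
```

Calling copy_raw once per chunk file, in order, should hand gzip -d the same byte stream that cat f1.gz f2.gz f3.gz would produce.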

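Answer from a uj5u.com user:

"For example, say I have 3 gzip files in S3: f1.gz, f2.gz, and f3.gz. If I download them all locally, I can run cat * | gzip -d."

One idea is to write a file object that does exactly this. The file object reads from one file handle until it is exhausted, then reads from the next file handle until it is exhausted, and so on. This is similar to how cat works internally.

The convenient part is that it behaves the same as concatenating all the files, without the memory cost of reading all of them at once.

Once you have the combined file object wrapper, you can pass it to Python's gzip module to decompress the data.

Example: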

import gzip

class ConcatFileWrapper:
    def __init__(self, files):
        self.files = iter(files)
        self.current_file = next(self.files)
    def read(self, *args):
        ret = self.current_file.read(*args)
        if len(ret) == 0:
            # EOF
            # Optional: close self.current_file here
            # self.current_file.close()
            # Advance to next file and try again
            try:
                self.current_file = next(self.files)
            except StopIteration:
                # Out of files
                # Return an empty string
                return ret
            # Recurse and try again
            return self.read(*args)
        return ret
    def write(self):
        raise NotImplementedError()

filenames = ["xaa", "xab", "xac", "xad"]
filehandles = [open(f, "rb") for f in filenames]
wrapper = ConcatFileWrapper(filehandles)

with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)

# Close all files
for f in filehandles:
    f.close()

Here is how I tested it:

I created the test files with the following commands.

Create a file containing the numbers 1 to 1000.

$ seq 1 1000 > foo

Compress it.

$ gzip foo

Split the file. This produces four files named xaa through xad.

$ split -b 500 foo.gz

Run the Python file above against these files; it should print the numbers 1 to 1000.
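The same check can be sketched in pure Python, without the shell tools. This is a stand-in for the commands above, not the code that was actually run: split_bytes and ChainedReader are hypothetical names, the chunks live in io.BytesIO objects instead of files, and ChainedReader is a loop-based variant of the ConcatFileWrapper idea rather than the class shown earlier.

```python
import gzip
import io


def split_bytes(data, size):
    """Cut data into consecutive pieces of at most `size` bytes (like split -b)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChainedReader:
    """Read-only file-like object that reads several handles back to back."""

    def __init__(self, handles):
        self._handles = iter(handles)
        self._current = next(self._handles, None)

    def read(self, size=-1):
        if size == 0:
            return b""
        while self._current is not None:
            data = self._current.read(size)
            if data:
                return data
            # Current handle is exhausted; move on to the next one.
            self._current = next(self._handles, None)
        return b""  # every handle is exhausted


# Rough equivalent of: seq 1 1000 > foo; gzip foo
original = "".join(f"{n}\n" for n in range(1, 1001)).encode()
compressed = gzip.compress(original)

# Rough equivalent of: split -b 500 foo.gz
chunks = split_bytes(compressed, 500)

reader = ChainedReader(io.BytesIO(chunk) for chunk in chunks)
with gzip.open(reader) as gf:
    lines = [line.decode().strip() for line in gf]

print(lines[0], lines[-1], len(lines))
```

Because no chunk after the first starts with a gzip header, decompressing any of them alone fails, while the chained reader presents them to gzip as one continuous stream.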

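Edit: an additional note on opening files lazily

If you have a large number of files, you may want to have only one file open at a time. Here is an example: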

def open_files(filenames):
    for filename in filenames:
        # Note: this will leak file handles unless you uncomment the code above that closes the file handles again.
        yield open(filename, "rb")
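
To avoid the leak mentioned in that comment, the wrapper can close each handle as soon as it is exhausted. The sketch below is one way to do that and is not the answer's original code: LazyConcatReader is a hypothetical name, it accepts any iterable of opened binary handles (such as the open_files generator), and it pulls the next handle only when the current one reaches EOF. The temporary chunk files exist only to make the demo self-contained.

```python
import gzip
import os
import tempfile


def open_files(filenames):
    """Yield binary handles one at a time; nothing is opened until requested."""
    for filename in filenames:
        yield open(filename, "rb")


class LazyConcatReader:
    """Reads handles from an iterator in order, closing each one at EOF."""

    def __init__(self, handles):
        self._handles = iter(handles)
        self._current = None

    def read(self, size=-1):
        if size == 0:
            return b""
        while True:
            if self._current is None:
                # Pull (and thereby open) the next handle only when needed.
                self._current = next(self._handles, None)
                if self._current is None:
                    return b""  # no handles left: overall EOF
            data = self._current.read(size)
            if data:
                return data
            # Exhausted: close it before moving on so handles do not pile up.
            self._current.close()
            self._current = None

    def close(self):
        """Close the handle currently in use, if any."""
        if self._current is not None:
            self._current.close()
            self._current = None


# Demo with temporary chunk files standing in for the real ones.
payload = b"".join(b"%d\n" % n for n in range(1, 1001))
compressed = gzip.compress(payload)
with tempfile.TemporaryDirectory() as tmp:
    names = []
    for i in range(0, len(compressed), 500):
        name = os.path.join(tmp, f"chunk{i // 500:03d}")
        with open(name, "wb") as out:
            out.write(compressed[i:i + 500])
        names.append(name)

    reader = LazyConcatReader(open_files(names))
    with gzip.open(reader) as gf:
        result = gf.read()
    reader.close()

print(result == payload)
```

With this design at most one chunk file is open at any moment, at the cost of not being able to seek backwards across chunk boundaries.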

