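I found a similar question here: Read CSV with PyArrow
The answer there refers to sys.stdin.buffer and sys.stdout.buffer, but I am not sure how to use that to write a .arrow file, or how to name it. I cannot find the exact information I am looking for in the pyarrow documentation. My file will not contain any NaNs, but it will have a timestamp index. The file is about 100 GB, so loading it into memory is not an option. I tried changing the code, but as I expected, it ends up overwriting the previous file on every loop iteration.
***This is my first post. I would like to thank all the contributors who answered 99.9% of my other questions before I even had to ask them.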
import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1    ### used one line chunks for a small test

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pa.Table.from_pandas(split)
        # Write out to file
        with pa.OSFile('test.arrow', 'wb') as sink:    ### no append mode yet
            with pa.RecordBatchFileWriter(sink, table.schema) as writer:
                writer.write_table(table)
    writer.close()

if __name__ == "__main__":
    main()
This is the command I run on the command line:
>cat data.csv | python test.py
Answer from a uj5u.com user:
As @Pace suggested, you should consider moving the creation of the output file outside the read loop. Something like this:
import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1    ### used one line chunks for a small test

def main():
    # The writer needs a schema before the loop starts, so infer it from a small sample
    schema = pa.Table.from_pandas(pd.read_csv('data.csv', nrows=2)).schema
    # Write out to file
    with pa.OSFile('test.arrow', 'wb') as sink:    ### no append mode yet
        with pa.RecordBatchFileWriter(sink, schema) as writer:
            for split in pd.read_csv('data.csv', chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)

if __name__ == "__main__":
    main()
You also do not have to use sys.stdin.buffer if you would rather name specific input and output files. You can then run the script as:
python test.py
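Because of the with statements, writer and then sink are both closed automatically afterwards (in this case, when main() returns). This means there is no need to include an explicit close() call.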
Answer from a uj5u.com user:
A solution adapted from @Martin-Evans's code:
Following @Pace's suggestion, the file is closed after the for loop.
import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1000000

def main():
    ### reads first two lines to define schema
    schema = pa.Table.from_pandas(pd.read_csv('Data.csv', nrows=2)).schema

    with pa.OSFile('test.arrow', 'wb') as sink:
        with pa.RecordBatchFileWriter(sink, schema) as writer:
            for split in pd.read_csv('Data.csv', chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)
            # Leaving this with block closes the writer once the loop is done,
            # so no explicit writer.close() is needed

if __name__ == "__main__":
    main()
