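I ran into a strange problem when storing numbers with many digits (18 digits) as Parquet and reading them back: I get different values. Drilling further, it looks like the problem only occurs when the input list is a mix of None and actual values. If the list has no None values, the values come back as expected.
I don't think it is a display issue. I tried viewing the data with Unix tools such as cat and the vi editor, and the values are still different.
The code has 2 parts:
1. Create a Parquet file from a list that mixes None and large numbers. This is where the problem occurs. For example, the value 235313013750949476 changes to 235313013750949472, as shown in the output.
2. Create a Parquet file from a list with only large numbers and no None values. This works as expected.
Code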
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def get_row_list():
    row_list = []
    row_list.append(None)
    row_list.append(235313013750949476)
    row_list.append(None)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    row_list.append(None)
    row_list.append(None)
    return row_list

def get_row_list_with_no_none():
    row_list = []
    row_list.append(235313013750949476)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    return row_list

def create_parquet(row_list, col_list, parquet_filename):
    df = pd.DataFrame(row_list, columns=col_list)
    schema_field_list = [('tree_id', pa.int64())]
    pa_schema = pa.schema(schema_field_list)
    table = pa.Table.from_pandas(df, pa_schema)
    pq_writer = pq.ParquetWriter(parquet_filename,
                                 schema=pa_schema)
    pq_writer.write_table(table)
    pq_writer.close()
    print("Parquet file [%s] created" % parquet_filename)

def main():
    col_list = ['tree_id']
    # Row list without any none
    row_list = get_row_list_with_no_none()
    print(row_list)
    create_parquet(row_list, col_list, 'without_none.parquet')
    # Row list with none
    row_list = get_row_list()
    print(row_list)
    create_parquet(row_list, col_list, 'with_none.parquet')

# ==== Main code Execution =====
if __name__ == '__main__':
    main()
[Execution]
python test-parquet.py
[235313013750949476, 135313013750949496, 935313013750949406, 835313013750949456]
Parquet file [without_none.parquet] created
[None, 235313013750949476, None, 135313013750949496, 935313013750949406, 835313013750949456, None, None]
Parquet file [with_none.parquet] created
[Library versions]
pyarrow 5.0.0
pandas 1.1.5
python -V
Python 3.6.6
[Testing by reading the Parquet files as Spark DataFrames]
>>> dfwithoutnone = spark.read.parquet("s3://some-bucket/without_none.parquet/")
>>> dfwithoutnone.count()
4
>>> dfwithoutnone.printSchema()
root
|-- tree_id: long (nullable = true)
>>> dfwithoutnone.show(10, False)
+------------------+
|tree_id           |
+------------------+
|235313013750949476|
|135313013750949496|
|935313013750949406|
|835313013750949456|
+------------------+
>>> df_with_none = spark.read.parquet("s3://some-bucket/with_none.parquet/")
>>> df_with_none.count()
8
>>> df_with_none.printSchema()
root
|-- tree_id: long (nullable = true)
>>> df_with_none.printSchema()
root
|-- tree_id: long (nullable = true)
>>> df_with_none.show(10, False)
+------------------+
|tree_id           |
+------------------+
|null              |
|235313013750949472|
|null              |
|135313013750949504|
|935313013750949376|
|835313013750949504|
|null              |
|null              |
+------------------+
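I did search StackOverflow and could not find anything that fits. Could you give me some pointers?
Thanks
Answer from a uj5u.com user:
The problem is not related to Parquet. It comes from the initial conversion of your row_list to a pandas DataFrame: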
row_list = get_row_list()
col_list = ['tree_id']
df = pd.DataFrame(row_list, columns=col_list)
>>> df
        tree_id
0           NaN
1  2.353130e+17
2           NaN
3  1.353130e+17
4  9.353130e+17
5  8.353130e+17
6           NaN
7           NaN
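Because of the missing values, pandas creates a float64 column. It is this int -> float conversion that loses precision for integers this large.
Converting the floats back to integers later (when the pyarrow table is created with a schema that forces an integer column) then yields slightly different values, as you can see when doing the same thing manually in Python: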
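The dtype inference described above can be checked directly. The following is a minimal sketch, assuming a reasonably recent pandas; the variable names `with_none` and `without_none` are illustrative and do not come from the original post.

```python
import pandas as pd

# A list that mixes None with integers: pandas falls back to float64 so it can store NaN.
with_none = pd.DataFrame([None, 235313013750949476], columns=["tree_id"])

# A list that contains only integers: pandas keeps an integer column.
without_none = pd.DataFrame([235313013750949476], columns=["tree_id"])

print(with_none["tree_id"].dtype)
print(without_none["tree_id"].dtype)
```

Printing the dtype before writing the file is a cheap way to notice that a column meant to hold integers has silently become floating point.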
>>> row_list[1]
235313013750949476
>>> df.loc[1, "tree_id"]
2.3531301375094947e+17
>>> int(df.loc[1, "tree_id"])
235313013750949472
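One possible solution is to avoid the temporary DataFrame. This of course depends on your exact (real) use case, but if you start from a Python list as in the reproducible example above, you can also create a pyarrow.Table directly from this list of values (pa.table({"tree_id": row_list}, schema=..)), which preserves the exact values in the Parquet file.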
