I am trying to read the February 2019 FHV trip data, which is available in Parquet format here:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2019-02.parquet
But when I try to read the file with pandas:
import pandas as pd
df = pd.read_parquet('fhv_tripdata_2019-02.parquet')
it raises an error:
File "pyarrow/table.pxi", line 1156, in pyarrow.lib.table_to_blocks
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 33106123800000000
Does anyone know how to print the offending rows, or how to coerce those values? Or how to make pandas skip those rows?
Answer:
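One of the rows in this dataset has its dropOff_datetime set to 3019-02-03 17:30:00.000000. That is outside the range a nanosecond-resolution pandas.Timestamp can represent. I assume the intended value was 2019-02-03 17:30:00.000000.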
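As a quick check on that reading, the integer in the error message can be decoded by hand using only the standard library. This is a minimal sketch; the variable names are illustrative and do not come from the original post. It treats the value as microseconds since the Unix epoch (the `timestamp[us]` in the error) and compares it with the largest instant a signed 64-bit nanosecond counter can hold.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Value copied from the ArrowInvalid message: microseconds since the epoch.
bad_us = 33_106_123_800_000_000
decoded = EPOCH + timedelta(microseconds=bad_us)

# Largest instant representable as int64 nanoseconds since the epoch,
# truncated to microseconds because datetime has no nanosecond field.
INT64_MAX = 2**63 - 1
ns_limit = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

print(decoded)             # 3019-02-03 17:30:00
print(ns_limit)            # 2262-04-11 23:47:16.854775
print(decoded > ns_limit)  # True
```

The decoded date is roughly 750 years past the nanosecond limit, which is why the cast to `timestamp[ns]` is rejected.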
One option is to ignore the error by disabling the safe-cast check:
import pyarrow.parquet as pq
df = pq.read_table('fhv_tripdata_2019-02.parquet').to_pandas(safe=False)
However, the bad timestamp then overflows and ends up as a nonsensical value:
>>> df['dropOff_datetime'].min()
Timestamp('1849-12-25 18:20:52.580896768')
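That odd value is consistent with plain 64-bit wraparound. The sketch below assumes the unchecked cast multiplies the microsecond count by 1000 and keeps only the low 64 bits as a signed integer; that is an assumption about the mechanism, not something stated in the original answer. `wrap_int64` is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timedelta

def wrap_int64(n: int) -> int:
    """Reduce an arbitrary integer to the signed 64-bit range (two's complement)."""
    return (n + 2**63) % 2**64 - 2**63

bad_us = 33_106_123_800_000_000        # microseconds, from the error message
wrapped_ns = wrap_int64(bad_us * 1000)  # nanoseconds after overflow

# Split into whole seconds and leftover nanoseconds; divmod floors toward
# negative infinity, so the nanosecond part stays non-negative.
seconds, nanos = divmod(wrapped_ns, 1_000_000_000)
wrapped = datetime(1970, 1, 1) + timedelta(seconds=seconds)

print(wrapped, nanos)  # 1849-12-25 18:20:52 580896768
```

The result matches the minimum shown above, so the 1849 date is the year-3019 row in disguise rather than a second bad record.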
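Alternatively, you can filter out the out-of-range values in pyarrow before converting:
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("fhv_tripdata_2019-02.parquet")
# Keep only rows whose drop-off time is no later than the largest pandas Timestamp.
df = table.filter(
    pc.less_equal(table["dropOff_datetime"], pa.scalar(pd.Timestamp.max))
).to_pandas()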
