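I have recently been working with the IBTrACS dataset and want to convert it to a 2D numpy array with the correct data types. I did some filtering and selected the subset of the data I need, which is a 2D array with the following columns: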
Column number - Data type
0 - integer (season)
1 - string (name)
2 - timestamp
3-4 - float-typed columns
5-20 - other integer-typed columns
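I then filled the empty values with placeholders: None (NaN) for floats and -99999 for integers. When I used astype to make numpy recognise the data types in the array, it apparently could not handle them column by column, and it tried to convert the strings to integers even though that was not necessary.
Below is an MCVE.
Code: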
import numpy as np
import csv
from datetime import datetime
import pytz
# reading the dataset
with open('ibtracs.WP.list.v04r00.csv', 'r') as file:
    data = list(csv.reader(file, delimiter=','))
# remove CSV headers
ds = np.array(data[2:])
# selecting subsets of the data
mask_jtwc = ds[:,17] == 'jtwc_wp'
ds_jtwc = ds[mask_jtwc,:]
# remove unnecessary columns
columns_to_drop = [3,4] + list(range(8,13)) + [14,15,17,18,21,22, 25] + list(range(38,161))
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)
# further filtering
mask_nature = ds_jtwc[:,5] == 'TS'
ds_jtwc = ds_jtwc[mask_nature,:]
mask_tracktype = ds_jtwc[:,6] == 'main'
ds_jtwc = ds_jtwc[mask_tracktype,:]
mask_iflag = [True if item[0] != '_' else False for item in ds_jtwc[:,7]]
ds_jtwc = ds_jtwc[mask_iflag,:]
# remove columns that helped us perform the last step but not needed any more
columns_to_drop = [0,2,5,6,7]
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)
columns_to_drop = list(range(8, ds_jtwc.shape[1])) # representative columns only
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)
# manual processing to handle empty data and timestamps
dataset = ds_jtwc.tolist()
converted_set = []
for row in dataset:
    converted_row = []
    for i in range(len(row)):
        if i == 1: # string type
            converted_row.append(str(row[i]))
        elif i == 2: # timestamp
            timestamp = datetime.strptime(row[i], '%Y-%m-%d %H:%M:%S')
            # timestamp = timestamp.replace(tzinfo=pytz.UTC) # no need timezones for modern numpy
            converted_row.append(timestamp)
        elif i == 3 or i == 4: # float type
            if row[i] == " ":
                converted_row.append(None) # NaN
            else: converted_row.append(float(row[i]))
        else: # default to integers
            if row[i] == " ":
                converted_row.append(-99999) # placeholder
            else: converted_row.append(int(row[i]))
    converted_set.append(converted_row)
dataset = np.array(converted_set)
# get sample data for reference
random_index = np.random.choice(dataset.shape[0], size=1, replace=False)
print("Sample data (row {0}):".format(random_index))
print(dataset[random_index, :])
print("Sample data (row 1):")
print(dataset[0])
### Code in question ###
print(dataset)
print(dataset.dtype)
dataset = dataset.astype([
('SEASON', 'i'),
('NAME', 'S'),
('ISO_TIME', 'datetime64[s]'),
('USA_LAT', 'f'),('USA_LON', 'f'),
('USA_WIND', 'i'),('USA_PRES', 'i'),
('USA_R34_NE', 'i')
])
print(dataset)
print(dataset.dtype)
Output:
Sample data (row [38692]):
[[1999 'MAGGIE' datetime.datetime(1999, 6, 8, 6, 0) 23.6 111.0 20 -99999
-99999]]
Sample data (row 1):
[1945 'ANN' datetime.datetime(1945, 4, 19, 12, 0) 9.5 160.3 25 -99999
-99999]
[[1945 'ANN' datetime.datetime(1945, 4, 19, 12, 0) ... 25 -99999 -99999]
[1945 'ANN' datetime.datetime(1945, 4, 19, 18, 0) ... 30 -99999 -99999]
[1945 'ANN' datetime.datetime(1945, 4, 20, 0, 0) ... 35 -99999 -99999]
...
[2019 'PHANFONE' datetime.datetime(2019, 12, 28, 12, 0) ... 25 1009
-99999]
[2019 'PHANFONE' datetime.datetime(2019, 12, 28, 18, 0) ... 20 1011
-99999]
[2019 'PHANFONE' datetime.datetime(2019, 12, 29, 0, 0) ... 20 1010
-99999]]
object
Traceback (most recent call last):
File "D:\path\Documents\Programming\path\Dataset\_forstackoverflow.py", line 64, in <module>
dataset = dataset.astype([
ValueError: invalid literal for int() with base 10: 'ANN'
If I do not perform the astype step, the data type will turn out to be object, but I believe it will be more convenient in the future if the data are already of the right types. I have also tried to specify sizes, but it gave me an identical error.
Code:
dataset = dataset.astype([
('SEASON', 'i4'),
('NAME', 'U16'),
('ISO_TIME', 'datetime64[s]'),
('USA_LAT', 'f'),('USA_LON', 'f'),
('USA_WIND', 'i4'),('USA_PRES', 'i4'),
('USA_R34_NE', 'i4')
])
I wonder what is wrong with or missing in my astype call. Thanks in advance!
Answer:
To illustrate my last comment:
In [9]: arr = np.array([[1,2,'word'],[3,4,'other']])
In [10]: arr
Out[10]:
array([['1', '2', 'word'],
       ['3', '4', 'other']], dtype='<U21')
In [11]: arr.astype('i,i,U10')
Traceback (most recent call last):
File "<ipython-input-11-3800d012c681>", line 1, in <module>
arr.astype('i,i,U10')
ValueError: invalid literal for int() with base 10: 'word'
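As far as I can tell, the reason is that astype works cell by cell: with a compound dtype, every cell of the 2D array is cast to a whole record, so the cell's value is pushed into every field. A small sketch of that behaviour, using a made-up all-numeric string array (the name cells is mine, not from the question) so that the cast succeeds:

```python
import numpy as np

# Hypothetical all-numeric string array, so every cell can be cast to int
cells = np.array([['1', '2'], ['3', '4']])

# astype keeps the (2, 2) shape; each cell becomes one two-field record
records = cells.astype('i,U10')

print(records.shape)
print(records['f0'])  # integer field, filled from every cell
print(records['f1'])  # string field, filled from the same cells
```

So the columns are never matched to the fields. A cell holding 'word' is sent to the integer field like any other cell, and that is where the ValueError comes from.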
But if I make a list of tuples:
In [14]: alist = [tuple(row) for row in arr]
In [15]: alist
Out[15]: [('1', '2', 'word'), ('3', '4', 'other')]
In [16]: np.array(alist, dtype='i,i,U10')
Out[16]:
array([(1, 2, 'word'), (3, 4, 'other')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<U10')])
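The same idea should carry over to an object array like the one in the question. Below is a rough sketch under a few assumptions: the rows are made up (the second row's latitude and longitude are invented), the dtype is the sized one from the question with 'f4' for the floats, and I use float('nan') rather than None for the missing float, since I have not checked how None behaves in a structured field.

```python
from datetime import datetime

import numpy as np

# Hypothetical rows shaped like the question's object array (values partly invented)
rows = np.array([
    [1945, 'ANN', datetime(1945, 4, 19, 12, 0), 9.5, 160.3, 25, -99999, -99999],
    [2019, 'PHANFONE', datetime(2019, 12, 29, 0, 0), float('nan'), 135.0, 20, 1010, -99999],
], dtype=object)

record_dtype = np.dtype([
    ('SEASON', 'i4'), ('NAME', 'U16'), ('ISO_TIME', 'datetime64[s]'),
    ('USA_LAT', 'f4'), ('USA_LON', 'f4'),
    ('USA_WIND', 'i4'), ('USA_PRES', 'i4'), ('USA_R34_NE', 'i4'),
])

# One tuple per row: numpy reads each tuple as one record of the compound dtype
structured = np.array([tuple(row) for row in rows], dtype=record_dtype)

print(structured.shape)        # 1D: one record per original row
print(structured['NAME'])
print(structured['ISO_TIME'])
```

Note that the result is 1D, with the former columns available as named fields instead of a second axis.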
Or:
In [17]: import numpy.lib.recfunctions as rf
In [19]: rf.unstructured_to_structured(arr, np.dtype('i,i,U10'))
Out[19]:
array([(1, 2, 'word'), (3, 4, 'other')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<U10')])
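With named fields the result is easier to work with afterwards. A minimal sketch, assuming a plain string array as input (I have not tried this on the object-dtype array from the question); the rows and the three-field dtype are made up, borrowing column names from the question:

```python
import numpy as np
import numpy.lib.recfunctions as rf

# Hypothetical string rows using three of the question's column names
text_rows = np.array([['1945', 'ANN', '25'],
                      ['2019', 'PHANFONE', '-99999']])
named = np.dtype([('SEASON', 'i4'), ('NAME', 'U16'), ('USA_WIND', 'i4')])

table = rf.unstructured_to_structured(text_rows, named)
print(table['NAME'])

# Hide the question's -99999 placeholder when computing on a column
wind = np.ma.masked_equal(table['USA_WIND'], -99999)
print(wind.mean())
```

Masking the placeholder is just one option; the point is that each column can now be addressed by name with its own type.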
