Converting mixed data types in a 2D NumPy array with astype

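I have been working with the IBTrACS dataset recently and want to convert it into a 2D numpy array with the correct data types. After some filtering, I selected the subset of data I need, which is a 2D array with the following columns: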

Column number - Data type
0 - integer (season)
1 - string (name)
2 - timestamp
3-4 - float-typed columns
5-20 - other integer-typed columns

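I then filled the empty values with placeholders: None (NaN) for floats and -99999 for integers. When I used astype to make numpy recognize the data types in the array, it apparently could not handle them column by column, and it tried to convert the strings to integers even where that was not needed.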

Below is an MCVE.

Code:

import numpy as np
import csv
from datetime import datetime
import pytz

# reading the dataset
with open('ibtracs.WP.list.v04r00.csv', 'r') as file:
    data = list(csv.reader(file, delimiter=','))
# remove CSV headers
ds = np.array(data[2:])

# selecting subsets of the data
mask_jtwc = ds[:,17] == 'jtwc_wp'
ds_jtwc = ds[mask_jtwc,:]
# remove unnecessary columns
columns_to_drop = [3,4] + list(range(8,13)) + [14,15,17,18,21,22, 25] + list(range(38,161))
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)
# further filtering
mask_nature = ds_jtwc[:,5] == 'TS'
ds_jtwc = ds_jtwc[mask_nature,:]
mask_tracktype = ds_jtwc[:,6] == 'main'
ds_jtwc = ds_jtwc[mask_tracktype,:]
mask_iflag = [True if item[0] != '_' else False for item in ds_jtwc[:,7]]
ds_jtwc = ds_jtwc[mask_iflag,:]
# remove columns that helped us perform the last step but not needed any more
columns_to_drop = [0,2,5,6,7]
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)
columns_to_drop = list(range(8, ds_jtwc.shape[1])) # representative columns only 
ds_jtwc = np.delete(ds_jtwc, columns_to_drop, 1)

# manual processing to handle empty data and timestamps
dataset = ds_jtwc.tolist()
converted_set = []
for row in dataset:
    converted_row = []
    for i in range(len(row)):
        if i == 1: # string type
            converted_row.append(str(row[i]))
        elif i == 2: # timestamp
            timestamp = datetime.strptime(row[i], '%Y-%m-%d %H:%M:%S')
            # timestamp = timestamp.replace(tzinfo=pytz.UTC) # no need timezones for modern numpy
            converted_row.append(timestamp)
        elif i == 3 or i == 4: # float type
            if row[i] == " ":
                converted_row.append(None) # NaN
            else: converted_row.append(float(row[i]))
        else: # default to integers
            if row[i] == " ":
                converted_row.append(-99999) # placeholder
            else: converted_row.append(int(row[i]))
    converted_set.append(converted_row)
dataset = np.array(converted_set)

# get sample data for reference
random_index = np.random.choice(dataset.shape[0], size=1, replace=False)
print("Sample data (row {0}):".format(random_index))
print(dataset[random_index, :])
print("Sample data (row 1):")
print(dataset[0])

### Code in question ###
print(dataset)
print(dataset.dtype)
dataset = dataset.astype([
    ('SEASON', 'i'), 
    ('NAME', 'S'),
    ('ISO_TIME', 'datetime64[s]'),
    ('USA_LAT', 'f'),('USA_LON', 'f'),
    ('USA_WIND', 'i'),('USA_PRES', 'i'),
    ('USA_R34_NE', 'i')
])
print(dataset)
print(dataset.dtype)

Output:

Sample data (row [38692]):
[[1999 'MAGGIE' datetime.datetime(1999, 6, 8, 6, 0) 23.6 111.0 20 -99999
  -99999]]
Sample data (row 1):
[1945 'ANN' datetime.datetime(1945, 4, 19, 12, 0) 9.5 160.3 25 -99999
 -99999]
[[1945 'ANN' datetime.datetime(1945, 4, 19, 12, 0) ... 25 -99999 -99999]
 [1945 'ANN' datetime.datetime(1945, 4, 19, 18, 0) ... 30 -99999 -99999]
 [1945 'ANN' datetime.datetime(1945, 4, 20, 0, 0) ... 35 -99999 -99999]
 ...
 [2019 'PHANFONE' datetime.datetime(2019, 12, 28, 12, 0) ... 25 1009
  -99999]
 [2019 'PHANFONE' datetime.datetime(2019, 12, 28, 18, 0) ... 20 1011
  -99999]
 [2019 'PHANFONE' datetime.datetime(2019, 12, 29, 0, 0) ... 20 1010
  -99999]]
object
Traceback (most recent call last):
  File "D:\path\Documents\Programming\path\Dataset\_forstackoverflow.py", line 64, in <module>
    dataset = dataset.astype([
ValueError: invalid literal for int() with base 10: 'ANN'

If I do not perform the astype step, the data type will turn out to be object, but I believe it will be more convenient in the future if the data are already of the right types. I have also tried to specify sizes, but it gave me an identical error.

Code:

dataset = dataset.astype([
    ('SEASON', 'i4'), 
    ('NAME', 'U16'),
    ('ISO_TIME', 'datetime64[s]'),
    ('USA_LAT', 'f'),('USA_LON', 'f'),
    ('USA_WIND', 'i4'),('USA_PRES', 'i4'),
    ('USA_R34_NE', 'i4')
])

I wonder what is wrong with or missing in my astype call. Thanks in advance!

Answer (from a uj5u.com user):

To illustrate my last comment:

In [9]: arr = np.array([[1,2,'word'],[3,4,'other']])
In [10]: arr
Out[10]: 
array([['1', '2', 'word'],
       ['3', '4', 'other']], dtype='<U21')
In [11]: arr.astype('i,i,U10')
Traceback (most recent call last):
  File "<ipython-input-11-3800d012c681>", line 1, in <module>
    arr.astype('i,i,U10')
ValueError: invalid literal for int() with base 10: 'word'
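
One way to read this error, offered as a sketch and not a definitive account: astype treats a compound dtype as the target type of every individual element, not as a per-column specification. Each element is cast on its own and its value is copied into all fields, so the array keeps its 2D shape, and the string 'word' ends up being passed to an integer field. The purely numeric array below (nums and out are illustrative names) shows the same broadcasting without triggering an error:

```python
import numpy as np

# A small all-integer 2D array, so every element can be cast to both fields.
nums = np.array([[1, 2], [3, 4]])

# Cast to a compound dtype with two int32 fields (auto-named f0 and f1).
out = nums.astype('i,i')

# The shape is unchanged: each original element became one record,
# with its value repeated in every field.
print(out.shape)
print(out['f0'])
print(out['f1'])
```

So the columns of the 2D array are never matched up with the fields of the dtype, which is why the sizes given in the dtype make no difference to the error.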

But if I make a list of tuples:

In [14]: alist = [tuple(row) for row in arr]
In [15]: alist
Out[15]: [('1', '2', 'word'), ('3', '4', 'other')]
In [16]: np.array(alist, dtype='i,i,U10')
Out[16]: 
array([(1, 2, 'word'), (3, 4, 'other')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<U10')])

Or:

In [17]: import numpy.lib.recfunctions as rf
In [19]: rf.unstructured_to_structured(arr, np.dtype('i,i,U10'))
Out[19]: 
array([(1, 2, 'word'), (3, 4, 'other')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<U10')])
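
Applied to the data in the question, the list-of-tuples approach might look like the sketch below. The two rows are made-up samples modeled on the printed output (the NaN latitude in the second row is invented to show a missing float), rows and structured are illustrative names, and np.nan is used in place of None for missing floats. Note that the result is a 1D structured array with one record per row, not a 2D array.

```python
import numpy as np
from datetime import datetime

# Hypothetical sample rows in the same layout as the question's converted_set.
rows = [
    [1945, 'ANN', datetime(1945, 4, 19, 12, 0), 9.5, 160.3, 25, -99999, -99999],
    [1945, 'ANN', datetime(1945, 4, 19, 18, 0), np.nan, 160.0, 30, -99999, -99999],
]

# Field names and types taken from the question's second astype attempt.
dt = np.dtype([
    ('SEASON', 'i4'),
    ('NAME', 'U16'),
    ('ISO_TIME', 'datetime64[s]'),
    ('USA_LAT', 'f'), ('USA_LON', 'f'),
    ('USA_WIND', 'i4'), ('USA_PRES', 'i4'),
    ('USA_R34_NE', 'i4'),
])

# Each row must be a tuple so numpy treats it as one record.
structured = np.array([tuple(r) for r in rows], dtype=dt)

# Columns are then accessed by field name, each with its own dtype.
print(structured['NAME'])
print(structured['ISO_TIME'][0])

# Placeholders can be located per field.
missing_pres = structured['USA_PRES'] == -99999
missing_lat = np.isnan(structured['USA_LAT'])
```

In the question's script, this would mean building the array from converted_set with tuple rows and the dtype passed to np.array, in place of the later astype call.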


Tags: python, numpy
