I have an array that I loaded from a CSV file with np.loadtxt:
dataresale = np.loadtxt(
resale, skiprows=1, usecols=(0,2,10),
dtype=[('month', 'U50'),
('flat_type', 'U50'),
('resale_price', 'f8')], delimiter=',')
print(dataresale['month'])
Here is the output:
['2017-01' '2017-01' '2017-01' ... '2021-03' '2021-10' '2021-12']
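I only want to extract the data for 2021 (all months).
Below is the script I use to extract rows by year from another array, but this particular dataset also carries the month in the same field: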
x = datap[datax['year'] == 2019]
Is there a way to modify the script above so that it extracts all of the 2021 data?
uj5u.com user reply:
Construct a sample array:
In [359]: arr = np.zeros(6, dtype=[('month', 'U50'),
...: ('flat_type', 'U50'),
...: ('resale_price', 'f8')])
In [360]: arr['month']=['2017-01', '2017-01', '2017-01','2021-03', '2021-10', '2021-12']
startswith
Since we only care about the leading characters of each string, we can do this:
In [362]: np.char.startswith(arr['month'],'2021')
Out[362]: array([False, False, False, True, True, True])
This is effectively the same as:
In [364]: [s.startswith('2021') for s in arr['month']]
Out[364]: [False, False, False, True, True, True]
The list comprehension is faster, but for a fairer comparison let's get the indices:
In [366]: timeit np.nonzero([s.startswith('2021') for s in arr['month']])
15.1 μs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [367]: timeit np.nonzero(np.char.startswith(arr['month'],'2021'))
16.7 μs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
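Either boolean mask can index the structured array directly, which returns whole records rather than just the month field. A minimal sketch reusing the sample array from above; the resale_price values are made up for illustration:

```python
import numpy as np

# Sample structured array; the prices are illustrative values only.
arr = np.zeros(6, dtype=[('month', 'U50'),
                         ('flat_type', 'U50'),
                         ('resale_price', 'f8')])
arr['month'] = ['2017-01', '2017-01', '2017-01', '2021-03', '2021-10', '2021-12']
arr['resale_price'] = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]

# Boolean mask: True where the month string starts with '2021'.
mask_2021 = np.char.startswith(arr['month'], '2021')

# Indexing with the mask keeps every field of the matching records.
rows_2021 = arr[mask_2021]
```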
astype truncation
However, astype is a relatively fast way to truncate string dtypes, which in effect acts as a [:4] string slice:
In [371]: arr['month'].astype('U4')
Out[371]: array(['2017', '2017', '2017', '2021', '2021', '2021'], dtype='<U4')
In [372]: arr['month'].astype('U4')=='2021'
Out[372]: array([False, False, False, True, True, True])
In [374]: timeit np.nonzero(arr['month'].astype('U4')=='2021')
6.47 μs ± 7.53 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
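The same truncation idea extends to other prefixes. As a small sketch (the array below is an illustrative stand-in for the month field), truncating to 7 characters keeps 'YYYY-MM' and lets you match a single month, but truncation can only keep a prefix, so it cannot isolate the month digits on their own:

```python
import numpy as np

# Illustrative stand-in for arr['month'].
months = np.array(['2017-01', '2021-03', '2021-10', '2021-12'], dtype='U50')

# Truncating to 4 characters keeps the year.
is_2021 = months.astype('U4') == '2021'

# Truncating to 7 characters keeps the full 'YYYY-MM' prefix.
is_oct_2021 = months.astype('U7') == '2021-10'
```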
datetime64[Y]
Another option is to convert the strings to datetime64:
In [376]: arr['month'].astype('datetime64[Y]')
Out[376]:
array(['2017', '2017', '2017', '2021', '2021', '2021'],
dtype='datetime64[Y]')
Timing with the conversion included:
In [379]: timeit np.nonzero(arr['month'].astype('datetime64[Y]')==np.array('2021','datetime64[Y]'))
17.5 μs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
If we can justify doing the conversion ahead of time:
In [380]: %%timeit yrs = arr['month'].astype('datetime64[Y]')
...: np.nonzero(yrs==np.array('2021','datetime64[Y]'))
6.2 μs ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
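One case where converting ahead of time pays off is when several date comparisons are needed. The sketch below assumes month-level precision (datetime64[M]) instead of the year-level precision used above, and selects a half-open range of months; the sample values are illustrative:

```python
import numpy as np

# Illustrative stand-in for arr['month'].
months = np.array(['2017-01', '2021-03', '2021-10', '2021-12'], dtype='U50')

# Convert once, then reuse the converted array for every comparison.
as_months = months.astype('datetime64[M]')

# Half-open range [2021-06, 2022-01): the second half of 2021.
start = np.datetime64('2021-06', 'M')
stop = np.datetime64('2022-01', 'M')
second_half_2021 = (as_months >= start) & (as_months < stop)
```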
char_slice (using the char_slice function defined in a later answer)
In [396]: char_slice(arr['month'],0,4)
Out[396]: array(['2017', '2017', '2017', '2021', '2021', '2021'], dtype='<U4')
In [397]: char_slice(arr['month'],0,4)=='2021'
Out[397]: array([False, False, False, True, True, True])
In [398]: timeit np.nonzero(char_slice(arr['month'],0,4)=='2021')
37.2 μs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
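uj5u.com user reply:
In general, numpy.ndarray objects have limited support for string operations. Notably, string slicing does not seem to exist. If you look at similar questions, you can at least hack a slice from the front by using a view (with a smaller N in the UN dtype). However, because your array has a structured dtype, it does not like creating such views.
In this particular case, though, you can use the np.char.startswith function.
Some sample data (please always provide this in the future: you are the one asking for help, so don't make people work to make your own question answerable. This is actually part of the rules, but it is also just common courtesy):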
(py39) Juans-MBP:workspace juan$ cat resale.csv
2017-01,foo,4560.0
2019-01,bar,3432.34
2017-01,baz,34199.5
2019-01,baz,3232.34
2017-01,bar,932.34
OK, so using the data above:
In [1]: import numpy as np
In [2]: resale = "resale.csv"
In [3]: data = np.loadtxt(resale,dtype=[('month','U50'),('flat_type','U50'),
...: ('resale_price','f8')],delimiter=',')
In [4]: data
Out[4]:
array([('2017-01', 'foo', 4560. ), ('2019-01', 'bar', 3432.34),
('2017-01', 'baz', 34199.5 ), ('2019-01', 'baz', 3232.34),
('2017-01', 'bar', 932.34)],
dtype=[('month', '<U50'), ('flat_type', '<U50'), ('resale_price', '<f8')])
In [5]: np.char.startswith(data['month'], "2019")
Out[5]: array([False, True, False, True, False])
In [6]: data[np.char.startswith(data['month'], "2019")]
Out[6]:
array([('2019-01', 'bar', 3432.34), ('2019-01', 'baz', 3232.34)],
dtype=[('month', '<U50'), ('flat_type', '<U50'), ('resale_price', '<f8')])
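Applied to the original question, the same pattern can be sketched end to end. The snippet below uses an in-memory string as a stand-in for resale.csv (the rows are invented), and `rows_for_year` is a hypothetical helper name introduced only for this illustration:

```python
import io
import numpy as np

# Stand-in for resale.csv with a header row; the rows are invented.
csv_text = """month,flat_type,resale_price
2017-01,foo,4560.0
2021-03,bar,3432.34
2021-12,baz,3232.34
"""

dtype = [('month', 'U50'), ('flat_type', 'U50'), ('resale_price', 'f8')]
data = np.loadtxt(io.StringIO(csv_text), skiprows=1, dtype=dtype, delimiter=',')


def rows_for_year(records, year):
    """Hypothetical helper: keep records whose 'month' starts with `year`."""
    return records[np.char.startswith(records['month'], str(year))]


rows_2021 = rows_for_year(data, 2021)
```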
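Alternatively, since you are working with dates, which is a type numpy supports, you can use the dtype 'datetime64[D]'. The column is then parsed as datetime64, with the day filled in for you: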
In [14]: data = np.loadtxt(resale,dtype=[('month','datetime64[D]'),('flat_type','U50'),
...: ('resale_price','f8')],delimiter=',')
In [8]: data
Out[8]:
array([('2017-01-01', 'foo', 4560. ), ('2019-01-01', 'bar', 3432.34),
('2017-01-01', 'baz', 34199.5 ), ('2019-01-01', 'baz', 3232.34),
('2017-01-01', 'bar', 932.34)],
dtype=[('month', '<M8[D]'), ('flat_type', '<U50'), ('resale_price', '<f8')])
Then you can use something like:
In [9]: data['month'] >= np.datetime64("2019")
Out[9]: array([False, True, False, True, False])
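Note that `>=` keeps the given year and every later one. With only 2017 and 2019 in the sample that makes no difference, but on data spanning more years a single-year selection needs either an upper bound or a truncation to year precision. A minimal sketch with illustrative dates:

```python
import numpy as np

# Illustrative stand-in for data['month'] loaded as datetime64[D].
month = np.array(['2017-01-01', '2019-01-01', '2021-03-01'], dtype='datetime64[D]')

# `>=` alone keeps 2019 and every later year.
from_2019 = month >= np.datetime64('2019')

# Truncating to year precision isolates a single year.
only_2019 = month.astype('datetime64[Y]') == np.datetime64('2019')
```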
uj5u.com user reply:
If you insist on using numpy, you can use slicing to extract the data you want in integer format:
strs = dataresale['month'].copy()[:, None].view('U1')
year = strs[:, :4].view('U4').astype(int).ravel()
month = strs[:, 5:7].view('U2').astype(int).ravel()
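Converting to 2D and raveling at the end allows the 'U1' view to expand into columns. The copy is necessary because the data is not fully contiguous (although it is contiguous along the column dimension).
Any mask you construct from these arrays applies to the original array, for example: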
dataresale[year == 2021]
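P.S.
The copy really bothered me, since the original data is clearly "contiguous enough" to avoid it. It would be understandable if the elements of each string were not in contiguous blocks. So I propose the following string-slicing alternative, which is actually cheaper and simpler in some ways: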
yoffset = dataresale.dtype.fields['month'][1]
year = np.ndarray(buffer=dataresale, shape=dataresale.shape, offset=yoffset, strides=dataresale.strides, dtype='U4').astype(int)
moffset = dataresale.dtype.fields['month'][1] + dataresale.dtype.fields['month'][0].itemsize // 50 * 5
month = np.ndarray(buffer=dataresale, shape=dataresale.shape, offset=moffset, strides=dataresale.strides, dtype='U2').astype(int)
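P.P.S.
The lack of a general string-slicing method in numpy really bothered me. Given the example above, it seems simple to implement, so here is a more general solution: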
def char_slice(a, start, stop=None):
    """
    Apply `slice` to each element of string array `a`.

    Parameters
    ----------
    a : array-like
        Must contain ASCII or unicode elements.
    start, stop : int
        The limits of the slice. Only contiguous one-directional
        slices are supported. If only `start` is provided, the
        slice is interpreted as `(0, start)`, as usual in python.
        Bounds past the ends of the string are silently truncated,
        as with most python slicing. Negative slice values are
        interpreted relative to the end of the datatype, not
        necessarily contents of the individual elements. `start` is
        inclusive, while `stop` is exclusive. `start <= stop` is
        required after all other adjustments have been made.

    Returns
    -------
    slice : np.ndarray
        A view of the original data, sliced to show strings of the
        required size. The dimensions of the array will be the same
        as those of the input, and the datatype will be `SN` or
        `UN`, with the same byte order and code as the input, but
        with `N = stop - start`.

    Notes
    -----
    There are two circumstances under which a view can not be
    returned. The simplest is when the data is not in a suitable
    format, such as a list or other array-like. As a rule of thumb,
    anything that `numpy.asanyarray` would copy becomes a copy. The
    second circumstance is when the original base of the array `a`
    is non-contiguous. `a` itself does not have to be contiguous for
    a view to be successfully constructed.
    """
    a = np.asanyarray(a)
    dtype = a.dtype
    if dtype.char not in 'US':
        raise TypeError(f'Only U and S string datatypes supported. Found {dtype.char}')
    length = int(dtype.str[2:])
    # Adjust the bounds using a slice object
    if stop is None:
        start, stop = 0, start
    start, stop, step = slice(start, stop).indices(length)
    if start > stop or step != 1:
        raise ValueError('Invalid start-stop combination. Start <= stop required after adjustment.')
    # Get the real dtype information (bytes per character)
    charsize = dtype.itemsize // length
    # Find the real base array that owns the memory
    base = a
    while base.base is not None:
        base = base.base
    # Byte offset of `a` within its base, plus the offset of the slice start
    realoffset = a.__array_interface__['data'][0] - base.__array_interface__['data'][0]
    newoffset = start * charsize + realoffset
    newdtype = np.dtype(f'{dtype.str[:2]}{stop - start}')
    try:
        newarray = np.ndarray(buffer=base, offset=newoffset, shape=a.shape, strides=a.strides, dtype=newdtype)
    except ValueError as e:
        if str(e) == 'ndarray is not contiguous':
            # Fall back to a copy when the base cannot be used as a buffer
            a = a.copy()
            newarray = np.ndarray(buffer=a, offset=start * charsize, shape=a.shape, strides=a.strides, dtype=newdtype)
        else:
            raise
    return newarray
Now all you need to do to get the year is:
year = char_slice(dataresale['month'], 4).astype(int)
uj5u.com user reply:
I think you can do it with the help of pandas.Series, as follows:
dataresale[pd.Series(dataresale['month']).str.match(r'^2021-')]
Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/401964.html
