我正在嘗試實作更快的np.isinin版本numba,這就是我目前所擁有的:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def isin(a, b):
out=np.empty(a.shape[0], dtype=nb.boolean)
b = set(b)
for i in nb.prange(a.shape[0]):
if a[i] in b:
out[i]=True
else:
out[i]=False
return out
對于數字,它可以作業,如本例所示:
a = np.array([1,2,3,4])
b = np.array([2,4])
isin(a,b)
>>> array([False, True, False, True])
它比np.isin:
a = np.random.rand(20000)
b = np.random.rand(5000)
%time isin(a,b)
CPU times: user 3.96 ms, sys: 0 ns, total: 3.96 ms
Wall time: 1.05 ms
%time np.isin(a,b)
CPU times: user 11 ms, sys: 0 ns, total: 11 ms
Wall time: 8.48 ms
但是,我想將此函式與包含字串的陣列一起使用。問題是,每當我嘗試傳遞一個字串陣列時,都會numba抱怨它無法set()用這些資料解釋操作。
a = np.array(['A','B','C','D'])
b = np.array(['B','D'])
isin(a,b)
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'set'>) found for signature:
>>> set(array([unichr x 1], 1d, C))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload of function 'set': File: numba/core/typing/setdecl.py: Line 20.
With argument(s): '(array([unichr x 1], 1d, C))':
No match.
During: resolving callee type: Function(<class 'set'>)
During: typing of call at /tmp/ipykernel_20582/4221597969.py (7)
File "../../../../tmp/ipykernel_20582/4221597969.py", line 7:
<source missing, REPL/exec in use?>
有沒有辦法,比如指定一個簽名,可以讓我直接在字串陣列上使用它?
我知道我可以為每個字串分配一個數值,但是對于大型陣列,我認為這需要一段時間,并且會使整個事情比僅使用np.isin.
有任何想法嗎?
uj5u.com熱心網友回復:
Strings are barely supported by Numba (like bytes although the support is slightly better). Set and dictionary are supported with some strict restriction and are quite experimental/new. Sets of strings are not supported yet regarding the documentation:
Sets must be strictly homogeneous: Numba will reject any set containing objects of different types, even if the types are compatible (for example, {1, 2.5} is rejected as it contains a
intand afloat). The use of reference counted types, e.g. strings, in sets is unsupported.
You can try to cheat using a binary search. Unfortunately, np.searchsorted is not yet implemented for string-typed Numpy array (although np.unique is). I think you can implement a binary search yourself but this is cumbersome to do and in the end. I am not sure this will be faster in the end, but I think it should because of the O(Ns Na log Nb)) running time complexity (with Ns the average size of string length of b unique items, Na the number of items in a and Nb the number of unique items in b). Indeed, the running time complexity of np.isin is O(Ns (Na Nb) log (Na Nb)) if arrays are about similar sizes and O(Ns Na Nb) if Nb is much smaller than Na. Note that the best theoretical running time complexity is AFAIK O(Ns (Na Nb)) thanks to a hash table with a good hash function (Tries can also achieve this but they should be practically slower unless the hash function is not great).
Note that typed dictionaries support statically declared string but not dynamic ones (and this is an experimental feature for static ones).
Another cheat (that should work) is to store string hashes as the key of a typed dictionary and associate each hash to an array of indices referencing the string locations in b having the hash of the associated key. The parallel loop needs to hash a items and fetch the location of strings item in b having this hash so you can then compare the strings. A faster implementation would be to assume that the hash function for b strings is perfect and that there is no collision so you can directly use a TypedDict[np.int64, np.int64] hash table. You can test this hypothesis at runtime when you build b. Writing such a code is a bit tedious. Note that this implementation may not be faster than Numpy in the end in sequential because Numba TypedDicts are currently pretty slow... However, this should be faster in parallel on processors with enough cores.
轉載請註明出處,本文鏈接:https://www.uj5u.com/qiye/422651.html
標籤:
上一篇:具有最大表單大小滯后的全屏表單
