為什么numpy笛卡爾積比純python版本慢？-有解無憂

輸入

import numpy as np
import itertools

a = np.array([ 1,  6,  7,  8, 10, 11, 13, 14, 15, 19, 20, 23, 24, 26, 28, 29, 33,
       34, 41, 42, 43, 44, 45, 46, 47, 52, 54, 58, 60, 61, 65, 70, 75]).astype(np.uint8)
b = np.array([ 2,  3,  4, 10, 12, 14, 16, 20, 22, 26, 28, 29, 30, 31, 34, 36, 37,
       38, 39, 40, 41, 46, 48, 49, 50, 52, 53, 55, 56, 57, 59, 60, 63, 66,
       67, 68, 69, 70, 71, 74]).astype(np.uint8)
c = np.array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75]).astype(np.uint8)

我想獲得 3 個陣列的笛卡爾積，但我不希望一行中的任何重復元素[1, 2, 1]無效，并且這兩個元素中只有一個有效，[10, 14, 0]或者[14, 10, 0]因為 10 和 14 都在a和中b。

僅限 Python

def no_numpy():
    combos = {tuple(set(i)): i for i in itertools.product(a, b, c)}
    combos = [val for key, val in combos.items() if len(key) == 3]
%timeit no_numpy() # 32.5 ms ± 508 μs per loop

麻木的

# Solution from (https://stackoverflow.com/a/11146645/18158000)
def cartesian_product(*arrays):
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
    dtype = np.result_type(*arrays)

    out = np.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end   rows
    return out.reshape(cols, rows).T

def numpy():
    combos = {tuple(set(i)): i for i in cartesian_product(*[a, b, c])}
    combos = [val for key, val in combos.items() if len(key) == 3]
%timeit numpy() # 96.2 ms ± 136 μs per loop

我的猜測是在 numpy 版本中將轉換np.array為 set 是為什么它要慢得多，但是在嚴格比較時，獲取初始產品cartesian_product比itertools.product.

無論如何可以修改 numpy 版本以優于純 python 解決方案，還是有另一種優于兩者的解決方案？

uj5u.com熱心網友回復：

為什么當前的實作很慢

雖然第一個解決方案比第二個解決方案更快，但它的效率非常低，因為它創建了許多臨時 CPython 物件（每個專案至少 6 個itertools.product）。創建大量物件的成本很高，因為它們是由 CPython 動態分配和參考計數的。Numpy 函式cartesian_product非常快，但對結果陣列的迭代非常慢，因為它創建了很多 Numpy 視圖并在numpy.uint8而不是 CPython上運行int。Numpy 型別和函式為非常小的陣列引入了巨大的開銷。

如@AlainT 所示，Numpy 可用于加速此操作，但這并非易事，而且 Numpy 無法解決此類問題。

如何提高性能

一種解決方案是使用Numba自己更有效地完成作業，并讓 Numba 的JIT 編譯器優化回圈。您可以使用 3 個嵌套回圈來有效地生成笛卡爾積的值并過濾專案。字典可用于跟蹤已經看到的值。可以將 3 個專案的元組打包成一個整數，以減少記憶體占用并提高性能（因此字典可以更好地適應 CPU 快取并避免創建緩慢的元組物件）。

這是生成的代碼：

import numba as nb

# Signature of the function (parameter types)
# Note: `::1` means the array is contiguous
@nb.njit('(uint8[::1], uint8[::1], uint8[::1])')
def with_numba(a, b, c):
    seen = dict()

    for va in a:
        for vb in b:
            for vc in c:
                # If the 3 values are different
                if va != vb and vc != vb and vc != va:
                    # Sort the 3 values using a fast sorting network
                    v1, v2, v3 = va, vb, vc
                    if v1 > v2: v1, v2 = v2, v1
                    if v2 > v3: v2, v3 = v3, v2
                    if v1 > v2: v1, v2 = v2, v1

                    # Compact the 3 values into one 32-bit integer
                    packedKey = (np.uint32(v1) << 16) | (np.uint32(v2) << 8) | np.uint32(v3)

                    # Is the sorted tuple (v1,v2,v3) already seen?
                    if packedKey not in seen:
                        # Add the value and remember the ordered tuple (va,vb,vc)
                        packedValue = (np.uint32(va) << 16) | (np.uint32(vb) << 8) | np.uint32(vc)
                        seen[packedKey] = packedValue

    res = np.empty((len(seen), 3), dtype=np.uint8)

    for i, packed in enumerate(seen.values()):
        res[i, 0] = np.uint8(packed >> 16)
        res[i, 1] = np.uint8(packed >> 8)
        res[i, 2] = np.uint8(packed)

    return res

with_numba(a, b, c)

基準

以下是我的 i5-9600KF 處理器上的結果：

numpy:                         122.1 ms  (x 1.0)
no_numpy:                       49.6 ms  (x 2.5)
AlainT's solution:              49.0 ms  (x 2.5)
mathfux's solution              34.2 ms  (x 3.5)
mathfux's optimized solution     7.5 ms  (x16.2)
with_numba:                      4.9 ms  (x24.9)

提供的解決方案比最慢的實作快約 25 倍，比迄今為止提供的最快實作快約 1.5 倍。

當前的 Numba 代碼受限于 Numba 字典操作的速度。可以使用更多低級技巧來優化代碼。解決方案是用大小的打包布爾陣列（1 項 = 1 位）替換字典，256**3/8以跟蹤已經看到的值（通過檢查第packedKeyth 位）。res如果未設定獲取位，則可以直接添加打包值。這需要res預先分配到最大大小或實作指數增長的陣列（如list在 Python 或std::vectorC 中）。另一個優化是對串列進行排序并使用平鋪策略以提高快取區域性. 這種優化遠非容易實作，但我希望它們能大大加快執行速度。

如果您打算使用更多陣列，那么哈希映射可能會成為瓶頸，而位陣列可能會非常大。雖然使用平鋪確實有助于減少記憶體占用，但您可以使用布隆過濾器大大加快實作速度。這種概率資料結構可以通過跳過許多重復項來加快執行速度，而不會導致任何快取未命中并且記憶體占用少。您可以洗掉大部分重復項，然后對陣列進行排序，然后洗掉重復項。關于您的問題，基數排序可能比通常的排序演算法更快。

uj5u.com熱心網友回復：

你可以這樣做：

# create full Cartessian product and keep items in sorted form
arr = np.stack(np.meshgrid(a, b, c), axis=-1).reshape(-1, 3)
arr_sort = np.sort(arr, axis=1)

# apply condition 1: no duplicates between sorted items
u, idx_c1 = np.unique(arr_sort, return_index=True, axis=0)
arr_filter, arr_sort_filter = arr[idx_c1], arr_sort[idx_c1]

# apply condition 2: no items with repeated values between sorted items
idx_c2 = (arr_sort_filter[:,0] != arr_sort_filter[:,1]) & \
           (arr_sort_filter[:,1] != arr_sort_filter[:,2])

arr_filter[idx_c2]
>>>
array([[ 1,  2,  0],
       [ 1,  3,  0],
       [ 1,  4,  0],
                ...,
       [75, 71, 74],
       [75, 74, 72],
       [75, 74, 73]], dtype=uint8)

我的電腦需要 57 毫秒，no_numpy(args?)而回傳 50014 項需要 77 毫秒。

您可以稍后分析此演算法以查看可以優化的內容。我是手動完成的，但是找到一些分析工具是個好主意:)

arr ~0.2 毫秒
arr_sort ~1.4ms
u, idx_c1 ~ 52ms
剩余部分~2.5ms

所以很容易看到這里一直在消耗什么。使用降維可以顯著改善它。一種方法是替換

u, idx_c1 = np.unique(arr_sort, return_index=True, axis=0)

和

M = max(np.max(a), np.max(b), np.max(c))
idx = np.ravel_multi_index(arr_sort.T, (M 1, M 1, M 1))
u, idx_c1 = np.unique(idx, return_index=True)

它現在只運行 4.5 毫秒，總共只運行 9 毫秒！我猜如果你優化了這些部分，你可以將這個演算法加速 3 倍：

用于numba更快的比較idx_c2
用于numba加速np.ravel_multi_index（即使在中手動實作也更快numpy）
使用numba或numpy版本np.bincount代替np.unique

uj5u.com熱心網友回復：

讓 numpy 與過濾后的 python 迭代器一樣快是相當困難的，因為 numpy 處理的整個結構將不可避免地大于過濾集的結果。

這是我能想出的最好的處理陣列乘積的方法，即根據不同值的唯一組合過濾結果：

def npProductSets(a,b,*others):
    if len(a.shape)<2 : a = a[:,None]
    if len(b.shape)<2 : b = b[:,None]
    left  = np.repeat(a,b.shape[0],axis=0)
    right = np.tile(b,(a.shape[0],1))
    distinct = ~np.any(right==left,axis=1)
    prod  = np.concatenate((left[distinct],right[distinct]),axis=1)
    prod.sort(axis=1)
    prod  = np.unique(prod,axis=0) 
    if others:
        return npProductSets(prod,*others)
    return prod

這個 npProductSets 函式過濾擴展的陣列，并使用 numpy 方法。不過，它的運行速度仍然比 Python 生成器慢（0.078 秒對 0.054 秒）。Numpy 不是組合學和集合操作的理想工具。

請注意， npProductSets 回傳 50014 專案而不是您的 58363 因為tuple(set(i))不會過濾數字的所有排列。將集合轉換為元組并不能保證元素的順序（因此由于置換專案，重復的組合會包含在您的輸出中）。

轉載請註明出處，本文鏈接：https://www.uj5u.com/shujuku/432691.html

標籤：Python 麻木的表现

上一篇：這些渦扇記憶區是什么意思？

下一篇：Python：帶有額外索引的快速矩陣乘法