Length of the intersection between a list and a list of lists

Note: this is almost a duplicate of "Numpy vectorization: Find intersection between list and list of lists".

Differences:

I am focused on efficiency when the lists are large.
I am looking for the largest intersections.

x = [500 numbers between 1 and N]
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12], etc. up to N]

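Here are some assumptions:

y is a list of about 500,000 sublists of about 500 elements each
Each sublist of y is a range, so y is characterized by the last element of each sublist. In the example: 3, 7, 9, 12 ...
x is not sorted
y contains every number between 1 and ~500000*500 once and only once
y is sorted in the sense that, as in the example, each sublist is sorted and the first element of a sublist is the successor of the last element of the previous sublist.
y is known long in advance, even at compile time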
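
To make these assumptions concrete, here is a minimal sketch that builds a small y with the same shape and shows how it reduces to the last element of each sublist. The helper name make_y and the sublist lengths are made up for illustration; they are not part of the question.

```python
from itertools import accumulate

def make_y(lengths):
    """Build contiguous sorted sublists covering 1..sum(lengths).

    Hypothetical helper: `lengths` gives the size of each sublist.
    """
    ends = list(accumulate(lengths))              # last element of each sublist
    starts = [1] + [e + 1 for e in ends[:-1]]     # each start follows the previous end
    return [list(range(s, e + 1)) for s, e in zip(starts, ends)]

y = make_y([3, 4, 2, 3])
last_elements = [sub[-1] for sub in y]

print(y)              # [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12]]
print(last_elements)  # [3, 7, 9, 12]
```

Because the sublists are contiguous, last_elements alone is enough to tell which sublist any number belongs to.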

My goal is to find which sublists of y have at least 10 elements in common with x.

I can obviously write a loop:

def find_best(x, y):
    result = []
    x_set = set(x)  # build the set once instead of once per sublist

    for index, sublist in enumerate(y):
        intersection = x_set.intersection(sublist)
        if len(intersection) >= 2:  # in real life: >= 10
            result.append(index)

    return result


x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4],  [5, 6], [7], [8, 9, 10, 11]]

res = find_best(x, y)
print(res)   # [0, 2]

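Here the result is [0, 2] because the first and third sublists of y each have at least 2 elements in common with x.

Another approach would walk through y only once and count the intersections: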

def find_intersec2(x, y):
    n_sublists = len(y)
    res = {num: 0 for num in range(0, n_sublists + 1)}
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in x:
                x.remove(num)
                res[list_no] += 1
    return [n for n in range(n_sublists + 1) if res[n] >= 2]

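The second method makes more use of the assumptions.

Questions:

What optimizations are possible?
Is there a completely different approach? Indexing, a k-d tree? In my use case, the large list y is known days before the actual run, so I am not afraid of building an index or anything else from y. The small list x is only known at runtime.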

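Reply from a uj5u.com user:

Since y contains disjoint ranges and their union is also a range, a very fast solution is to first perform a binary search on y for each value of x, then count the resulting indices and return only those that appear at least 10 times. The complexity of this algorithm is O(Nx log Ny), where Nx and Ny are the number of items in x and y respectively. This algorithm is nearly optimal (since x needs to be read entirely).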
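
The idea can be sketched in pure Python before moving to Numpy. This is only an illustration using the standard bisect module and collections.Counter; the function name sublists_with_hits and the starts layout (first value of each range, plus one past the end of the last range) are assumptions made for this sketch.

```python
from bisect import bisect_right
from collections import Counter

def sublists_with_hits(x, starts, min_hits):
    """Return indices of ranges that contain at least min_hits values of x.

    starts[i] is the first value of range i; the extra final entry is one
    past the end of the last range.
    """
    counts = Counter()
    for value in x:
        i = bisect_right(starts, value) - 1   # binary search: O(log Ny) per value
        if 0 <= i < len(starts) - 1:          # ignore values outside every range
            counts[i] += 1
    return sorted(i for i, c in counts.items() if c >= min_hits)

# Ranges [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
starts = [1, 4, 5, 7, 8, 12]
print(sublists_with_hits([1, 2, 3, 4, 5, 6], starts, 2))  # [0, 2]
```

The loop does one binary search per element of x, which is where the O(Nx log Ny) bound comes from.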

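Actual implementation

First, you need to convert your current y into a Numpy array containing the starting values of all ranges (in increasing order), with N as the last value (assuming N is excluded from the ranges of y, or N+1 otherwise). This part can be assumed to be free since y can be computed at compile time in your case. Here is an example: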

import numpy as np
y = np.array([1, 4, 8, 10, 13, ..., N])
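
If y starts out as a list of lists, the boundaries can be derived in one pass. The sketch below is plain Python with a hypothetical helper name, range_starts; the resulting list can then be passed to np.array.

```python
def range_starts(y_lists):
    """Return the first value of every sublist plus a final sentinel.

    The sentinel is one past the last value of the last sublist, so that
    every value in y falls strictly below it.
    """
    starts = [sub[0] for sub in y_lists]
    starts.append(y_lists[-1][-1] + 1)
    return starts

y_lists = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
print(range_starts(y_lists))  # [1, 4, 5, 7, 8, 12]
```

Since y is known well before runtime, this step can be done once and the result stored.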

Then, you need to perform the binary search and check that the values fall within the range of y:

indices = np.searchsorted(y, x, 'right')

# The `0 < indices < len(y)` check should not be needed regarding the input.
# If so, you can use only `indices -= 1`.
indices = indices[(0 < indices) & (indices < len(y))] - 1

Then you need to count the indices and keep those that occur at least 10 times:

uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]

Here is an example based on yours:

x = np.array([1, 2, 3, 4, 5, 6])

# [[1, 2, 3], [4],  [5, 6], [7], [8, 9, 10, 11]]
y = np.array([1, 4, 5, 7, 8, 12])

# Actual simplified version of the above algorithm
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 2]

# [0, 2]
print(result.tolist())

On my machine it runs in less than 0.1 ms on random inputs that follow your input constraints.

Reply from a uj5u.com user:

Turn y into 2 dictionaries.

index = { # index to count map
    0 : 0,
    1 : 0,
    2 : 0,
    3 : 0,
    4 : 0
}

y = { # elem to index map
    1: 0,
    2: 0,
    3: 0,
    4: 1,
    5: 2,
    6: 2,
    7: 3,
    8 : 4,
    9 : 4,
    10 : 4,
    11 : 4
}

Since you know y in advance, I do not count the above operations in the time complexity. Then, to compute the intersections:

x = [1, 2, 3, 4, 5, 6]
for e in x:
    index[y[e]] += 1

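Since you mentioned that x is small, I tried to make the time complexity depend only on the size of x (O(n) in this case).

Finally, the answer is the list of keys in the index dict whose value is >= 2 (or 10 in the real case).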

answer = [i for i in index if index[i] >= 2]
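
The two dictionaries do not have to be written by hand. As a rough sketch, the element-to-index map can be generated from y, and the counts can be kept in a fresh list per query so that one x does not pollute the next. The names build_elem_to_index and count_hits are invented for this illustration.

```python
def build_elem_to_index(y_lists):
    """Map every element of y to the index of the sublist that holds it."""
    return {e: i for i, sub in enumerate(y_lists) for e in sub}

def count_hits(x, elem_to_index, n_sublists, min_hits):
    """Return indices of sublists hit at least min_hits times by x."""
    counts = [0] * n_sublists          # fresh counters for every query
    for e in x:
        counts[elem_to_index[e]] += 1  # one dict lookup per element of x
    return [i for i, c in enumerate(counts) if c >= min_hits]

y_lists = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
elem_to_index = build_elem_to_index(y_lists)   # precomputed once, ahead of time
print(count_hits([1, 2, 3, 4, 5, 6], elem_to_index, len(y_lists), 2))  # [0, 2]
```

Note that with roughly 250 million elements in y, a dict of this size needs a lot of memory, which is a trade-off against the boundary-array approach.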

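Reply from a uj5u.com user:

This uses y to create a linear array that maps every int to (1 plus) the index of the range, or subgroup, that the int belongs to; it is called x2range_counter.

x2range_counter uses a 32-bit array.array type to save memory, and it can be cached and reused for every x computed against the same y.

Counting the hits in each range for a particular x is then just a series of indirect array increments on a counter array in the function count_ranges.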

import array
from typing import List

y = [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
x = [5, 3, 1, 11, 8, 10]

range_counter_max = len(y)
extent = y[-1][-1] + 1  # min in y must be 1 not 0 remember.
# Compact unsigned-integer storage ('L' is at least 32 bits wide)
x2range_counter = array.array('L', [0] * extent)

# Map any int in any x to appropriate ranges counter.
for range_counter_index, rng in enumerate(y, start=1):
    for n in rng:
        x2range_counter[n] = range_counter_index
print(x2range_counter)  # array('L', [0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4])

# x2range_counter can be saved for this y and any x on this y.

def count_ranges(x: List[int]) -> List[int]:
    "Number of x-hits on each y subgroup in order"
    # Note: count[0] initially catches errors. count[1..] counts x's in y ranges [0..]
    count = array.array('L', [0] * (range_counter_max + 1))
    for xx in x:
        count[x2range_counter[xx]] += 1
    assert count[0] == 0, "x values must all exist in a y range and y must have all int in its range."

    return count[1:]

print(count_ranges(x))  # array('L', [1, 2, 1, 2])

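I created a class for this with extra functionality, such as returning ranges instead of indices; returning all ranges hit >= M times; and returning (range, hit-count) tuples sorted with the most hits first.

The range calculation for different x is proportional to the size of x and consists of simple array lookups rather than any dict hashing.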
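
The class itself is not shown, so the following is only a guess at what one of those extra features might look like: returning (range, hit-count) tuples for ranges hit at least M times, most hits first. The function names build_lookup and ranges_hit_at_least are hypothetical and not taken from the answer.

```python
import array

def build_lookup(y):
    """Map every int in y to 1 + the index of its range (0 means 'not in y')."""
    lookup = array.array('L', [0] * (y[-1][-1] + 1))
    for range_index, rng in enumerate(y, start=1):
        for n in rng:
            lookup[n] = range_index
    return lookup

def ranges_hit_at_least(x, y, lookup, m):
    """Return (range, hit_count) tuples for ranges hit >= m times, most hits first."""
    counts = array.array('L', [0] * (len(y) + 1))
    for xx in x:
        counts[lookup[xx]] += 1                     # indirect array increment
    hits = [(y[i - 1], counts[i]) for i in range(1, len(counts)) if counts[i] >= m]
    hits.sort(key=lambda pair: pair[1], reverse=True)  # stable: ties keep y order
    return hits

y = [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
lookup = build_lookup(y)                            # can be cached for this y
print(ranges_hit_at_least([5, 3, 1, 11, 8, 10], y, lookup, 2))
# [([3, 4, 5], 2), ([9, 10, 11, 12], 2)]
```

The lookup array is built once per y, so each query costs only one array read and one increment per element of x, plus a final pass over the counters.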

What do you think?

Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/419812.html
