嗨,大家。很抱歉打擾你。
我有這個任務,我有一個存盤在一個串列中的哈希編碼串列,其中包含 30 個位置,值為 0 和 1。總的來說,我有超過 10000 個這樣的 30 大小(0/1)哈希碼,我想找到所有對這種哈希碼的差異低于給定閾值(例如 0、1、5),在這種情況下,這對將被視為“相似”哈希碼。
我已經在python3中使用雙“for回圈”實作了這一點(見下面的代碼),但我覺得它不夠高效,因為這似乎是一個O(N!),當N = 10000或時它確實很慢更大。
我的問題是有沒有更好的方法可以加快尋找相似哈希對的速度?理想情況下,我想在 O(N) 中?
注意效率我的意思是在給定的情況下找到相似的對而不是生成哈希編碼(這僅用于演示)。
我已經深入研究了這個問題,我找到的所有答案都是關于使用某種收集工具來查找相同的對,但在這里我有一個更一般的情況,即在給定閾值的情況下,這些對也可能是相似的。
我提供了生成示例散列編碼的代碼和我正在使用的當前低效程式。我希望你會發現這個問題很有趣,希望一些更好/更聰明/高級的程式員可以幫我解決這個問題。提前致謝。
import random
import numpy as np
# HashCodingSize = 10
# Just use this to test the program
HashCodingSize = 100
# HashCodingSize = 1000
# What can we do when we have the list over 10000, 100000 size ?
# This is where the problem is
# HashCodingSize = 10000
# HashCodingSize = 100000
#Generating "HashCodingSize" of list with each element has size of 30
outputCodingAllPy = []
for seed in range(HashCodingSize):
random.seed(seed)
listLength = 30
numZero = random.randint(1, listLength)
numOne = listLength - numZero
my_list = [0] * numZero [1] * numOne
random.shuffle(my_list)
# print(my_list)
outputCodingAllPy.append(my_list)
#Covert to np array which is better than python3 list I suppose?
outputCodingAll = np.asarray(outputCodingAllPy)
print(outputCodingAll)
print("The N is", len(outputCodingAll))
hashDiffThreshold = 0
#hashDiffThreshold = 1
#hashDiffThreshold = 5
loopRange = range(outputCodingAll.shape[0])
samePairList = []
#This is O(n!) I suppose, is there better way ?
for i in loopRange:
for j in loopRange:
if j > i:
if (sum(abs(outputCodingAll[i,] - outputCodingAll[j,])) <= hashDiffThreshold):
print("The pair (", str(i), ", ", str(j), ") ")
samePairList.append([i, j])
print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
print(samePairList)
uj5u.com熱心網友回復:
如果您只需要 30 位向量,那么將其表示為 32 位整數中的 30 位會更好。那么兩個“向量”之間的漢明距離就是xor兩個整數的位數。有計算整數中非零位數的有效演算法。這些可以很容易地使用numpy.
所以演算法是:
- 生成
HashCodingSize0 到 (1<<30)-1 之間的隨機整數。那是一條線numpy.random.randint() - 對于每個值與陣列進行異或運算(參見 參考資料
numpy.bitwise_xor),計算每個異或輸出值中的位數(向量化位計數演算法之一),并找到位計數小于或等于的索引hashDiffThreshold
這仍然是 O(n^2),但只是 python 中的一個回圈;回圈中的每個操作都在一個長度為 n 的向量上進行numpy呼叫。
uj5u.com熱心網友回復:
只要您listLength在計算機上的整數大小范圍內,我就會改用整數。然后,您可以xor將這些值(使用廣播一次對所有值進行異或)來獲得不同的位數,對這些位求和,然后用于nonzero查找符合哈希差異要求的索引。例如:
import numpy as np
import random
HashCodingSize = 10
listLength = 30
outputCodingAll = np.array([random.choice(range(2**listLength)) for _ in range(HashCodingSize)])
# sample result
# array([995834408, 173548139, 717311089, 87822983, 813938401,
# 363814224, 970707528, 907497995, 337492435, 361696322])
distance = bit_count(outputCodingAll[:, np.newaxis] ^ outputCodingAll)
# sample result
# array([[ 0, 10, 15, 18, 14, 18, 8, 12, 18, 16],
# [10, 0, 13, 14, 16, 24, 14, 14, 16, 18],
# [15, 13, 0, 23, 13, 15, 15, 17, 19, 15],
# [18, 14, 23, 0, 18, 16, 18, 12, 12, 14],
# [14, 16, 13, 18, 0, 16, 12, 14, 14, 14],
# [18, 24, 15, 16, 16, 0, 14, 16, 12, 6],
# [ 8, 14, 15, 18, 12, 14, 0, 12, 18, 14],
# [12, 14, 17, 12, 14, 16, 12, 0, 14, 14],
# [18, 16, 19, 12, 14, 12, 18, 14, 0, 12],
# [16, 18, 15, 14, 14, 6, 14, 14, 12, 0]], dtype=int32)
hashDiffThreshold = 10
samePairList = np.transpose(np.nonzero(distance < hashDiffThreshold))
# sample result
# array([[0, 0],
# [0, 6],
# [1, 1],
# [2, 2],
# [3, 3],
# [4, 4],
# [5, 5],
# [5, 9],
# [6, 0],
# [6, 6],
# [7, 7],
# [8, 8],
# [9, 5],
# [9, 9]], dtype=int64)
請注意,結果重復對(例如 [5, 9] 和 [9, 5]),因為它們都作為第一個和第二個運算元進行了測驗)。它還包括針對自身測驗的每個值(顯然是0)。如果需要,可以輕松過濾掉這些結果。
請注意,如果要將任何值轉換為串列,1并且0可以將數字格式化為長度的二進制字串listLength并將每個字符映射到 int,例如
list(map(int, f'{outputCodingAll[0]:0{listLength}b}'))
# sample output
# [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
此代碼使用此答案bit_count中的函式:
def bit_count(arr):
# Make the values type-agnostic (as long as it's integers)
t = arr.dtype.type
mask = t(-1)
s55 = t(0x5555555555555555 & mask) # Add more digits for 128bit support
s33 = t(0x3333333333333333 & mask)
s0F = t(0x0F0F0F0F0F0F0F0F & mask)
s01 = t(0x0101010101010101 & mask)
arr = arr - ((arr >> 1) & s55)
arr = (arr & s33) ((arr >> 2) & s33)
arr = (arr (arr >> 4)) & s0F
return (arr * s01) >> (8 * (arr.itemsize - 1))
uj5u.com熱心網友回復:
此變體產生與您的代碼相同的結果
outer_not_equal = np.not_equal.outer(outputCodingAll, outputCodingAll)
diff_count_matrix = outer_not_equal.sum((1, 3)) // outputCodingAll.shape[1]
samePairNumpy = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
# filtering out diagonal values
samePairNumpy = samePairNumpy[samePairNumpy[:, 0] != samePairNumpy[:, 1]]
# filtering out duplicates above diagonal
samePairNumpy.sort(axis=1)
samePairNumpy = np.unique(samePairNumpy, axis=0)
array([[22, 23],
[22, 54],
[22, 62],
[22, 70],
[22, 93],
[23, 54],
[23, 62],
[23, 70],
[23, 93],
[54, 62],
[54, 70],
[54, 93],
[62, 70],
[62, 93],
[70, 93]], dtype=int64)
與您的代碼產生的結果相同
#Generating "HashCodingSize" of list with each element has size of 30
outputCodingAllPy = []
for seed in range(HashCodingSize):
random.seed(seed)
listLength = 30
numZero = random.randint(1, listLength)
numOne = listLength - numZero
my_list = [0] * numZero [1] * numOne
random.shuffle(my_list)
# print(my_list)
outputCodingAllPy.append(my_list)
#Covert to np array which is better than python3 list I suppose?
outputCodingAll = np.asarray(outputCodingAllPy)
print(outputCodingAll)
print("The N is", len(outputCodingAll))
hashDiffThreshold = 0
#hashDiffThreshold = 1
#hashDiffThreshold = 5
loopRange = range(outputCodingAll.shape[0])
samePairList = []
#This is O(n!) I suppose, is there better way ?
for i in loopRange:
for j in loopRange:
if j > i:
if (sum(abs(outputCodingAll[i,] - outputCodingAll[j,])) <= hashDiffThreshold):
print("The pair (", str(i), ", ", str(j), ") ")
samePairList.append([i, j])
print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
print(samePairList)
[[22, 23], [22, 54], [22, 62], [22, 70], [22, 93], [23, 54], [23, 62], [23, 70], [23, 93], [54, 62], [54, 70], [54, 93], [62, 70], [62, 93], [70, 93]]
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/521943.html
