我有一個作業代碼來進行一些計算并創建一個資料幀,但是如果 id:s 考慮的數字增長(實際上,時間呈指數增長),則需要相當長的時間。
所以,如果情況是這樣:我有一個由向量??組成的資料框,每個向量一個id:
id vector
0 A4070270297516241 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,...
1 A4060461064716279 [0, 2, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 A4050500015016271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, ...
3 A4050494283416274 [15, 13, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
4 A4050500876316279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 A4050494111016270 [6, 10, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
6 A4050470673516272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 A4060461035616276 [0, 0, 0, 11, 0, 15, 13, 0, 5, 3, 0, 0, 0, 0, ...
8 A4050500809916271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 A4050500822216279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, ...
10 A4050494817416277 [0, 0, 0, 0, 0, 4, 9, 0, 5, 8, 0, 15, 0, 0, 8,...
11 A4060462005116279 [15, 12, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
12 A4050500802316278 [0, 0, 0, 0, 0, 1, 2, 0, 2, 2, 0, 15, 12, 0, 8...
13 A4050500841416272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 5, ...
14 A4050494856516271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, ...
15 A4060462230216270 [0, 0, 2, 2, 15, 15, 10, 0, 0, 0, 0, 0, 0, 0, ...
16 A4090150867216273 [0, 0, 0, 0, 0, 0, 0, 13, 6, 3, 0, 2, 0, 15, 4...
17 A4060464010916275 [0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
18 A4139311891213540 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
19 A4050500938416279 [0, 10, 11, 6, 6, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0...
20 A4620451871516274 [0, 0, 0, 0, 15, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
21 A4060460331116279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 15, 0, 2,...
我dict在問題的末尾提供了一個以避免混亂。
現在,我要做的就是,我決定,為每個id哪些其他id最接近通過計算每個之間的加權距離vector,并創建一個資料幀存盤資料:
ids = list(set(df1.id))
Closest_identifier = pd.DataFrame(columns = ['id','Identifier','Closest identifier','distance'])
我的代碼是這樣的:
import time
t = time.process_time()
for idnr in ids:
df_identifier = df1[df1['id'] == idnr]
identifier = df_identifier['vector'].to_list()
base_identifier = np.array(df_identifier['vector'].to_numpy().tolist())
Number_of_devices =len(np.nonzero(identifier)[1])
df_other_identifier = df1[df1['id'] != idnr]
other_id = list(set(df_other_identifier.id))
for id_other in other_id:
gf_identifier = df_other_identifier[df_other_identifier['id']==id_other]
identifier_other = np.array(gf_identifier['vector'].to_numpy().tolist())
dist = np.sqrt(np.sum((base_identifier - identifier_other)**2/Number_of_devices))
Closest_identifier = Closest_identifier.append({'id':id_other,'Identifier':base_identifier, 'Closest identifier':identifier_other, 'distance':dist},ignore_index=True)
elapsed_time = time.process_time() - t
print(elapsed_time)
6.0625
解釋發生了什么:在代碼的第一部分,我選擇了一個id并設定了我需要的所有資訊。設備的數量是與該相關聯的向量的非零值id的數量(即,檢測到具有該物件的設備的數量id)。在第二部分中,我計算了它id與所有其他物體的距離。
因此,對于每個id,我都有n-1行,其中 n 是我的id集合的長度。所以,對于 50 個 id,我有 50*50-50 = 2450 行
這里給出的時間是 50 個 ID。對于 200,回圈完成的時間為 120 秒,對于 400,時間為 871 秒。如您所見,時間在這里呈指數增長。我有 1700 個 ID,這需要幾天時間才能完成。
我的問題是:有沒有更有效的方法來做到這一點?
感謝您的見解。
Test data
{'id': {0: 'A4070270297516241',
1: 'A4060461064716279',
2: 'A4050500015016271',
3: 'A4050494283416274',
4: 'A4050500876316279',
5: 'A4050494111016270',
6: 'A4050470673516272',
7: 'A4060461035616276',
8: 'A4050500809916271',
9: 'A4050500822216279',
10: 'A4050494817416277',
11: 'A4060462005116279',
12: 'A4050500802316278',
13: 'A4050500841416272',
14: 'A4050494856516271',
15: 'A4060462230216270',
16: 'A4090150867216273',
17: 'A4060464010916275',
18: 'A4139311891213540',
19: 'A4050500938416279',
20: 'A4620451871516274',
21: 'A4060460331116279',
22: 'A4060454590916277',
23: 'A4060454778016276',
24: 'A4060462019716270',
25: 'A4050500945416277',
26: 'A4050494267716279',
27: 'A4090281644816244',
28: 'A4050500929516270',
29: 'N4010442537213363',
30: 'A4050500938216277',
31: 'A4060454598916275',
32: 'A4050494086216273',
33: 'A4060462859616271',
34: 'A4060454600116271',
35: 'A4050494551816276',
36: 'A4610490015816279',
37: 'A4060454605416279',
38: 'A4060454665916270',
39: 'A4060454579316278',
40: 'A4060464023516275',
41: 'A4050500588616272',
42: 'A4050500905516274',
43: 'A4070262442416243',
44: 'A4050500946716271',
45: 'A4070271195016244',
46: 'A4060454663216271',
47: 'A4060454590416272',
48: 'A4060461993616279',
49: 'N4010442139713366'},
'vector': {0: [0,
0,
0,
0,
0,
0,
0,
0,
7,
4,
0,
0,
0,
13,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
1: [0,
2,
0,
0,
6,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
9,
14],
2: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
3,
0,
0,
0,
5,
12,
15,
2,
0,
0,
0,
0,
0,
0],
3: [15,
13,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
5],
4: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
15,
15,
0,
0,
0,
0,
0,
0],
5: [6,
10,
1,
0,
2,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
5,
10,
13,
15],
6: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
15,
7,
2,
0,
0,
0],
7: [0,
0,
0,
11,
0,
15,
13,
0,
5,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
8: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
3,
10,
2,
0,
0],
9: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
8,
15,
5,
0,
0,
0,
0,
0,
0,
0,
0,
0],
10: [0,
0,
0,
0,
0,
4,
9,
0,
5,
8,
0,
15,
0,
0,
8,
2,
0,
0,
0,
0,
0,
0,
0,
5,
0,
0],
11: [15,
12,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
0,
6,
2],
12: [0,
0,
0,
0,
0,
1,
2,
0,
2,
2,
0,
15,
12,
0,
8,
1,
9,
2,
0,
0,
0,
0,
0,
0,
0,
0],
13: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
6,
0,
5,
3,
11,
15,
11,
1,
0,
0,
0,
0,
0,
0],
14: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
5,
0,
3,
0,
7,
12,
14,
1,
0,
0,
0,
0,
0,
0],
15: [0,
0,
2,
2,
15,
15,
10,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
15,
2,
15],
16: [0,
0,
0,
0,
0,
0,
0,
13,
6,
3,
0,
2,
0,
15,
4,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
17: [0,
0,
0,
0,
7,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
15,
15,
8,
2],
18: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
12,
9,
2,
0,
0],
19: [0,
10,
11,
6,
6,
2,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4],
20: [0,
0,
0,
0,
15,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
4,
14,
13,
11],
21: [0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
0,
5,
15,
0,
2,
3,
10,
9,
0,
0,
0,
0,
0,
0,
0,
0],
22: [0,
0,
0,
2,
7,
15,
15,
0,
2,
3,
0,
15,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
0,
4],
23: [2,
15,
15,
4,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
11],
24: [0,
0,
0,
0,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
6,
15,
14,
2],
25: [0,
9,
13,
15,
6,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4],
26: [0,
0,
0,
1,
2,
8,
15,
0,
1,
4,
0,
15,
1,
0,
6,
0,
0,
0,
0,
0,
0,
0,
0,
8,
0,
0],
27: [0,
0,
0,
0,
0,
0,
0,
0,
2,
1,
0,
0,
0,
5,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
28: [0,
7,
9,
6,
6,
4,
7,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
5],
29: [8,
6,
2,
2,
6,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
7,
11],
30: [0,
10,
11,
15,
6,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4],
31: [6,
15,
6,
0,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
3,
8],
32: [0,
0,
0,
0,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
2,
11,
8,
9,
2],
33: [0,
0,
0,
0,
0,
11,
15,
0,
2,
4,
0,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
34: [4,
15,
15,
5,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
1,
8],
35: [2,
1,
1,
0,
12,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
9,
15,
14,
12],
36: [0,
0,
0,
0,
0,
0,
5,
15,
4,
2,
0,
0,
0,
1,
2,
0,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0],
37: [0,
0,
0,
0,
0,
11,
15,
0,
15,
15,
0,
14,
0,
3,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4],
38: [0,
0,
0,
0,
0,
3,
14,
0,
10,
15,
0,
14,
0,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
39: [0,
0,
2,
0,
8,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
15,
15,
5],
40: [0,
0,
0,
3,
0,
4,
10,
5,
15,
14,
0,
2,
2,
13,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
41: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
15,
10,
0,
0,
0,
0,
0,
0],
42: [0,
0,
0,
0,
0,
0,
0,
1,
0,
2,
0,
2,
0,
10,
15,
14,
6,
2,
0,
0,
0,
0,
0,
0,
0,
0],
43: [0,
0,
0,
0,
0,
0,
3,
2,
7,
8,
0,
2,
0,
15,
8,
2,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0],
44: [0,
0,
0,
0,
0,
0,
0,
0,
0,
5,
0,
11,
15,
0,
3,
0,
13,
12,
0,
0,
0,
0,
0,
0,
0,
0],
45: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
3,
11,
0,
0,
0,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0],
46: [0,
0,
0,
0,
0,
3,
11,
0,
15,
15,
0,
15,
2,
9,
5,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
47: [0,
2,
3,
0,
15,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
15,
15,
6],
48: [0,
0,
0,
0,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
3,
0,
11,
7,
9,
3],
49: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
12,
15,
9,
0,
0,
0]}}
uj5u.com熱心網友回復:
嘗試:
# id1: [0, 0, 0, 1, 1, 1]
m = np.repeat(np.vstack(df1['vector']), df1.shape[0], axis=0)
# id2: [0, 1, 0, 1, 0, 1]
n = np.tile(np.vstack(df1['vector']), (df1.shape[0], 1))
# number of devices for each vector of m
d = np.count_nonzero(m, axis=1, keepdims=True)
# compute the distance
dist = np.sqrt(np.sum((m - n)**2/d, axis=-1))
# create the final dataframe
mi = pd.MultiIndex.from_product([df1['id']] * 2, names=['id1', 'id2'])
out = pd.DataFrame({'vector1': m.tolist(),
'vector2': n.tolist(),
'distance': dist}, index=mi).reset_index()
輸出:
>>> out
id1 id2 vector1 vector2 distance
0 A4070270297516241 A4070270297516241 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... 0.000000
1 A4070270297516241 A4060461064716279 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 2, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 13.747727
2 A4070270297516241 A4050500015016271 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, ... 14.628739
3 A4070270297516241 A4050494283416274 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [15, 13, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... 15.033296
4 A4070270297516241 A4050500876316279 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 15.099669
... ... ... ... ... ...
2495 N4010442139713366 A4070271195016244 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 11, 0, 0,... 13.916417
2496 N4010442139713366 A4060454663216271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 3, 11, 0, 15, 15, 0, 15, 2, 9,... 21.330729
2497 N4010442139713366 A4060454590416272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 2, 3, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 19.304576
2498 N4010442139713366 A4060461993616279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 12.288206
2499 N4010442139713366 N4010442139713366 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 0.000000
[2500 rows x 5 columns]
表現
%timeit loop_op()
3.75 s ± 88.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vect_corralien()
4.16 ms ± 12.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
uj5u.com熱心網友回復:
您可以通過廣播計算平方距離。完成后,您只需為每一行找到 num_devices 并使用它來計算您的自定義距離。用無限值填充對角線后,您可以獲得每行的最小索引,這為您提供最接近的設備。
arr = np.array([value for value in data['vector'].values()])
squared_distances = np.power((arr[:,None,:] - arr[None,:,:]).sum(axis=-1), 2)
num_devices = (arr != 0).sum(axis=1)
distances = np.sqrt(squared_distances / num_devices )
np.fill_diagonal(distances, np.inf)
closest_indentifiers = distances.argmin(axis=0)
您可以根據需要格式化程式的輸出
轉載請註明出處,本文鏈接:https://www.uj5u.com/caozuo/366592.html
標籤:python-3.x pandas
上一篇:部分級別的熊貓多索引交集
