Efficient way to replace values in a column, starting from a list of pairs

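I am trying to replace duplicates in my data, and I am looking for an efficient way to do it.

I have a df with 2 columns, idA and idB, as shown below:

This is a df of similarities. I want to create a dictionary where each key is an id and each value is a list containing all the devices linked to that key. Example: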

d[5] = [22, 6000]
d[22] = [5, 590]

This is what I am doing at the moment:

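# Every id that appears in either column
ids = set(gigi_confirmed['idA'].unique()).union(set(gigi_confirmed['idB'].unique()))

# A_confirmed and B_confirmed are not defined in the question; presumably
# they hold the idA and idB columns of gigi_confirmed
dup_list = list(zip(A_confirmed, B_confirmed))

dict_dup = dict()

for j in ids:
    l1 = []
    # Scan every pair for each id (quadratic overall)
    for i in range(0, len(dup_list)):
        if j in dup_list[i]:
            # Keep the other id of the pair
            l2 = list(dup_list[i])
            l2.remove(j)
            l1.append(l2[0])
            dict_dup[j] = l1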

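Is there any way to make this more efficient?

Reply from a uj5u.com user:

I have to do some guessing here because your question isn't very clear, but the way I understand it, you want a dictionary that maps each id found in idA or idB to the list of ids paired with it on the other side.

If I understood your question correctly, I would solve it by directly constructing a dictionary that maps ids to sets of ids.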

idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = dict()
for a, b in zip(idA, idB):
    if a not in dict_dup:
        dict_dup[a] = set()
    dict_dup[a].add(b)

    if b not in dict_dup:
        dict_dup[b] = set()
    dict_dup[b].add(a)

After running this, print(dict_dup) outputs

{22: {5, 590}, 5: {6000, 22}, 590: {22}, 6000: {5}}

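I think this is the data structure you are looking for.

Because it uses dicts and sets, this code is very efficient. It makes a single pass over the (idA, idB) pairs, and each dict lookup and set insertion takes constant time on average, so it runs in time linear in the number of pairs.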
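The question asks for lists as values (d[5] = [22, 6000]) rather than sets. As a minimal sketch, assuming the order inside each list does not matter, the sets can be converted to sorted lists once the dictionary is built. The name dict_lists is only illustrative, and setdefault is used here as a compact stand-in for the "if a not in dict_dup" checks:

```python
idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = {}
for a, b in zip(idA, idB):
    # setdefault creates the empty set the first time a key is seen
    dict_dup.setdefault(a, set()).add(b)
    dict_dup.setdefault(b, set()).add(a)

# Convert each set to a sorted list to match the shape asked for in the question
dict_lists = {key: sorted(values) for key, values in dict_dup.items()}

print(dict_lists[5])   # [22, 6000]
print(dict_lists[22])  # [5, 590]
```

Building with sets first and converting at the end keeps the insertions cheap and removes repeated pairs automatically.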

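Shorter code using defaultdict

You can also make this code shorter by using a defaultdict instead of a regular dict, which automatically creates those empty sets when needed: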

from collections import defaultdict

idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = defaultdict(set)
for a, b in zip(idA, idB):
    dict_dup[a].add(b)
    dict_dup[b].add(a)

The print statement produces slightly different output, but it is equivalent:

defaultdict(<class 'set'>, {22: {5, 590}, 5: {6000, 22}, 590: {22}, 6000: {5}})

This still contains the info you want, and is just as efficient as the first solution.

Putting it back in your data frame

Now, if you need to put this information back in your dataframe, you can use dict_dup to efficiently retrieve what you're looking for for each row.
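One possible way to do that lookup, sketched under assumptions: the small DataFrame below is a stand-in for the asker's data, and the column name linked_to_idA is made up for illustration. It builds the dictionary straight from the two columns and then uses Series.map to attach the linked ids to each row:

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample frame standing in for the asker's data
df = pd.DataFrame({"idA": [22, 22, 5], "idB": [5, 590, 6000]})

dict_dup = defaultdict(set)
for a, b in zip(df["idA"], df["idB"]):
    dict_dup[a].add(b)
    dict_dup[b].add(a)

# Plain dict of sorted lists, so the new column is easy to read
linked = {key: sorted(values) for key, values in dict_dup.items()}

# For each row, look up every id linked to that row's idA
df["linked_to_idA"] = df["idA"].map(linked)
print(df)
```

What to do with the linked ids afterwards (for example, which one to keep as the replacement) depends on details the question does not give.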

Reply from a uj5u.com user:

Assuming this is a pandas DataFrame, we can groupby "idA", collect "idB" values of each group in a list and use to_dict for the dictionary:

out = df.groupby('idA')['idB'].apply(list).to_dict()

Output:

{5: [6000], 22: [5, 590]}
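Note that this dictionary only has keys for ids that appear in idA, so 590 and 6000 are missing and 5 does not list 22. If the links should go both ways, as in the d[5] = [22, 6000] example from the question, one possible sketch is to stack the frame on top of a copy with the two columns swapped before grouping. The sample frame and the names swapped and both are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample frame standing in for the asker's data
df = pd.DataFrame({"idA": [22, 22, 5], "idB": [5, 590, 6000]})

# Swap the column names so each pair also appears in the reverse direction
swapped = df.rename(columns={"idA": "idB", "idB": "idA"})
both = pd.concat([df, swapped], ignore_index=True)

# Same groupby as above, now covering every id on either side
out = both.groupby("idA")["idB"].apply(list).to_dict()
print(out)
# {5: [6000, 22], 22: [5, 590], 590: [22], 6000: [5]}
```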

That being said, it's not exactly the best way to replace duplicates imo. Why are you creating a dictionary? Why not work on the DataFrame itself? But given the very limited data you have provided, we can only speculate.

