假設我有以下資料框,其中第一列是設施,第二列是設施生產的產品。
d = {'Facility': ['A', 'A', 'A', 'B', 'B','B', 'C', 'C', 'C'],
'Product': ['P1', 'P2', 'P3', 'P1', 'P2', 'P4', 'P1', 'P3', 'P5']}
df = pd.DataFrame(data=d)
df
Facility Product
0 A P1
1 A P2
2 A P3
3 B P1
4 B P2
5 B P4
6 C P1
7 C P3
8 C P5
我的目標是弄清楚每對設施生產了多少相同的產品。例如,對于給定的示例,設施 A 和設施 B 同時產生 P1 和 P2,因此,我希望得到兩個作為結果。
我撰寫了以下代碼片段,但是,它的執行速度很慢。我想知道是否有人可以向我展示實作這一目標的更好方法。
#get all the product information
products = df['Product'].unique()
#create pair dataframe
colNames = ['Fac1', 'Fac2', 'Count']
linkedFacilities = pd.DataFrame(columns = colNames)
for p in products:
#get a sub data frame with the corresponding product
myDf = df[df['Product'].str.contains(p)]
myList = [ (f1, f2)
for f1 in myDf.Facility
for f2 in myDf.Facility if f1 != f2]
temp = pd.DataFrame(data=myList, columns=colNames[:2])
#for each identified product between two facilities, starts adding a column
temp['Count'] = 1
linkedFacilities = pd.concat([linkedFacilities, temp],ignore_index=True)
然后,我累積每一行并消除重復項。
grouped = linkedFacilities.groupby(["Fac1", "Fac2"]).size().reset_index()
grouped.rename(index={0: "Fac1", 1: "Fac2", 2: "Count"})
m = pd.DataFrame(np.sort(grouped[['Fac1','Fac2']], axis=1), index=grouped.index).duplicated()
grouped = grouped[~m]
grouped
A B 2
A C 2
B C 1
uj5u.com熱心網友回復:
merge這是使用and的一個選項value_counts。然后通過使索引凍結集洗掉對并使用以下方法過濾掉重復項duplicated:
counts = (df.merge(df, on='Product', suffixes=('_1', '_2'))
.query('Facility_1!=Facility_2')
.value_counts(['Facility_1','Facility_2']))
counts = (counts.reset_index(name='Number of shared products')
.loc[pd.Series(counts.index.map(frozenset)).duplicated(keep='last')]
.reset_index(drop=True))
輸出:
Facility_1 Facility_2 Number of shared products
0 A B 2
1 A C 2
2 B C 1
uj5u.com熱心網友回復:
這是一種使用分組聚合的方法:
from itertools import combinations
(df
.groupby('Product')['Facility']
.agg(lambda x: list(map(frozenset, combinations(set(x), r=2))))
.explode()
.value_counts()
)
輸出:
(A, C) 2
(A, B) 2
(C, B) 1
Name: Facility, dtype: int64
替代方案,對于設施的一致排序:
from itertools import combinations
out = (df
.groupby('Product')['Facility']
.agg(lambda x: list(combinations(sorted(set(x)), r=2)))
.explode()
.value_counts()
)
輸出:
(A, B) 2
(A, C) 2
(B, C) 1
Name: Facility, dtype: int64
轉載請註明出處,本文鏈接:https://www.uj5u.com/qianduan/469024.html
上一篇:從熊貓的列中減去特定值
