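I'm relatively new to Python, and I'm trying to reconstruct conversations/threads from a DataFrame that contains lists of IDs.
I currently have a pandas DataFrame of tweets/Reddit posts that looks roughly like this: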
| id | text | parent_id | replies |
|---|---|---|---|
| id1 | blah blah | _ post _ | id2, id3, id4, id5, id6, id7 |
| id2 | blah blah | id1 | id4, id5, id6, id7 |
| id3 | blah blah | id1 | |
| id4 | blah blah | id2 | id6, id7 |
| id5 | blah blah | id2 | |
| id6 | blah blah | id4 | id7 |
| id7 | blah blah | id6 | |
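My goal is to split the data into threads/conversations based on the IDs. For the example above, that means getting the following sequences as output:
[id1, id2, id4, id6],
[id1, id2, id4, id7],
[id1, id2, id5], and
[id1, id3].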
Having these lists would let me view each thread in full. My current code is quite convoluted and looks like this:
    out_list = []
    for i, row in df.iterrows():
        id_ = row["id"]
        # create our output file
        sequence = [id_]
        replies = list(row['replies'])
        # creates a new dataframe from the replies to the topline comment in question
        reply_df = df.loc[df['id'].isin(replies)]
        reply_df = reply_df[reply_df.Parent_id2 == id_]
        # check if ends at topline
        if reply_df.empty == False:
            def turn_recursion(df, reply_df):
                for j, row_ in reply_df.iterrows():
                    replies_2 = reply_df.loc[j, 'replies']
                    id_2 = row_["id"]
                    reply_df2 = df.loc[df['id'].isin(replies_2)]
                    reply_df2 = reply_df2[reply_df2.Parent_id2 == id_2]
                    nonlocal sequence
                    nonlocal out_list
                    if reply_df2.empty == False:
                        sequence.append(id_2)
                        return(turn_recursion(df, reply_df2))
                    else:
                        sequence.append(id_2)
                        out_list.append(sequence)
            turn_recursion(test2, reply_df)
        else:
            out_list.append(sequence)
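This currently gives me semi-accurate results, but instead of [[id1, id2, id4, id6], [id1, id2, id4, id7]], I get [id1, id2, id4, id6, id7].
I realize I may be a bit sleepy and that there is probably a simple solution, but for the life of me I can't find a way to do this that works properly and handles threads of any length.
Thanks in advance for any suggestions. :)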
Answer from a uj5u.com user:
Use networkx to get what you want:
    import pandas as pd
    import networkx as nx
    from collections import defaultdict

    data = defaultdict(list)

    # Build graph from pandas
    G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                                create_using=nx.DiGraph)

    # Find leaves (id3, id5, id7)
    leaves = [node for node, degree in G.out_degree() if degree == 0]

    # Enumerate all possible paths
    for node in df['id']:
        for leaf in leaves:
            for path in nx.all_simple_paths(G, node, leaf):
                data[node].append(path)
Output:
    >>> data
    defaultdict(list,
                {'id1': [['id1', 'id3'],
                  ['id1', 'id2', 'id5'],
                  ['id1', 'id2', 'id4', 'id6', 'id7']],
                 'id2': [['id2', 'id5'], ['id2', 'id4', 'id6', 'id7']],
                 'id4': [['id4', 'id6', 'id7']],
                 'id6': [['id6', 'id7']]})
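The dictionary above holds paths starting from every comment, whereas the question asks only for complete threads, that is, paths that begin at a top-level post. A minimal sketch of that filtering step follows. It assumes `data` has exactly the contents shown in the output above and that a top-level post is any row whose `parent_id` is not itself an `id`; the names `parent_of`, `roots`, and `threads` are introduced here purely for illustration.

```python
# Paths per starting node, copied from the output shown above.
data = {
    "id1": [["id1", "id3"], ["id1", "id2", "id5"],
            ["id1", "id2", "id4", "id6", "id7"]],
    "id2": [["id2", "id5"], ["id2", "id4", "id6", "id7"]],
    "id4": [["id4", "id6", "id7"]],
    "id6": [["id6", "id7"]],
}

# id -> parent_id, taken from the example table in the question.
parent_of = {
    "id1": "_ post _", "id2": "id1", "id3": "id1", "id4": "id2",
    "id5": "id2", "id6": "id4", "id7": "id6",
}

# Assumption: a top-level post is a row whose parent is not another row.
roots = [node for node, parent in parent_of.items() if parent not in parent_of]

# Keep only the paths that begin at a top-level post.
threads = [path for root in roots for path in data.get(root, [])]
print(threads)
```

With a DataFrame at hand, `parent_of` could presumably be built as `dict(zip(df['id'], df['parent_id']))` instead of being typed out.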
If you want to merge the dictionary into the DataFrame:
    df['replies'] = df['id'].map(data)
    print(df)

    # Output:
        id       text  parent_id                                            replies
    0  id1  blah blah   _ post _  [[id1, id3], [id1, id2, id5], [id1, id2, id4, ...
    1  id2  blah blah        id1                 [[id2, id5], [id2, id4, id6, id7]]
    2  id3  blah blah        id1                                                 []
    3  id4  blah blah        id2                                  [[id4, id6, id7]]
    4  id5  blah blah        id2                                                 []
    5  id6  blah blah        id4                                       [[id6, id7]]
    6  id7  blah blah        id6                                                 []
Now you can explode your DataFrame:
    df = df.explode('replies')
    print(df)

    # Output:
        id       text  parent_id                    replies
    0  id1  blah blah   _ post _                 [id1, id3]
    0  id1  blah blah   _ post _            [id1, id2, id5]
    0  id1  blah blah   _ post _  [id1, id2, id4, id6, id7]
    1  id2  blah blah        id1                 [id2, id5]
    1  id2  blah blah        id1       [id2, id4, id6, id7]
    2  id3  blah blah        id1                        NaN
    3  id4  blah blah        id2            [id4, id6, id7]
    4  id5  blah blah        id2                        NaN
    5  id6  blah blah        id4                 [id6, id7]
    6  id7  blah blah        id6                        NaN
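For comparison, the same root-to-leaf threads can be built without networkx by walking the `parent_id` links directly. The sketch below is an illustration, not part of the original answer. It assumes the `(id, parent_id)` pairs from the example table, and the helper name `iter_threads` is hypothetical. Each branch extends its own copy of the path (`path + [child]`). Sharing one mutable list across branches, as the `sequence` list in the question does, is one likely reason sibling replies ended up merged into a single list.

```python
from collections import defaultdict

# (id, parent_id) pairs from the example table in the question.
rows = [
    ("id1", "_ post _"), ("id2", "id1"), ("id3", "id1"), ("id4", "id2"),
    ("id5", "id2"), ("id6", "id4"), ("id7", "id6"),
]

# Map each parent to its direct replies, preserving row order.
children = defaultdict(list)
for node, parent in rows:
    children[parent].append(node)

ids = {node for node, _ in rows}
# Assumption: a top-level post is a row whose parent is not itself a row.
roots = [node for node, parent in rows if parent not in ids]


def iter_threads(node, path):
    """Yield every path from `node` down to a comment with no replies."""
    replies = children.get(node, [])
    if not replies:
        yield path
        return
    for child in replies:
        # path + [child] creates a new list, so branches never share state.
        yield from iter_threads(child, path + [child])


threads = [t for root in roots for t in iter_threads(root, [root])]
print(threads)
```

Because the helper is recursive, extremely deep threads could hit Python's recursion limit; an explicit stack would avoid that at the cost of slightly longer code.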
