這是我的資料-
inp = [{'father_husband_mother_name': [['Father s Name', 0.8603670001029968],
['Shripati', 0.8603670001029968],
['Father s Name', 0.8903670001029969],
['Shpppati', 0.8903670001029969]],
'doc_id': [['GGX2176', 0.8435981869697571],
['GGC2176', 0.8835981869697571]],
'name': [['Elector s Name', 0.8301510810852051],
['Shibshankar Ghosh', 0.8301510810852051],
['Elector s Name', 0.8501510810852051],
['Shibshankar Ghosh', 0.8501510810852051]],
'date_of_birth': [['Age as on 1.1.2000', 0.8067844915390014],
['15', 0.8067844915390014],
['Age as on 1.1.2000', 0.8267844915390015],
['15', 0.8267844915390015]],
'gender_sex': [['Sex', 0.7784658074378967],
['M', 0.7784658074378967],
['Sex', 0.8784658074378967],
['M', 0.8784658074378967]]}]
STOPWORDS = ['Sex', 'Father s Name', 'Elector s Name', 'Address', 'Name', 'Gender', 'Mother s Name',
'Husband s Name']
我期望的輸出:
{'father_husband_mother_name': 'Shpppati',
'doc_id': 'GGC2176',
'name': 'Shibshankar Ghosh',
'date_of_birth': 'Age as on 1.1.2000,15',
'gender_sex': 'M'}
這是邏輯 -
檢索每個鍵中不存在的具有最高置信度得分 [float串列串列內部] 的值。STOPWORDS
我試過的 -
def process_kie_dict(voter_raw_labels, threshold=0.7):
cleaned_dict = {}
intermediate_dict = {}
for entity_dict in voter_raw_labels:
for entity, val in entity_dict.items():
conf_val = [item[1] for item in val]
unique_val = list(set(conf_val))
max_conf = max(unique_val)
if max_conf > threshold:
if len(unique_val)==1:
add_val = [item[0] for item in val]
else:
max_conf_index = conf_val.index(max_conf)
add_val = [item[0] for item in val[max_conf_index:]]
if entity not in intermediate_dict.keys():
intermediate_dict[entity] = [add_val,max_conf]
else:
if intermediate_dict[entity][1] < max_conf:
intermediate_dict[entity] = [add_val,max_conf]
# print(intermediate_dict)
for key, val in intermediate_dict.items():
final_value = ''
for value in val[0]:
m = len(str.strip(value))
edit_dist_list = []
for word in STOPWORDS:
n = len(word)
edit_dist = editDistDP(value, word, m, n)
edit_dist_list.append(edit_dist)
if min(edit_dist_list) < 2:
value=''
final_value = final_value value ','
clean_value = final_value.strip(",")
cleaned_dict[key]=clean_value
return cleaned_dict
def editDistDP(str1, str2, m, n):
# Create a table to store results of subproblems
dp = [[0 for x in range(n 1)] for x in range(m 1)]
# Fill d[][] in bottom up manner
for i in range(m 1):
for j in range(n 1):
# If first string is empty, only option is to
# insert all characters of second string
if i == 0:
dp[i][j] = j # Min. operations = j
# If second string is empty, only option is to
# remove all characters of second string
elif j == 0:
dp[i][j] = i # Min. operations = i
# If last characters are same, ignore last char
# and recur for remaining string
elif str1[i-1] == str2[j-1]:
dp[i][j] = dp[i-1][j-1]
# If last character are different, consider all
# possibilities and find minimum
else:
dp[i][j] = 1 min(dp[i][j-1], # Insert
dp[i-1][j], # Remove
dp[i-1][j-1]) # Replace
return dp[m][n]
您可以忘記編輯距離的實作,這并不重要。我想知道的是嵌套的 for 回圈,這不會大規模作業。尋找更有效的實施。
uj5u.com熱心網友回復:
這是您的資料的決議器
result = {k: sorted(v, key=lambda x: x[1] if x[0] not in STOPWORDS else 0)[-1][0] for k, v in inp[0].items()}
簡而言之,它需要一個鍵并根據置信度值對字典的其余部分進行排序,除非串列的第一個元素包含在STOPWORDS. 然后將該排序串列的第一個元素result作為值添加到字典中。
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/514368.html
