I have a DataFrame and a list of strings:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
'PORTUGAL', 'PORTUGLA'],
'Column_two': [1,2,3,4,5,6,7,8]
})
print(df)
# Output:
Name Column_two
PARIS 1
NEW YORK 2
MADRI 3
PARI 4
P ARIS 5
NOW YORK 6
PORTUGAL 7
PORTUGLA 8
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
I am using the fuzzywuzzy Python library. Its fuzz.partial_ratio method returns a number indicating how similar the two compared strings are. Example:
fuzz.partial_ratio("BRASIL", "BRAZIL")
# Output:
88
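For intuition about what a partial-ratio style score measures, here is a minimal standard-library sketch. It is not fuzzywuzzy's implementation: the function name partial_ratio_sketch is hypothetical, it relies on difflib.SequenceMatcher, and its scores may differ from what fuzz.partial_ratio returns (especially when python-Levenshtein is installed). The idea is to slide the shorter string over the longer one and keep the best window score, scaled to 0..100.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Hypothetical helper: best similarity (0..100) between the shorter
    string and any same-length window of the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    matcher = SequenceMatcher(None, shorter, longer)
    best = 0.0
    # Each matching block suggests one alignment of the shorter string
    # against the longer one; score the window at that alignment.
    for block in matcher.get_matching_blocks():
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# Compare a reference name with a few variants from the question.
for candidate in ["PARIS", "PARI", "P ARIS", "MADRI"]:
    print(candidate, partial_ratio_sketch("PARIS", candidate))
```

Because only the best-aligned window counts, a string that is fully contained in the other one (such as PARI inside PARIS) gets the maximum score, which is why a partial ratio is forgiving of truncated names.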
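I want to iterate over the DataFrame's "Name" column and compare each string against the entries in list_string_correct. If they are similar, I want to replace the value with the correct name (that is, the string from the list). So I wrote the following code: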
for i in range(len(df)):
    for j in range(len(list_string_correct)):
        var_string = list_string_correct[j]
        # Similarity score, from 0 to 100
        result = fuzz.partial_ratio(var_string, df['Name'].iloc[i])
        if result >= 80:  # Condition
            # Assign through .loc[row, column] to avoid chained assignment
            df.loc[i, 'Name'] = var_string
The code works, and the output is what I wanted:
print(df)
# Output:
Name Column_two
PARIS 1
NEW YORK 2
MADRI 3
PARIS 4
PARIS 5
NEW YORK 6
PORTUGAL 7
PORTUGAL 8
However, I had to use two for loops. Is there a way to replace the for loops and keep the same output?
To install the libraries, use:
pip install fuzzywuzzy
pip install python-Levenshtein
uj5u.com community member replied:
Try process.extractOne from the thefuzz package (the successor to fuzzywuzzy, by the same author and with the same API):
# from fuzzywuzzy import process
from thefuzz import process
THRESHOLD = 80
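# extractOne returns a (match, score) tuple, or None when no candidate reaches the cutoff
df['Name'] = (
    df['Name']
    .apply(lambda x: process.extractOne(x, list_string_correct,
                                        score_cutoff=THRESHOLD))
    .str[0]               # keep only the matched name
    .fillna(df['Name'])   # fall back to the original value when there was no match
)
Output: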
>>> df
Name Column_two
0 PARIS 1
1 NEW YORK 2
2 MADRI 3
3 PARIS 4
4 PARIS 5
5 NEW YORK 6
6 PORTUGAL 7
7 PORTUGAL 8
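The same "take the best candidate above a cutoff, otherwise keep the original" pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not the answer's method: difflib.get_close_matches stands in for process.extractOne, the helper name best_match_or_original is hypothetical, difflib ratios run from 0 to 1 rather than 0 to 100, and its scoring differs from thefuzz's default scorer, so results on other data may not agree.

```python
from difflib import get_close_matches

# Assumption: 0.8 on difflib's 0..1 scale mirrors the 80 threshold used above.
CUTOFF = 0.8


def best_match_or_original(name, choices, cutoff=CUTOFF):
    """Hypothetical helper: return the closest choice, or name if none is close enough."""
    matches = get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name


names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']

cleaned = [best_match_or_original(n, list_string_correct) for n in names]
print(cleaned)
```

On the sample names this leaves MADRI unchanged and maps the other misspellings to their corrected forms. With a DataFrame, the helper could be passed to df['Name'].apply in place of the lambda.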
uj5u.com community member replied:
If for some reason you need to use the fuzzywuzzy package (rather than thefuzz, as recommended by @Corralien), you can use a single loop instead:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
'PORTUGAL', 'PORTUGLA'],
'Column_two': [1,2,3,4,5,6,7,8]
})
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
for correct_name in list_string_correct:
    df['Name'] = df['Name'].apply(lambda x: correct_name if fuzz.partial_ratio(correct_name, x) >= 80 else x)
Output:
Name Column_two
0 PARIS 1
1 NEW YORK 2
2 MADRI 3
3 PARIS 4
4 PARIS 5
5 NEW YORK 6
6 PORTUGAL 7
7 PORTUGAL 8
