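I am trying to model a power grid from data that was written by hand in CSV files. For example, I have a column that should be called 'DEPART 1'. In practice I often find 'Départ 1', 'DEP1', 'depart 1', ' DEPART 1 ', or many other variants.
For now, I import it like this: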
import_net_data = pd.read_excel(path_file, sheet_name=None)
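I would like to be able to recognize columns whose names are close to the "official name" (perhaps by ignoring whitespace, upper/lower case, ...).
Is there a proper way to:
- replace any incorrect string with the correct one (without listing every possibility)
- check that each of those column names appears only once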
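Reply from a uj5u.com user:
You need fuzzy string matching here. For Python, one option is the thefuzz package, which scores how similar two strings are based on their Levenshtein distance.
For example: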
from thefuzz import fuzz

st = 'DEPART 1'
strs = ['Départ 1', 'DEP1', 'depart 1', ' DEPART 1 ']
for s in strs:
    # fuzz.ratio returns a 0-100 similarity score (higher means closer),
    # not a raw edit distance
    score = fuzz.ratio(st.lower(), s.lower())
    print(st, s, '|', 'similarity:', score, 'is the same:', score > 60)
Output:
DEPART 1 Départ 1 | similarity: 88 is the same: True
DEPART 1 DEP1 | similarity: 67 is the same: True
DEPART 1 depart 1 | similarity: 100 is the same: True
DEPART 1 DEPART 1 | similarity: 89 is the same: True
For more information, see: https://www.datacamp.com/community/tutorials/fuzzy-string-python
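To make the underlying metric concrete, here is a minimal pure-Python sketch of the Levenshtein distance itself, i.e. the number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The function name `levenshtein` is my own and is not part of thefuzz; the library reports a 0-100 similarity score rather than this raw count.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (illustrative helper)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("depart 1", "dep1"))  # 4: drop "a", "r", "t" and the space
```

A small distance means the names are close; how small is "close enough" still has to be chosen for your data.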
With thefuzz, you can achieve both of your goals.
"Replace any incorrect string":
import pandas as pd
from thefuzz import fuzz

st = 'DEPART 1'
df = pd.DataFrame(columns=['DEPART 1', 'DEP1', 'depart 1', 'depart 1', 'not even close'])
print(df)

cols = []
for column in df.columns:
    # Rename the column when it is similar enough to the official name
    if fuzz.ratio(st.lower(), column.lower()) > 60:
        cols.append(st)
    else:
        cols.append(column)
df.columns = cols
print(df)
Output:
Columns: [DEPART 1, DEP1, depart 1, depart 1, not even close]
Columns: [DEPART 1, DEPART 1, DEPART 1, DEPART 1, not even close]
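If there are several official names, comparing against one target with a fixed threshold can merge columns that should stay apart: 'DEPART 1' and 'DEPART 2' differ by a single character. The sketch below picks the closest official name instead, after ignoring case, accents, and whitespace as suggested in the question. It swaps thefuzz for the standard library's `difflib`; the helper names `normalize` and `closest_official`, the second official name, and the 0.6 cutoff are assumptions made for illustration.

```python
import difflib
import unicodedata


def normalize(name):
    """Lower-case, strip accents, and drop all whitespace (illustrative helper)."""
    decomposed = unicodedata.normalize("NFKD", str(name))
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(no_accents.lower().split())


def closest_official(name, official_names, cutoff=0.6):
    """Return the most similar official name, or None if nothing reaches the cutoff."""
    by_key = {normalize(official): official for official in official_names}
    hits = difflib.get_close_matches(normalize(name), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None


official = ["DEPART 1", "DEPART 2"]  # hypothetical list of official names
for raw in ["Départ 1", "DEP1", " depart 2 ", "not even close"]:
    print(repr(raw), "->", closest_official(raw, official))
```

Names that return None are left for manual review rather than being forced onto an official name.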
"Check how often each column name occurs":
import pandas as pd
import collections
df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])
print(collections.Counter(df.columns))
Output:
Counter({'depart 1': 2, 'DEPART 1': 1, 'DEP1': 1, 'not even close': 1})
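Counting is most useful after the renaming step, because that is when several raw names can collapse onto the same official name. As a rough sketch, the helper below (the name `find_collisions` is made up for this example) groups the original names by the name they were mapped to and reports only the groups with more than one member.

```python
from collections import defaultdict


def find_collisions(original, renamed):
    """Map each new name used more than once to the original names behind it."""
    groups = defaultdict(list)
    for old, new in zip(original, renamed):
        groups[new].append(old)
    return {new: olds for new, olds in groups.items() if len(olds) > 1}


original = ['DEPART 1', 'DEP1', 'depart 1', 'depart 1', 'not even close']
renamed = ['DEPART 1', 'DEPART 1', 'DEPART 1', 'DEPART 1', 'not even close']
print(find_collisions(original, renamed))
```

An empty result means every column name appears only once after renaming.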
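Reply from a uj5u.com user:
I suggest using a regular expression to capture the pattern these column names have in common, and replacing every match with the official name.
You can use the re library for this. Combine it with the regex101 website to find a regular expression that covers all of your cases.
Here is a small code example that handles this particular case: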
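import re

official_name = "depart 1"
column_names = [
    "Départ 1",
    "DEP1",
    "depart 1",
    " DEPART 1 ",
    " depart 1"]
# Raw string so that \s and \D reach the regex engine unchanged.
# Inside [...] a caret only negates when it comes first, so the
# alternatives are simply listed: [dD], [eEéÉ], [pP].
regex = r"\s*[dD][eEéÉ][pP]\D*\s*1\s*"
for name in column_names:
    print(name)
    # fullmatch requires the whole name to fit the pattern, not just a part of it
    result = re.fullmatch(regex, name)
    if result:
        print("Replace with {0}".format(official_name))
    else:
        print("Could not find the regex pattern")
It outputs: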
Départ 1
Replace with depart 1
DEP1
Replace with depart 1
depart 1
Replace with depart 1
DEPART 1
Replace with depart 1
depart 1
Replace with depart 1
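The same idea can be generalized so that one pattern covers every feeder number rather than only "1". This is a sketch under assumptions: that valid names are "dep" or "depart" followed by a number, and that the official spelling is 'DEPART n' as in the question. The names `PATTERN`, `strip_accents`, and `canonical_name` are invented for the example.

```python
import re
import unicodedata

# "dep" or "depart", optional spaces, then the feeder number (assumed naming scheme)
PATTERN = re.compile(r"\s*dep(?:art)?\s*(\d+)\s*", re.IGNORECASE)


def strip_accents(text):
    """Remove combining accents so that 'é' compares equal to 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonical_name(name):
    """Return 'DEPART <n>' for a recognized column name, otherwise None."""
    match = PATTERN.fullmatch(strip_accents(name))
    if match is None:
        return None
    return "DEPART {}".format(int(match.group(1)))


for raw in ["Départ 1", "DEP1", " DEPART 1 ", "depart 12", "not even close"]:
    print(repr(raw), "->", canonical_name(raw))
```

Compared with fuzzy matching, a regex is stricter: it never confuses 'DEPART 1' with 'DEPART 2', but it also rejects any spelling the pattern did not anticipate.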
Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/418859.html
