CSV import: how can I smartly check whether column names are "correct"?

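I am trying to model an electrical grid from data that was written by hand into CSV files. For example, I have a column that should be called 'DEPART 1'. In practice I often find 'Départ 1', 'DEP1', 'depart 1', ' DEPART 1 ' or many other variants...

Right now, I am importing it like this: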

import_net_data = pd.read_excel(path_file, sheet_name=None)

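I would like to be able to identify columns whose names are close to the "official name" (perhaps by ignoring spaces, capitalization, ...).

Is there a proper way to:

replace any incorrect string with the correct one (without listing every possibility)
check that each of those column names appears only once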

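Reply from a uj5u.com user:

You need fuzzy string matching here. For Python, one option is the thefuzz package, which scores how similar two strings are based on their Levenshtein distance.

For example: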

from thefuzz import fuzz


st = 'DEPART 1'
strs = [ 'Départ 1', 'DEP1','depart 1',' DEPART 1 ']

for s in strs:
    # fuzz.ratio returns a similarity score from 0 to 100 (higher means closer),
    # not a raw Levenshtein distance
    score = fuzz.ratio(st.lower(), s.lower())
    print(st, s, '|', 'similarity ratio: ', score, 'is the same: ', score > 60)

Output:

DEPART 1 Départ 1 | similarity ratio:  88   is the same:  True
DEPART 1 DEP1     | similarity ratio:  67   is the same:  True
DEPART 1 depart 1 | similarity ratio:  100  is the same:  True
DEPART 1 DEPART 1 | similarity ratio:  89   is the same:  True

More information: https://www.datacamp.com/community/tutorials/fuzzy-string-python

With fuzzy matching, you can achieve both of your goals.
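Since the question mentions ignoring spaces and capitalization, it may help to normalize names before scoring them. The following is only a minimal sketch using the standard library; the `normalize` helper is a hypothetical name introduced for illustration and is not part of thefuzz.

```python
import unicodedata

def normalize(name):
    """Lowercase, strip accents, and collapse whitespace (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', name)
    without_accents = ''.join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # split() with no argument drops leading, trailing, and repeated whitespace
    return ' '.join(without_accents.lower().split())

for raw in ['Départ 1', 'depart 1', ' DEPART 1 ', ' depart      1', 'DEP1']:
    print(repr(raw), '->', repr(normalize(raw)))
```

Variants that differ only by accents, case, or spacing collapse to the same string, so fuzzy matching is only needed for real abbreviations such as 'DEP1'.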

"Replace any incorrect string":

import pandas as pd
from thefuzz import fuzz

st = 'DEPART 1'

df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])
print(df)

# Rename every column that scores above the threshold to the official name
cols = []
for column in df.columns:
    if fuzz.ratio(st.lower(), column.lower()) > 60:
        cols.append(st)
    else:
        # Leave columns that are not close enough untouched
        cols.append(column)

df.columns = cols

print(df)

Output:

Columns: [DEPART 1, DEP1, depart 1, depart 1, not even close]
Columns: [DEPART 1, DEPART 1, DEPART 1, DEPART 1, not even close]
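If there are several official names, each raw column needs to be mapped to its closest candidate rather than compared against a single string. The sketch below swaps in the standard library's `difflib.get_close_matches` instead of thefuzz, so it runs without extra packages. The `OFFICIAL_NAMES` list, the `closest_official` helper, and the 0.6 cutoff are assumptions made for illustration.

```python
import difflib

# Hypothetical list of official column names
OFFICIAL_NAMES = ['DEPART 1', 'DEPART 2', 'ARRIVEE 1']

def closest_official(name, cutoff=0.6):
    """Return the closest official name, or None if nothing is similar enough."""
    cleaned = name.strip().upper()
    # n=1 keeps only the best candidate; cutoff is a similarity between 0 and 1
    matches = difflib.get_close_matches(cleaned, OFFICIAL_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ['depart 1', ' DEPART 2 ', 'DEP1', 'not even close']:
    print(repr(raw), '->', closest_official(raw))
```

Short abbreviations such as 'DEP1' sit close to the cutoff, so the threshold probably needs tuning against the real data before it is trusted.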

"Check the occurrences of column names":

import pandas as pd
import collections

df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])

print(collections.Counter(df.columns))

Output:

Counter({'depart 1': 2, 'DEPART 1': 1, 'DEP1': 1, 'not even close': 1})
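To turn the counts into an actual check that each name appears only once, the counter can be filtered for values above 1. This is a small sketch on a plain list of names; the list is illustrative data representing columns after renaming, and with a DataFrame one would pass `df.columns` instead.

```python
from collections import Counter

# Illustrative column names as they might look after renaming
columns = ['DEPART 1', 'DEPART 1', 'DEPART 1', 'DEPART 1', 'not even close']

counts = Counter(columns)
# Keep only the names that occur more than once
duplicates = {name: n for name, n in counts.items() if n > 1}

if duplicates:
    print('Columns appearing more than once:', duplicates)
else:
    print('All column names are unique')
```

Running this check after the renaming step matters, because several raw variants can collapse onto the same official name.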

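Reply from a uj5u.com user:

I suggest using regular expressions to identify a suitable pattern shared by these column names, and replacing the matches with the official name.

You can do this with the re library. Combine it with the regex101 website to find a regular expression that covers all of your cases.

Here is a small code sample that handles this particular case: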

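import re

official_name = "depart 1"

column_names = [
    "Départ 1",
    "DEP1",
    "depart 1",
    " DEPART 1 ",
    " depart      1"]

# Raw string so that "\s" and "\D" reach the regex engine unchanged.
# Inside [...] a "^" that is not the first character is a literal caret,
# not an "or", so the carets are dropped from the character classes.
regex = r"\s*[dD][eEéÉ][pP]\D*\s*1\s*"

for name in column_names:
    print(name)
    result = re.search(regex, name)
    if result:
        print("Replace with {0}".format(official_name))
    else:
        print("Could not find the regex pattern")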

It prints the following:

Départ 1
Replace with depart 1
DEP1
Replace with depart 1
depart 1
Replace with depart 1
 DEPART 1 
Replace with depart 1
 depart      1
Replace with depart 1
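The same idea can be extended to several official names by keeping one pattern per name. This is only a sketch: the `PATTERNS` mapping and the `to_official` helper are hypothetical names, and the second pattern is an assumed variant for a 'DEPART 2' column. It uses `fullmatch` rather than `search` so that a name like 'depart 12' is not accepted as 'DEPART 1'.

```python
import re

# Hypothetical mapping from official name to a pattern accepting its variants.
# re.IGNORECASE covers upper and lower case, including É and é.
PATTERNS = {
    'DEPART 1': re.compile(r'\s*d[eé]p\D*1\s*', re.IGNORECASE),
    'DEPART 2': re.compile(r'\s*d[eé]p\D*2\s*', re.IGNORECASE),
}

def to_official(name):
    """Return the official name whose pattern matches the whole string, else None."""
    for official, pattern in PATTERNS.items():
        # fullmatch requires the entire column name to fit the pattern
        if pattern.fullmatch(name):
            return official
    return None

for raw in ['Départ 1', 'DEP1', ' depart      1', 'DEP2', 'depart 12']:
    print(repr(raw), '->', to_official(raw))
```

Names that return None can then be reported for manual review instead of being renamed silently.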
