Efficient way to generate new columns with string similarity distances between two string columns

I have a pandas DataFrame with shape (1138812, 14) and columns

['id', 'name', 'latitude', 'longitude', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories', 'point_of_interest',
       'id_2', 'name_2', 'latitude_2', 'longitude_2', 'address_2', 'city_2',
       'state_2', 'zip_2', 'country_2', 'url_2', 'phone_2', 'categories_2',
       'point_of_interest_2', 'match']

I want to create new columns of string similarity distances using Levenshtein and difflib, namely difflib.SequenceMatcher().ratio(), Levenshtein.distance(), Levenshtein.jaro_winkler() and LongestCommonSubstring(), between each of the columns

['name', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories']

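and the corresponding columns with the _2 suffix. In the end this gives me 9*4 = 36 new columns. Right now I loop over the DataFrame with df.iterrows() and build a list for each new column, but this is very expensive in both time and memory: it takes 3.5 hours and the full 16GB of RAM to get through the whole DataFrame. I am looking for an approach that is better in terms of both time and memory. My code: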

import difflib

import Levenshtein
import pandas as pd
from tqdm.notebook import tqdm

columns = ['name', 'address', 'city', 'state',
           'zip', 'country', 'url', 'phone', 'categories']

# One list per (column, metric) pair, filled row by row.
data_dict = {}
for i in columns:
    data_dict[f"{i}_geshs"] = []
    data_dict[f"{i}_levens"] = []
    data_dict[f"{i}_jaros"] = []
    data_dict[f"{i}_lcss"] = []

# LCS (longest common substring) is defined elsewhere and not shown here.
for i, row in tqdm(train.iterrows(), total=train.shape[0]):
    for j in columns:
        data_dict[f"{j}_geshs"].append(difflib.SequenceMatcher(None, row[j], row[f"{j}_2"]).ratio())
        data_dict[f"{j}_levens"].append(Levenshtein.distance(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_jaros"].append(Levenshtein.jaro_winkler(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_lcss"].append(LCS(str(row[j]), str(row[f"{j}_2"])))
data = pd.DataFrame(data_dict)
train = pd.concat([train, data], axis=1)  # pd.concat expects a list of frames

Reply from a uj5u.com user:

Starting from a DataFrame that looks like this:

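first_name	address	city	state	zip	url	phone	category	first_name_2	address_2	city_2	state_2	zip_2	url_2	phone_2	category_2
Lori	680 Buell Crossing	Dallas	Texas	75277	url_shortened	214-533-2179	Granite Surfaces	Augustine	7 Schiller Crossing	Lubbock	Texas	79410	url_shortened	806-729-7419	Roofing (Metal)
Dmitri	05 Coolidge Way	Charleston	West Virginia	25356	url_shortened	304-906-6384	Structural and Misc Steel (Fabrication)	Kearney	0547 Clemons Plaza	Peoria	Illinois	61651	url_shortened	309-326-4252	Framing (Steel)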

and has shape 1024000 rows × 16 columns:

import difflib
import Levenshtein
import numpy as np
import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8) # Customize based on # of cores, or leave blank to use all

def dists(x, y):
    matcher = difflib.SequenceMatcher(None, x, y)
    geshs = matcher.ratio()
    levens = Levenshtein.distance(x, y)
    jaros = Levenshtein.jaro_winkler(x, y)
    # Longest matching block as a Match(a, b, size) tuple; use .size for its length.
    lcss = matcher.find_longest_match(0, len(x), 0, len(y)) # I wasn't sure how you'd done this one.
    return [geshs, levens, jaros, lcss]


df = pd.read_csv('MOCK_DATA.csv')
df = df.astype(str) # force all fields to strings.

cols = df.columns
cols = np.array_split(cols, 2) # assumes there's a matching `_2` column for every column.
for x, y in zip(*cols):
    res = df.parallel_apply(lambda z: dists(z[x], z[y]), axis=1, result_type='expand')
    # Assign column by column: unpacking `res` directly would iterate over its labels 0..3.
    df[x + '_geshs'] = res[0]
    df[x + '_levens'] = res[1]
    df[x + '_jaros'] = res[2]
    df[x + '_lcss'] = res[3]
    # Replace parallel_apply with apply to run non-parallel.
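
The comment on `lcss` leaves the longest-common-substring step open. Here is a minimal sketch of one way to fill it in, assuming the goal is the longest contiguous run of characters shared by the two strings. The names `longest_common_substring` and `lcs_length` are hypothetical helpers, not part of difflib or of the question's code, and the sample strings are made up for illustration:

```python
import difflib


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b."""
    # autojunk=False turns off difflib's popularity heuristic, which can
    # otherwise ignore frequent characters in sequences of 200+ items.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common substring; 0 when nothing is shared."""
    return len(longest_common_substring(a, b))


# Illustrative inputs only.
print(repr(longest_common_substring("12 Main Crossing", "7 Oak Crossing")))
print(lcs_length("Dallas", "Lubbock"))
```

Returning the length rather than the `Match` tuple keeps the resulting column numeric, which is usually easier to use as a feature.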

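(In addition to keeping the original columns) I get these columns in 3 minutes. Without parallelization it would probably still take only ~20-30 minutes. Python's peak memory usage was only about 3GB, and would be much lower without parallelization.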

first_name_geshs	first_name_levens	first_name_jaros	first_name_lcss	address_geshs	address_levens	address_jaros	address_lcss	city_geshs	city_levens	city_jaros	city_lcss	state_geshs	state_levens	state_jaros	state_lcss	zip_geshs	zip_levens	zip_jaros	zip_lcss	url_geshs	url_levens	url_jaros	url_lcss	phone_geshs	phone_levens	phone_jaros	phone_lcss	category_geshs	category_levens	category_jaros	category_lcss
0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3
0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3	0	1	2	3
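
Columns such as state, city, or zip presumably repeat the same pair of values across many rows, so some of the remaining run time may go to recomputing identical results. Here is a minimal sketch of caching per-pair results, using only difflib to keep it short. `cached_ratio`, `column_ratios`, and the sample lists are hypothetical stand-ins (with pandas the inputs would be `df[x]` and `df[y]`), and whether this helps at all depends on how few distinct pairs a column has:

```python
import difflib
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_ratio(a: str, b: str) -> float:
    """Similarity ratio for one pair; repeated pairs are served from the cache."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def column_ratios(left, right):
    """Return one ratio per row for two equally long sequences of values."""
    return [cached_ratio(str(a), str(b)) for a, b in zip(left, right)]


# Stand-ins for df['state'] and df['state_2'].
states = ["Texas", "West Virginia", "Texas"]
states_2 = ["Texas", "Illinois", "Texas"]

scores = column_ratios(states, states_2)
print(scores)
print(cached_ratio.cache_info())
```

The cache here is unbounded, so for columns with mostly unique values (names, addresses) it would only add memory overhead; a bounded `maxsize`, or skipping the cache for those columns, would be the safer choice.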

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/467797.html
