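I am testing a simple search mechanism that tolerates small typos and misspellings, similar to an autocorrect feature.
I have been struggling with this, so I am writing a function (PL/pgSQL) to handle it. I am running it on Supabase (supabase.io), PostgreSQL 13.3 (similar to RDS).
I want to:
- limit the returned results to highly similar email addresses, say similarity > 0.7;
- use an index, because the real list of email addresses will be in the tens of millions, so the query has to return in under a second.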
DROP TABLE IF EXISTS email;
CREATE TABLE email (
email_address TEXT NOT NULL UNIQUE,
person_id UUID NOT NULL,
CONSTRAINT email_pk PRIMARY KEY (email_address)
);
DROP INDEX IF EXISTS email_address_trigram_idx;
CREATE INDEX email_address_trigram_idx ON email USING gin(email_address gin_trgm_ops);
INSERT INTO email(email_address, person_id) VALUES
('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4());
SET pg_trgm.similarity_threshold = 0.8; -- This doesn't seem to affect my queries
SELECT *, similarity('[email protected]', email_address)
FROM email
WHERE email_address % '[email protected]';
I want a way to search quickly while still tolerating small typos in the search term.
Answer:
First, your table definition creates two unique indexes on (email_address). Don't. Drop the UNIQUE constraint and keep the PK:
CREATE TABLE email (
email_address text PRIMARY KEY
, person_id uuid NOT NULL -- bigint?
);
(I am also not sure why you would need a uuid for person_id. There are hardly enough people in the world to justify more than a bigint.)
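If the table already exists with both constraints, the redundant one can be dropped in place rather than recreating the table. The following is a minimal sketch. The constraint name email_email_address_key is an assumption based on PostgreSQL's usual naming for an unnamed inline UNIQUE constraint (table, column, then "key"), so check the catalog first.

```sql
-- List the constraints on the table to confirm the actual names.
SELECT conname, contype
FROM   pg_constraint
WHERE  conrelid = 'email'::regclass;

-- Assumed name: email_email_address_key (default for an unnamed UNIQUE
-- constraint on email.email_address). Adjust if the query above shows otherwise.
-- Dropping the constraint also drops its backing index; the primary key stays.
ALTER TABLE email DROP CONSTRAINT IF EXISTS email_email_address_key;
```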
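Next, since you want to ...
limit the returned results to highly similar email addresses only,
I suggest a nearest-neighbor search. Create a GiST index for this purpose instead of a GIN index: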
CREATE INDEX email_address_trigram_gist_idx ON email USING gist (email_address gist_trgm_ops);
And use a query like this:
SELECT *, similarity('[email protected]', email_address)
FROM email
WHERE email_address % '[email protected]'
ORDER BY email_address <-> '[email protected]' -- note the use of the operator <->
LIMIT 10;
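To check whether the planner actually uses the GiST index for this query, it can be run under EXPLAIN. This is only a sketch: the search term 'jon.doe@example.com' is a made-up placeholder, and on a table with only a few dozen rows the planner may well choose a sequential scan, so the plan is only meaningful against a realistically sized table.

```sql
-- pg_trgm provides gist_trgm_ops, the % and <-> operators, and similarity().
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Placeholder search term (hypothetical); substitute a real input.
EXPLAIN (ANALYZE, BUFFERS)
SELECT email_address
     , similarity('jon.doe@example.com', email_address) AS sim
FROM   email
WHERE  email_address % 'jon.doe@example.com'
ORDER  BY email_address <-> 'jon.doe@example.com'
LIMIT  10;
```

On a large table, an index scan on email_address_trigram_gist_idx with the <-> expression as its ordering is the plan to look for.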
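Quoting the manual:
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.
With a small LIMIT, you probably don't need to set pg_trgm.similarity_threshold very high, because this query returns the best matches first.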
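Since the question mentions wrapping the search in a PL/pgSQL function, the query above could be packaged roughly as follows. This is a sketch under assumptions, not a definitive implementation: the function name find_similar_emails and its parameters _query and _limit are hypothetical, and the threshold 0.7 is taken from the requirement stated in the question. The SET clause applies the threshold only for the duration of each call.

```sql
-- Hypothetical wrapper around the nearest-neighbor query.
CREATE OR REPLACE FUNCTION find_similar_emails(_query text, _limit int DEFAULT 10)
  RETURNS TABLE (email_address text, person_id uuid, sim real)
  LANGUAGE plpgsql STABLE
  SET pg_trgm.similarity_threshold = 0.7  -- in effect only while the function runs
AS
$func$
BEGIN
   RETURN QUERY
   SELECT e.email_address
        , e.person_id
        , similarity(_query, e.email_address)  -- reported for inspection
   FROM   email e
   WHERE  e.email_address % _query             -- filtered by the threshold above
   ORDER  BY e.email_address <-> _query        -- nearest neighbors first
   LIMIT  _limit;
END
$func$;

-- Usage with a placeholder search term (hypothetical):
SELECT * FROM find_similar_emails('jon.doe@example.com');
```

Column references are table-qualified (e.email_address) because the output column names in RETURNS TABLE are visible inside the function body and would otherwise be ambiguous.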
Related:
- Search in 300 million addresses with pg_trgm
- Finding similar strings with PostgreSQL quickly
- PostgreSQL GIN index slower than GIST for pg_trgm?
