Fast search with autocorrect functionality (GIN index and pg_trgm extension)

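I am testing a simple search mechanism that tolerates small typos and misspellings, similar to an autocorrect mechanism.

I have been struggling with this. I am writing a function (PL/pgSQL) to handle it, and I am running it on SUPABASE.IO, PostgreSQL 13.3 (similar to RDS).

I want to:

Limit the returned results to highly similar email addresses, say similarity > 0.7;
Use an index, because the real list of email addresses will be on the order of tens of millions, so the query must return within a second.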

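-- Prerequisites: pg_trgm provides gin_trgm_ops, % and similarity(); uuid-ossp provides uuid_generate_v4()
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

DROP TABLE IF EXISTS email;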
CREATE TABLE email (
  email_address TEXT NOT NULL UNIQUE,
  person_id UUID NOT NULL, 
  CONSTRAINT email_pk PRIMARY KEY (email_address)
);

DROP INDEX IF EXISTS email_address_trigram_idx;
CREATE INDEX email_address_trigram_idx ON email USING gin(email_address gin_trgm_ops);

INSERT INTO email(email_address, person_id) VALUES
  ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4());

SET pg_trgm.similarity_threshold = 0.8; -- This doesn't seem to affect my queries

SELECT *, similarity('[email protected]', email_address)
FROM email
WHERE email_address % '[email protected]';

I would like a fast way to search that still tolerates some small typos in the search term.

Answer from a uj5u.com user:

First, your table definition creates two unique indexes on (email_address). Don't. Drop the UNIQUE constraint and keep the PK:

CREATE TABLE email (
  email_address text PRIMARY KEY
, person_id uuid NOT NULL  -- bigint?
);

(Also, I am not sure why you would need a uuid for person_id. There are hardly enough people in the world to justify more than a bigint.)
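
If the table already exists and holds data, the redundant index can be removed without recreating the table. A minimal sketch, assuming the UNIQUE constraint received PostgreSQL's default name email_email_address_key (the first query lists the actual names so the assumption can be checked):

```sql
-- List the constraints on the table: contype 'p' is the primary key, 'u' a UNIQUE constraint
SELECT conname, contype
FROM   pg_constraint
WHERE  conrelid = 'email'::regclass;

-- Assumption: the UNIQUE constraint has the default name email_email_address_key.
-- Dropping the constraint also drops the index backing it; the PK index stays in place.
ALTER TABLE email DROP CONSTRAINT IF EXISTS email_email_address_key;
```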

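Next, since you want to ...

Limit the returned results to only highly similar email addresses,

I suggest a nearest-neighbor search. Create a GiST index for this purpose instead of GIN: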

CREATE INDEX email_address_trigram_gist_idx ON email USING gist (email_address gist_trgm_ops);

And use a query like this:

SELECT *, similarity('[email protected]', email_address)
FROM   email
WHERE  email_address % '[email protected]'
ORDER  BY email_address <-> '[email protected]'  -- note the use of the operator <->
LIMIT  10;
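
Since the question mentions wrapping the search in a function, the query might be packaged roughly as follows. This is only a sketch: the function name search_email and its parameters _query and _limit are hypothetical, and it assumes the email table and the GiST index defined above. A plain SQL function is used here instead of PL/pgSQL because the body is a single query.

```sql
-- Hypothetical wrapper around the nearest-neighbor query
CREATE OR REPLACE FUNCTION search_email(_query text, _limit int DEFAULT 10)
  RETURNS TABLE (email_address text, person_id uuid, sim real)
  LANGUAGE sql STABLE AS
$func$
SELECT e.email_address
     , e.person_id
     , similarity(_query, e.email_address)   -- similarity score between 0 and 1
FROM   email e
WHERE  e.email_address % _query              -- filtered by pg_trgm.similarity_threshold
ORDER  BY e.email_address <-> _query         -- closest matches first
LIMIT  _limit;
$func$;

-- Example call with a made-up, misspelled address
SELECT * FROM search_email('jon.doe@exampel.com', 5);
```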

Quoting the manual:

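This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.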
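
One way to check whether the planner actually uses the GiST index for the ordering is to look at the query plan. A sketch, using a made-up search term; the plan to hope for is an index scan on email_address_trigram_gist_idx with an "Order By" condition on the <-> operator. On a table with only a handful of rows the planner may well prefer a sequential scan, so the test is only meaningful with a realistic amount of data.

```sql
-- ANALYZE executes the query and reports real timings; BUFFERS adds page access counts
EXPLAIN (ANALYZE, BUFFERS)
SELECT email_address, similarity('jon.doe@exampel.com', email_address) AS sim
FROM   email
WHERE  email_address % 'jon.doe@exampel.com'
ORDER  BY email_address <-> 'jon.doe@exampel.com'
LIMIT  10;
```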

With a small LIMIT, there may be no need to set pg_trgm.similarity_threshold very high, since this query gives you the best matches first.
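
If a hard cutoff such as the 0.7 from the question is still wanted, the threshold can be set for a single transaction only. A sketch with a made-up search term; the % operator keeps rows whose similarity is at or above the current threshold, and SET LOCAL reverts at the end of the transaction, so other queries in the session are not affected.

```sql
BEGIN;

-- Applies only until COMMIT or ROLLBACK
SET LOCAL pg_trgm.similarity_threshold = 0.7;
SHOW pg_trgm.similarity_threshold;   -- confirm the value currently in effect

SELECT email_address, similarity('jon.doe@exampel.com', email_address) AS sim
FROM   email
WHERE  email_address % 'jon.doe@exampel.com'   -- cutoff taken from the threshold set above
ORDER  BY email_address <-> 'jon.doe@exampel.com'
LIMIT  10;

COMMIT;
```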

Related:

Search in 300 million addresses with pg_trgm
Finding similar strings with PostgreSQL quickly
PostgreSQL GIN index slower than GIST for pg_trgm?
