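I want to create an index on a text column for the following use case. We have a Segment table with a content column of type text. We use pg_trgm to run similarity-based queries. This is used in a translation editor to find similar strings. Here are the table details: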
CREATE TABLE public.segments
(
    id integer NOT NULL DEFAULT nextval('segments_id_seq'::regclass),
    language_id integer NOT NULL,
    content text NOT NULL,
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL,
    CONSTRAINT segments_pkey PRIMARY KEY (id),
    CONSTRAINT segments_language_id_fkey FOREIGN KEY (language_id)
        REFERENCES public.languages (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE CASCADE,
    CONSTRAINT segments_content_language_id_key UNIQUE (content, language_id)
)
Here is the query (Ruby Hanami):
def find_by_segment_match(source_text_for_lookup, source_lang, sim_score)
  aggregate(:translation_records)
    .where(language_id: source_lang)
    .where { similarity(:content, source_text_for_lookup) > sim_score/100.00 }
    .select_append { float::similarity(:content, source_text_for_lookup).as(:similarity) }
    .order { similarity(:content, source_text_for_lookup).desc }
end
---Edit---
Here are the queries:
SELECT "id", "language_id", "content", "created_at", "updated_at",
       SIMILARITY("content", 'This will not work.') AS "similarity"
FROM "segments"
WHERE (("language_id" = 2) AND (similarity("content", 'This will not work.') > 0.45))
ORDER BY SIMILARITY("content", 'This will not work.') DESC

SELECT "translation_records"."id", "translation_records"."source_segment_id",
       "translation_records"."target_segment_id", "translation_records"."domain_id",
       "translation_records"."style_id", "translation_records"."created_by",
       "translation_records"."updated_by", "translation_records"."project_name",
       "translation_records"."created_at", "translation_records"."updated_at",
       "translation_records"."language_combination", "translation_records"."uid",
       "translation_records"."import_comment"
FROM "translation_records"
INNER JOIN "segments" ON ("segments"."id" = "translation_records"."source_segment_id")
WHERE ("translation_records"."source_segment_id" IN (27548))
ORDER BY "translation_records"."id"
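---End edit---
---Edit 1---
What about reindexing? Initially we will import about 2 million legacy records. When and how often should we rebuild the index, if at all?
---End edit 1---
Would something like CREATE INDEX ON segment USING gist (content) be OK? I can't really work out which of the available index types best suits our use case.
Best, Seba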
Answer from a uj5u.com user:
-- The table in the question is named segments, so the indexes target that name.
CREATE INDEX segment_language_id_idx ON segments USING btree (language_id);
CREATE INDEX segment_content_gin ON segments USING gin (content gin_trgm_ops);
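Answer from a uj5u.com user:
The second query you show appears to be unrelated to this question.
Your first query cannot use a trigram index, because the query has to be written in operator form rather than function form for that to happen.
In operator form, it would look like this: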
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity"
FROM segments
WHERE language_id = 2 AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';
For % to be equivalent to similarity("content", 'This will not work.') > 0.45, you first need to run SET pg_trgm.similarity_threshold TO 0.45;
Now, how you get Ruby/Hanami to generate this form, I don't know.
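One possible way to get the operator form from Ruby is to bypass the query DSL and send raw SQL with bind parameters. The sketch below only builds the SQL text and the parameter lists; similar_segments_query is a hypothetical helper, not part of Hanami or ROM, and it assumes both statements are run in the same transaction. It sets the threshold with PostgreSQL's set_config function instead of SET, because SET does not accept bind parameters.

```ruby
# Hypothetical helper: returns two [sql, params] pairs for the operator-form lookup.
# Assumes pg_trgm is installed and both statements run in one transaction, so the
# transaction-local threshold set by set_config(..., true) is still in effect.
def similar_segments_query(text, language_id, sim_score)
  threshold = (sim_score / 100.0).to_s

  set_threshold = [
    "SELECT set_config('pg_trgm.similarity_threshold', $1, true)",
    [threshold]
  ]

  lookup = [
    <<~SQL,
      SELECT id, language_id, content, created_at, updated_at,
             similarity(content, $1) AS similarity
      FROM segments
      WHERE language_id = $2 AND content % $1
      ORDER BY content <-> $1
    SQL
    [text, language_id]
  ]

  [set_threshold, lookup]
end

statements = similar_segments_query('This will not work.', 2, 45)
statements.each { |sql, params| puts sql, params.inspect }
```

With the pg gem, each pair could be passed to exec_params inside a conn.transaction block. Whether the planner then actually uses a trigram index is something to confirm with EXPLAIN on your own data.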
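The % operator can be supported by either a gin_trgm_ops index or a gist_trgm_ops index. The <-> operator can only be supported by gist_trgm_ops. But it is hard to predict how efficient that support will be. If your "content" column is long, or the text you are comparing against is long, it is unlikely to be very efficient, especially in the case of GiST.
Ideally you would partition the table by language_id. Failing that, building a multicolumn index on both columns might help.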
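As a rough sketch of the multicolumn option, the Ruby helper below generates the DDL for a two-column trigram index. The helper name and the index name are made up for illustration. It relies on the btree_gin or btree_gist extension, which is what lets an integer column such as language_id sit in a GIN or GiST index next to the trigram operator class.

```ruby
# Hypothetical helper: DDL statements for a two-column trigram index on segments.
# An integer column needs btree_gin (for GIN) or btree_gist (for GiST) to share
# an index with the trigram operator class on content.
def multicolumn_trgm_index_ddl(method)
  raise ArgumentError, 'method must be :gin or :gist' unless %i[gin gist].include?(method)

  [
    'CREATE EXTENSION IF NOT EXISTS pg_trgm',
    "CREATE EXTENSION IF NOT EXISTS btree_#{method}",
    "CREATE INDEX segments_language_id_content_#{method}_idx " \
      "ON segments USING #{method} (language_id, content #{method}_trgm_ops)"
  ]
end

puts multicolumn_trgm_index_ddl(:gist)
```

The :gist variant is the one that could also serve the ORDER BY content <-> ... clause; the :gin variant only helps with the % filter.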
