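The More Like This query finds documents that are "similar" to a given set of documents. To do so, it selects a set of representative terms from these input documents, forms a query from those terms, executes the query, and returns the results. The user controls the input documents, how the terms should be selected, and how the query is formed.
The simplest use case consists of asking for documents that are similar to a provided piece of text. Here, we ask for all movies that have text similar to "Once upon a time" in their "title" and "description" fields, limiting the number of selected terms to 12.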
GET /_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "description"],
      "like": "Once upon a time",
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}
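A more complicated use case consists of mixing text with documents that already exist in the index. In this case, the syntax for specifying a document is similar to the one used in the Multi GET API.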
GET /_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "description"],
      "like": [
        {
          "_index": "imdb",
          "_id": "1"
        },
        {
          "_index": "imdb",
          "_id": "2"
        },
        "and potentially some more text here as well"
      ],
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}
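Finally, users can mix some text and a chosen set of documents, and can also provide documents that are not necessarily present in the index. To provide documents that are not in the index, the syntax is similar to that of artificial documents.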
GET /_search
{
  "query": {
    "more_like_this": {
      "fields": ["name.first", "name.last"],
      "like": [
        {
          "_index": "marvel",
          "doc": {
            "name": {
              "first": "Ben",
              "last": "Grimm"
            },
            "_doc": "You got no idea what I'd... what I'd give to be invisible."
          }
        },
        {
          "_index": "marvel",
          "_id": "2"
        }
      ],
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}
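The three kinds of "like" entries shown above (free text, a reference to an indexed document, and an artificial document) can be assembled programmatically. The following is a minimal Python sketch that only builds the request body and never contacts a cluster. The helper names indexed_doc and artificial_doc are my own and are not part of any Elasticsearch client.

```python
import json


def indexed_doc(index, doc_id):
    """Reference a document that already exists in an index (hypothetical helper)."""
    return {"_index": index, "_id": str(doc_id)}


def artificial_doc(index, doc):
    """Describe a document that is not in the index (hypothetical helper)."""
    return {"_index": index, "doc": dict(doc)}


# Mix all three kinds of input in a single "like" array.
like_items = [
    artificial_doc("marvel", {"name": {"first": "Ben", "last": "Grimm"}}),
    indexed_doc("marvel", 2),
    "and potentially some more text here as well",
]

body = {
    "query": {
        "more_like_this": {
            "fields": ["name.first", "name.last"],
            "like": like_items,
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
}

# The printed JSON can be pasted after "GET /_search" in the Kibana console.
print(json.dumps(body, indent=2))
```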
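Hands-on
Below, I will use a simple example to show how to use the more_like_this query to find similar documents. Although this query is a very interesting feature, many developers may never choose it: on the one hand, they may not understand it well; on the other, they may prefer traditional queries such as match, term, and range. I hope that after this introduction you will consider the more_like_this query in your future work whenever your use case calls for it.
Preparing the data
For this demonstration, we will use the movies dataset.
Click the Download link above to download the dataset, and save it in the data subdirectory of the project.
Then, download the full source code from https://github.com/liu-xiao-guo/searchflix and copy out the following files:
- copy pipeline/movies.conf into the project's root directory
- copy elastic/elasticsearch/mappings/movies.mapping into the project's root directory
- copy dictionaries/countries_geo.csv into the dictionaries subdirectory
After these steps, the files look like this: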
$ pwd
/Users/liuxg/data/morelikethis
$ ls
data dictionaries movies.conf movies.mapping
$ tree -L 3
.
├── data
│ └── movies_metadata.csv
├── dictionaries
│ ├── countries_geo.csv
│ └── source.txt
├── movies.conf
└── movies.mapping
From the project's root directory (where movies.mapping was copied), run the following command in a terminal:
curl -XPUT -H 'Content-Type: application/json' localhost:9200/movies -d@movies.mapping
Next, we import the data:
sudo <path_to_logstash_unzipped>/bin/logstash -f movies.conf
Here we have to use sudo, because movies.conf makes use of /dev/null. Once the import has finished, we can check the imported documents in Kibana:
GET movies/_count
{
"count" : 45432,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
A typical document in the movies index looks like this:
GET movies/_search
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "12110",
"_score" : 1.0,
"_source" : {
"original_title" : "Dracula: Dead and Loving It",
"adult" : false,
"vote_average" : 5.7,
"genres" : [
{
"id" : 35,
"name" : "Comedy"
},
{
"id" : 27,
"name" : "Horror"
}
],
"tagline" : null,
"production_companies" : [
{
"id" : 5,
"name" : "Columbia Pictures"
},
{
"id" : 97,
"name" : "Castle Rock Entertainment"
},
{
"id" : 6368,
"name" : "Enigma Pictures"
}
],
"imdb_id" : "tt0112896",
"spoken_languages" : [
{
"iso_639_1" : "en",
"name" : "English"
},
{
"iso_639_1" : "de",
"name" : "Deutsch"
}
],
"production_countries_name_list" : [
"France",
"United States of America"
],
"@version" : "1",
"title" : "Dracula: Dead and Loving It",
"homepage" : null,
"original_language" : "en",
"belongs_to_collection" : null,
"production_countries_location_list" : [
"46.227638,2.213749",
"37.09024,-95.712891"
],
"popularity" : 5.430331,
"budget" : 0.0,
"revenue" : 0.0,
"production_countries" : [
{
"iso_3166_1" : "FR",
"location" : "46.227638,2.213749",
"name" : "France"
},
{
"iso_3166_1" : "US",
"location" : "37.09024,-95.712891",
"name" : "United States of America"
}
],
"release_date" : "1995-12-22",
"poster_path" : "/xve4cgfYItnOhtzLYoTwTVy5FGr.jpg",
"@timestamp" : "1995-12-21T16:00:00.000Z",
"id" : 12110,
"runtime" : 88.0,
"status" : "Released",
"genres_list" : [
"Comedy",
"Horror"
],
"overview" : "When a lawyer shows up at the vampire's doorstep, he falls prey to his charms and joins him in his search for fresh blood. Enter Dr. van Helsing, who may be the only one able to vanquish the count.",
"vote_count" : 210,
"video" : "false"
}
}
...
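Above, we can see that there is a field named overview.
The More Like This query
The purpose of the more_like_this query is to find, among the indexed documents, those that are similar to some entries supplied by the user. It does this by selecting relevant terms from the supplied entries and then building a query from those terms.
The supplied entries can be free text or other indexed documents. In other words, you can easily search for documents similar to any document already indexed in the same index or in another one. That said, I want to demonstrate the query with one use case: offering the user synopses of movies similar to one they selected or just watched.
The only required parameter of this query is like: you must supply either the text for which similar documents should be found, or an array of objects, each giving the index and ID of a document to search by. In the second case, existing indexed documents can also be mixed with artificial documents, that is, documents simulated from free text. Here is an example: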
GET movies/_search
{
  "fields": ["overview"],
  "query": {
    "more_like_this": {
      "fields": [
        "overview"
      ],
      "like": "Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice in the Empire.",
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  },
  "_source": false
}
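Usually, although it is not mandatory, you will also supply the fields parameter, an array with the names of the fields in which similarity will be checked. Another interesting parameter is unlike, which is used together with like (they are not mutually exclusive) and follows the same syntax. It reduces the number of results by excluding documents similar to the ones we say we do not want: essentially (like X) AND (unlike Y).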
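As a rough illustration of combining like and unlike, here is a small Python sketch that builds such a request body. The two text snippets are examples I made up for illustration, and the function name mlt_like_unlike is hypothetical.

```python
import json


def mlt_like_unlike(fields, like, unlike, min_term_freq=1, max_query_terms=12):
    """Build a body asking for documents like `like` but not like `unlike` (hypothetical helper)."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": like,
                "unlike": unlike,  # same syntax as "like": text, document references, or both
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


# Illustrative inputs only: favour shark stories, steer away from robot stories.
body_unlike = mlt_like_unlike(
    ["overview"],
    like="A great white shark terrorizes a small island town",
    unlike="A mechanical robot shark built by the government",
)
print(json.dumps(body_unlike, indent=2))
```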
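The remaining parameters of this query fall into two groups.
Term selection parameters
- max_query_terms: the maximum number of terms to select. The more terms we have, the higher the accuracy, but at the expense of performance. Defaults to 25.
- min_term_freq: the minimum term frequency, below which terms in the input document/text are ignored. Defaults to 2.
- min_doc_freq: the minimum document frequency, below which terms from the input document are ignored. Defaults to 5.
- max_doc_freq: the maximum document frequency, above which terms from the input document are ignored. This is useful for ignoring very frequent terms such as stop words. By default it is unbounded (disabled).
- min_word_length: the minimum term length, below which terms are ignored. Defaults to 0.
- max_word_length: the maximum term length, above which terms are ignored. The old name max_word_len is deprecated. By default it is disabled (0).
- stop_words: an array of stop words; any term in this set is ignored.
- analyzer: the analyzer used on the input text. By default, it is the analyzer associated with the first field listed in the fields parameter.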
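To make the effect of these parameters more concrete, here is a deliberately simplified Python sketch of term selection. It is not how Lucene implements it: the tokenizer is a crude regular expression standing in for an analyzer, the TF-IDF formula is a common textbook variant, and the document frequencies in the example are invented numbers. The function name select_terms is hypothetical.

```python
import math
import re
from collections import Counter


def select_terms(text, doc_freq, num_docs, *, max_query_terms=25, min_term_freq=2,
                 min_doc_freq=5, max_doc_freq=None, min_word_length=0,
                 max_word_length=0, stop_words=()):
    """Pick representative terms from `text`, loosely following the parameters above."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # stand-in for a real analyzer
    term_freq = Counter(tokens)
    stop = set(stop_words)
    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq or df < min_doc_freq:
            continue  # too rare in the input text or in the index
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common in the index (None means unbounded)
        if len(term) < min_word_length:
            continue
        if max_word_length and len(term) > max_word_length:
            continue  # 0 means no upper limit
        if term in stop:
            continue
        idf = math.log(1 + num_docs / (df + 1))  # textbook-style IDF, an assumption
        scored.append((tf * idf, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [term for _, term in scored[:max_query_terms]]


# Invented document frequencies for a 45432-document index.
doc_freq = {"the": 45000, "of": 44000, "shark": 40, "beast": 120, "island": 300}
terms = select_terms("The shark of the island, the beast", doc_freq, 45432,
                     min_term_freq=1, max_doc_freq=10000, max_query_terms=2)
print(terms)  # ['shark', 'beast']
```

Here "the" and "of" are dropped by max_doc_freq, and "island" is cut off by max_query_terms because it has the lowest score of the three remaining terms.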
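Query formation parameters
- minimum_should_match: controls how many of the selected terms must match. It uses the same syntax as the minimum_should_match parameter of other queries. Defaults to "30%".
- fail_on_unsupported_field: controls whether the query should fail if any of the specified fields is not of a supported type (text or keyword). Defaults to true.
- boost_terms: each term in the formed query can be further boosted by its TF-IDF score. This is disabled by default (0); any positive value activates it.
- include: defines whether the input documents should also be returned in the results. Defaults to false.
- boost: sets the boost value of the whole query. Defaults to 1.0.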
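To see what the default minimum_should_match of "30%" means in practice, the sketch below computes how many of the selected terms a document must contain, and shows roughly what a disjunctive query over those terms could look like. It covers only simple non-negative percentages (the computed number is rounded down); negative values and combined expressions are not handled. The function names and the term list are illustrative assumptions, and the bool query is an approximation of what more_like_this builds internally, not its exact output.

```python
def required_matches(num_terms, percent=30):
    """Selected terms that must match for a simple percentage (rounded down)."""
    return num_terms * percent // 100


def as_bool_query(field, terms, percent=30):
    """Approximate the formed query as a bool query of optional term clauses."""
    return {
        "bool": {
            "should": [{"term": {field: term}} for term in terms],
            "minimum_should_match": f"{percent}%",
        }
    }


print(required_matches(12))  # 3: with 12 selected terms, at least 3 must match
print(required_matches(25))  # 7: with the default 25 terms, at least 7 must match

approx = as_bool_query("overview", ["shark", "beast", "island"])
```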
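In practice
Returning to the use case mentioned earlier (finding similar movies to recommend to the user), let's run a few experiments.
In the example below, I will use the synopsis of the movie "Jaws" and try to find similar movies: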
GET movies/_search
{
  "size": 5,
  "_source": [
    "title",
    "overview"
  ],
  "query": {
    "more_like_this": {
      "fields": [
        "overview"
      ],
      "min_term_freq": 1,
      "max_query_terms": 12,
      "like": "An insatiable great white shark terrorizes the townspeople of Amity Island, The police chief, an oceanographer and a grizzled shark hunter seek to destroy the bloodthirsty beast."
    }
  }
}
Here are the top 5 results:
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "578",
"_score" : 100.41875,
"_source" : {
"overview" : "An insatiable great white shark terrorizes the townspeople of Amity Island, The police chief, an oceanographer and a grizzled shark hunter seek to destroy the bloodthirsty beast.",
"title" : "Jaws"
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "52454",
"_score" : 19.65117,
"_source" : {
"overview" : "When the prehistoric warm-water beast the Crocosaurus crosses paths with that cold-water monster the Mega Shark, all hell breaks loose in the oceans as the world's top scientists explore every option to halt the aquatic frenzy. Swallowing everything in their paths -- including a submarine or two -- Croc and Mega lead an explorer and an oceanographer on a wild chase. Eventually, the desperate men turn to a volcano for aid.",
"title" : "Mega Shark vs. Crocosaurus"
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "246594",
"_score" : 18.51667,
"_source" : {
"overview" : "When another Mega Shark returns from the depths of the sea, world militaries go on high alert. Ocean traffic grinds to a standstill as everyone lives in fear of the insatiable beast. Out of options, the US government unleashes the top secret Mecha Shark project -- a mechanical shark built to have the same exact characteristics as Mega. A pair of scientists pilot the mechanical creature as they fight Mega in a pitched battle to save the planet. But when faulty mechanics cause the Mecha to go after humans, the scientists must somehow guide Mega to Mecha in hopes that the two titans will kill each other - or risk untold worldwide destruction.",
"title" : "Mega Shark vs. Mecha Shark"
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "43084",
"_score" : 14.461939,
"_source" : {
"overview" : "Wealthy big game hunter, Wilson Frields, funds an expedition going deep into the Florida Everglades to search for the Calusa: a lost tribe of Native Americans. When the team discover the gruesome remains of another expedition, Friels admits he is searching for the Calusa's Fountain of Youth and its guardian, a mythical and deadly beast. As they delve deeper into the Everglades, the bloodthirsty beast begins to stalk and kill members of the group and, in one struggle, their leader Brinson Thomas is injured and begins to metamorphose into a creature himself. His only hope: to drink from the waters of the Fountain. The terrible truth behind the Calusa must be discovered if any of them are going to get out of there alive!",
"title" : "Deadly Species"
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "385232",
"_score" : 11.46097,
"_source" : {
"overview" : "When the powerful wizard, Lord Tensley, is jilted by Princess Ennogard, he vows to rid the land of love. He commands his fire-breathing dragon to destroy any sign of affection seen throughout the kingdom. As the death toll rises, Camilan, a brave but arrogant warrior seeks to marry his true love despite the curse upon the land. In order to fulfill his destiny, he seeks the help of his estranged brother Ramicus, a bounty hunter with no desire to get involved. It takes an enchanted distress message and the promise of great reward from the beautiful Princess Ennogard, to lure Ramicus into the quest to defeat the wizard and his terrible beast.",
"title" : "Dudes & Dragons"
}
}
]
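Note that the first result is "Jaws" itself. This is because I did not run the query by pointing at the indexed document for that movie (had I done so, the document itself would not have been returned, since the include parameter defaults to false). Instead, in the like parameter I supplied the synopsis exactly as it appears in the indexed document, and no document in the index can be more similar to that text than the document it came from.
As for the other results, it is no surprise that the closest ones are about sharks, since that is certainly a relevant term in the supplied text. The lower-scoring ones match on other terms from the synopsis, such as "beast" and "hunter".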
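If the free text is known to come from an indexed document, one possible way to keep that document out of the results is to wrap the more_like_this query in a bool query and exclude the document's ID with an ids query. The Python sketch below only builds such a body; the function name is hypothetical, and the ID 578 is the _id of "Jaws" seen in the results above.

```python
import json


def mlt_excluding_ids(fields, like_text, exclude_ids, min_term_freq=1, max_query_terms=12):
    """Build a more_like_this search that filters out known document IDs (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "more_like_this": {
                            "fields": list(fields),
                            "like": like_text,
                            "min_term_freq": min_term_freq,
                            "max_query_terms": max_query_terms,
                        }
                    }
                ],
                # Drop the document the text was taken from.
                "must_not": [{"ids": {"values": [str(i) for i in exclude_ids]}}],
            }
        }
    }


jaws_overview = (
    "An insatiable great white shark terrorizes the townspeople of Amity Island, "
    "The police chief, an oceanographer and a grizzled shark hunter seek to destroy "
    "the bloodthirsty beast."
)
body_no_self = mlt_excluding_ids(["overview"], jaws_overview, [578])
print(json.dumps(body_no_self, indent=2))
```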
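Now let's try supplying a document that represents the movie the user has just watched. First, we look the movie up by title: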
GET movies/_search
{
  "fields": [
    "overview"
  ],
  "query": {
    "match": {
      "title": "rocky"
    }
  },
  "_source": false
}
The search above returns the following results:
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1366",
"_score" : 11.072304,
"fields" : {
"overview" : [
"When world heavyweight boxing champion, Apollo Creed wants to give an unknown fighter a shot at the title as a publicity stunt, his handlers choose palooka Rocky Balboa, an uneducated collector for a Philadelphia loan shark. Rocky teams up with trainer Mickey Goldmill to make the most of this once in a lifetime break."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1371",
"_score" : 9.325746,
"fields" : {
"overview" : [
"Now the world champion, Rocky Balboa is living in luxury and only fighting opponents who pose no threat to him in the ring. His lifestyle of wealth and idleness is shaken when a powerful young fighter known as Clubber Lang challenges him to a bout. After taking a pounding from Lang, the humbled champ turns to former bitter rival Apollo Creed to help him regain his form for a rematch with Lang."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "41288",
"_score" : 9.325746,
"fields" : {
"overview" : [
"""Step into the ring with one of America's greatest legends...and stand a couple of rounds with greatness! "Pulling no punches" (LA Daily News), Jon Favreau (Swingers) and Oscar?(r) winner* George C. Scott give TKO performances in this outstanding biography of the only undefeated world heavyweight champion in the history of boxing! In the small blue-collar town of Brockton, Massachusetts, young Rocky Marciano (Favreau) turns to the ring as his ticket out. Training twice as hard and twice as long as anyone else, he pounds his way to victory and his reputation quickly spreads as "the guy to beat." But behind the gloves Rocky is unhappy with his gift and he's thinking of retiring. So, with the fate of his career hanging in the balance, he finds a way to unleash his thunder againthis time against his biggest hero: Joe Louis!"""
]
}
}
...
]
Above, we can see a document whose _id is 1366. Next, we look for documents similar to that document, using the following query:
GET movies/_search
{
  "size": 5,
  "fields": [
    "overview"
  ],
  "query": {
    "more_like_this": {
      "fields": [
        "overview"
      ],
      "like": {
        "_index": "movies",
        "_id": "1366"
      },
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  },
  "_source": false
}
The user may well be interested in the other movies of the franchise, and this is the result we get:
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "312221",
"_score" : 57.172073,
"fields" : {
"overview" : [
"The former World Heavyweight Champion Rocky Balboa serves as a trainer and mentor to Adonis Johnson, the son of his late friend and former rival Apollo Creed."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "184741",
"_score" : 29.255241,
"fields" : {
"overview" : [
"A chorus girl (Marion Davies) and a heavyweight boxer (Clark Gable) are paired romantically as a publicity stunt."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1371",
"_score" : 27.638262,
"fields" : {
"overview" : [
"Now the world champion, Rocky Balboa is living in luxury and only fighting opponents who pose no threat to him in the ring. His lifestyle of wealth and idleness is shaken when a powerful young fighter known as Clubber Lang challenges him to a bout. After taking a pounding from Lang, the humbled champ turns to former bitter rival Apollo Creed to help him regain his form for a rematch with Lang."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1246",
"_score" : 26.655176,
"fields" : {
"overview" : [
"When he loses a highly publicized virtual boxing match to ex-champ Rocky Balboa, reigning heavyweight titleholder, Mason Dixon retaliates by challenging Rocky to a nationally televised, 10-round exhibition bout. To the surprise of his son and friends, Rocky agrees to come out of retirement and face an opponent who's faster, stronger and thirty years his junior."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1367",
"_score" : 25.732662,
"fields" : {
"overview" : [
"""After Rocky goes the distance with champ Apollo Creed, both try to put the fight behind them and move on. Rocky settles down with Adrian but can't put his life together outside the ring, while Creed seeks a rematch to restore his reputation. Soon enough, the "Master of Disaster" and the "Italian Stallion" are set on a collision course for a climactic battle that is brutal and unforgettable."""
]
}
}
]
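The two-step flow used here (look a movie up by title, then ask for documents similar to the top hit) can be sketched in Python as a function that turns a search response into the follow-up request body. The function name is hypothetical, and sample_response is a trimmed-down stand-in for the real response shown earlier; nothing here talks to a cluster.

```python
def mlt_from_first_hit(search_response, fields, size=5):
    """Build a more_like_this body from the top hit of a previous search (hypothetical helper)."""
    hits = search_response["hits"]["hits"]
    if not hits:
        return None  # nothing matched, so there is nothing to compare against
    top = hits[0]
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": [{"_index": top["_index"], "_id": top["_id"]}],
                "min_term_freq": 1,
                "max_query_terms": 12,
            }
        },
    }


# Trimmed-down stand-in for the response of the title search.
sample_response = {
    "hits": {"hits": [{"_index": "movies", "_id": "1366", "_score": 11.072304}]}
}
follow_up = mlt_from_first_hit(sample_response, ["overview"])
print(follow_up["query"]["more_like_this"]["like"])
```

Because the input is a reference to an indexed document and include defaults to false, the movie itself does not come back in the results, which matches what we saw above.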
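Summary
The more_like_this query has great potential to add extra functionality to our search solutions. On the one hand it is very simple to use; on the other, it offers interesting parameters for tuning exactly how we search for similar documents. This query is also used in NLP contexts, more specifically for text classification.
In any case, I hope this article has sparked interest in experimenting with this and other kinds of queries available in Elasticsearch, even though they are somewhat less commonly used.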
