[1K Dataset + SpringBoot + Thymeleaf] A Search Engine Built on Lucene Full-Text Retrieval


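1. Requirements analysis

Implement a search box that can index a specified dataset (the data comes from a database)
Display the indexed content, including images
Implement basic features such as text classification and sorting
Index more than 1K records (you may crawl the data yourself)
After each search, report the retrieval efficiency of that search
Suitable as a graduation project, a course project, or a student innovation project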

1.1 Project screenshots

1.1.1 Front-end pages:

1.1.2 Database data:

2. Technology stack

SpringBoot
Thymeleaf template rendering
MySQL
IK Chinese analyzer
Lucene indexing
JSON conversion
HTML+CSS+JavaScript

2.1 Project version information:

JDK 1.8
MySQL 5.7.29
Lucene 7.7.2
Maven 3.6.3
Windows 10
IntelliJ IDEA 2019 2.14
Navicat for MySQL 12.1.2

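3. Technical and theoretical background

The focus of this project is search and indexing, so let's first look at what a search feature actually involves.

The traditional search flow is shown in Figure 01-00.

Figure 01-00

This is the more traditional approach still used in many companies. It suits small data volumes and cannot sustain high concurrency, because every search goes to the database.

The search architecture this project is based on is shown in Figure 01-01.

Figure 01-01

Advantages of the new approach:

Reduces the load on the database
Speeds up data access, because queries are answered from the index
Business code works against the index through the Lucene API instead of querying the database directly, which separates business logic from data storage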

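There are two ways to query the data:

Sequential search

Sequential search matches the user's query string against the content of every document in turn. If a document contains the string, it is a hit; if not, the scan moves on to the next document, until all documents have been scanned.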
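To make the cost of sequential search concrete, here is a minimal sketch in plain Java. The class name SequentialSearchSketch and the sample documents are made up for illustration; the point is only that every query touches every document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sequential (linear) search: every document is scanned for the query string.
class SequentialSearchSketch {

    // Returns the names of all documents whose text contains the keyword.
    static List<String> search(Map<String, String> documents, String keyword) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            // String matching against the full text of each document
            if (entry.getValue().contains(keyword)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; LinkedHashMap keeps insertion order
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc1", "huawei mate phone");
        documents.put("doc2", "xiaomi phone case");
        documents.put("doc3", "huawei laptop");
        System.out.println(search(documents, "huawei")); // prints [doc1, doc3]
    }
}
```

The work grows with the total amount of text, which is why this approach stops being practical as the dataset grows.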

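Inverted index

An inverted index first tokenizes the whole dataset and builds an index table that maps each term to the documents containing it. A query looks up its terms in the index table and follows the entries to the matching documents, which avoids scanning the same text again on every search. To get an inverted index, this project uses the full-text retrieval toolkit Lucene.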
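The idea can be sketched in a few lines of plain Java. This is only a toy model under simple assumptions (whitespace tokenization, integer document ids, a class named InvertedIndexSketch invented for this example); Lucene's real index structures are far more elaborate.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the ids of the documents containing it.
class InvertedIndexSketch {

    // term -> sorted set of document ids
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize on whitespace (a stand-in for a real analyzer) and record each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A single map lookup replaces the scan over every document.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "huawei mate phone");
        index.add(2, "xiaomi phone case");
        index.add(3, "huawei laptop");
        System.out.println(index.search("phone"));  // prints [1, 2]
        System.out.println(index.search("huawei")); // prints [1, 3]
    }
}
```

The tokenization cost is paid once when the index is built, and each search afterwards is a lookup rather than a scan.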

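3.1 What you need to know about Lucene

Lucene is a library, not a finished product; think of it as a half-product that you build on.
Lucene is a toolkit: you can use it to build indexing tools and your own search engine product.
Lucene is a free, mature, open-source toolkit for Java development.
Lucene can be downloaded from the official website; I will also provide a link to a download package ()
Lucene is a project of the Apache Software Foundation.
Basic flow of full-text retrieval with Lucene:

Raw document data:
- You can crawl the data yourself or use the data file I provide
- The data file is in the accompanying DataSources folder; it is a MySQL file that you can import directly into MySQL
Documents:
- The raw data is fetched in order to build the index. Before indexing, each raw record is turned into a Document, and a Document contains a number of Fields
Analyzing documents (tokenization):

Analyzing a document means tokenizing it: splitting the document content into terms.
Indexing documents:

Indexing exists to make search fast. Tokenization produces terms, and indexing those terms lets Lucene quickly find the content that contains them.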
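To see what the analysis step produces, the following sketch runs a piece of text through Lucene's StandardAnalyzer and collects the resulting terms. It assumes lucene-core 7.7.2 on the classpath; AnalyzeSketch and tokens are names invented for this example, and the project itself uses IKAnalyzer for Chinese text rather than StandardAnalyzer.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Shows the terms an analyzer emits for a piece of text.
class AnalyzeSketch {

    // Runs the analyzer over the text and collects the emitted terms.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> result = new ArrayList<>();
        // The field name only matters for per-field analyzers; "name" is arbitrary here
        try (TokenStream stream = analyzer.tokenStream("name", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // required before the first incrementToken()
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            // StandardAnalyzer splits on word boundaries and lowercases
            System.out.println(tokens(analyzer, "Huawei Mate 30 Pro")); // prints [huawei, mate, 30, pro]
        }
    }
}
```

Passing Chinese text to the same method with StandardAnalyzer yields one term per character, which is the reason a Chinese analyzer such as IK is used when building the real index.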

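4. Hands-on project

4.1 Downloading Lucene

You can download Lucene from the official website or from the resource bundle I provide.

After unpacking:

PS: queryparser is the query parser.

The three packages used here (core, analyzers-common, and queryparser, matching the Lucene dependencies in the pom file) are enough to implement the Lucene features in this project.

4.2 Downloading the data source

It is in the same folder:

Result after importing into MySQL:

4.3 Creating the Java project

Use the DAO implementation class to fetch the data from MySQL: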

package cn.linghu.dao;

import cn.linghu.pojo.Sku;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

/**
 * JDBC implementation of SkuDao: loads every row of tb_sku from MySQL.
 */
public class SkuDaoImpl implements SkuDao {

    public List<Sku> querySkuList() {
        // Product list
        List<Sku> list = new ArrayList<Sku>();
        // SQL statement
        String sql = "SELECT * FROM tb_sku";

        try {
            // Load the database driver
            Class.forName("com.mysql.jdbc.Driver");
            // Connect to the database, create the prepared statement, and fetch the result set;
            // try-with-resources closes all three when the block ends
            try (Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/lucene", "root", "123456");
                 PreparedStatement preparedStatement = connection.prepareStatement(sql);
                 ResultSet resultSet = preparedStatement.executeQuery()) {
                // Map each row of the result set to a Sku
                while (resultSet.next()) {
                    Sku sku = new Sku();
                    sku.setId(resultSet.getString("id"));
                    sku.setName(resultSet.getString("name"));
                    sku.setSpec(resultSet.getString("spec"));
                    sku.setBrandName(resultSet.getString("brand_name"));
                    sku.setCategoryName(resultSet.getString("category_name"));
                    sku.setImage(resultSet.getString("image"));
                    sku.setNum(resultSet.getInt("num"));
                    sku.setPrice(resultSet.getInt("price"));
                    sku.setSaleNum(resultSet.getInt("sale_num"));
                    list.add(sku);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        return list;
    }
}
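The DAO above depends on Sku and SkuDao, which are not listed in this article. Below is a minimal sketch of what they might look like, inferred only from the setter calls and the implements clause. The field types are assumptions, and in the real project each type would be public and live in its own file under cn.linghu.pojo and cn.linghu.dao.

```java
import java.util.List;

// Hypothetical reconstruction of cn.linghu.pojo.Sku; field types are assumed.
class Sku {
    private String id;           // primary key
    private String name;         // product name
    private String spec;         // specification
    private String brandName;    // brand
    private String categoryName; // category
    private String image;        // image path
    private int num;             // stock quantity (assumed meaning)
    private int price;           // price
    private int saleNum;         // units sold (assumed meaning)

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getSpec() { return spec; }
    public void setSpec(String spec) { this.spec = spec; }
    public String getBrandName() { return brandName; }
    public void setBrandName(String brandName) { this.brandName = brandName; }
    public String getCategoryName() { return categoryName; }
    public void setCategoryName(String categoryName) { this.categoryName = categoryName; }
    public String getImage() { return image; }
    public void setImage(String image) { this.image = image; }
    public int getNum() { return num; }
    public void setNum(int num) { this.num = num; }
    public int getPrice() { return price; }
    public void setPrice(int price) { this.price = price; }
    public int getSaleNum() { return saleNum; }
    public void setSaleNum(int saleNum) { this.saleNum = saleNum; }
}

// Hypothetical reconstruction of cn.linghu.dao.SkuDao.
interface SkuDao {
    // Returns every product row as a Sku
    List<Sku> querySkuList();
}
```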

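4.3.1 Core code: the indexing workflow:

1. Collect the data

2. Create the Document objects

3. Create the analyzer (tokenizer)

4. Create the IndexWriterConfig configuration object

5. Create the Directory object, which declares where the index is stored

6. Create the IndexWriter object

7. Write the Documents to the index

8. Release resources
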
package cn.linghu.test;

import cn.linghu.dao.SkuDao;
import cn.linghu.dao.SkuDaoImpl;
import cn.linghu.pojo.Sku;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

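/**
 * Index maintenance
 */
public class TestIndexManager {


    /**
     * Create the index
     */
    @Test
    public void createIndexTest() throws Exception {
        //1. Collect the data
        SkuDao skuDao = new SkuDaoImpl();
        List<Sku> skuList = skuDao.querySkuList();

        //Document collection
        List<Document> docList = new ArrayList<>();

        for (Sku sku : skuList) {
            //2. Create the Document object
            Document document = new Document();

            //Create the Field objects and add them to the Document
            /**
             * Tokenized: no, a tokenized primary key is meaningless
             * Indexed: yes, looking a record up by its id requires the field to be indexed
             * Stored: yes, the id uniquely identifies a record and is usually needed by business code;
             *      the id value can only be read back from a hit if it is stored
             */
            document.add(new StringField("id", sku.getId(), Field.Store.YES));

            /**
             * Tokenized: yes, the name field is searched and its terms are meaningful
             * Indexed: yes, searches run against the name field
             * Stored: yes, the page displays the product name
             */
            document.add(new TextField("name", sku.getName(), Field.Store.YES));

            /**
             * Tokenized: not by the analyzer; IntPoint indexes the value as a numeric point, which range queries rely on
             * Indexed: yes, price range queries require the field to be indexed
             * Stored: yes, the page displays the price; IntPoint is never stored, so a StoredField with the same name is added
             */
            document.add(new IntPoint("price", sku.getPrice()));
            document.add(new StoredField("price", sku.getPrice()));

            /**
             * Tokenized: no, the field is not searched, so it is neither indexed nor tokenized
             * Indexed: no, nobody searches by image path
             * Stored: yes, the page displays the product image
             */
            document.add(new StoredField("image", sku.getImage()));

            /**
             * Tokenized: no, a category is a proper noun and is treated as a whole
             * Indexed: yes, searches filter by category
             * Stored: yes, the page displays the category
             */
            document.add(new StringField("categoryName", sku.getCategoryName(), Field.Store.YES));

            /**
             * Tokenized: no, a brand is a proper noun and is treated as a whole
             * Indexed: yes, searches filter by brand
             * Stored: yes, the page displays the brand
             */
            document.add(new StringField("brandName", sku.getBrandName(), Field.Store.YES));

            //Add the Document to the collection
            docList.add(document);
        }
        //3. Create the analyzer. IKAnalyzer segments Chinese text into words;
        //   StandardAnalyzer works well for English but splits Chinese into single characters, treating each character as a word.
        Analyzer analyzer = new IKAnalyzer();
        //4. Create the Directory object, which points at the location of the index
        Directory  dir = FSDirectory.open(Paths.get("E:\\dir"));
        //5. Create the IndexWriterConfig object, which specifies the analyzer used for tokenization
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        //6. Create the IndexWriter, passing the output location and the config object
        IndexWriter indexWriter = new IndexWriter(dir, config);
        //7. Write the documents to the index
        for (Document doc : docList) {
            indexWriter.addDocument(doc);
        }
        //8. Release resources
        indexWriter.close();
    }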

    /**
     * Update a document in the index
     * @throws Exception
     */
    @Test
    public void updateIndexTest() throws Exception {
        //The new content
        Document document = new Document();

        document.add(new StringField("id", "100000003145", Field.Store.YES));
        document.add(new TextField("name", "xxxx", Field.Store.YES));
        document.add(new IntPoint("price", 123));
        document.add(new StoredField("price", 123));
        document.add(new StoredField("image", "xxxx.jpg"));
        document.add(new StringField("categoryName", "手機", Field.Store.YES));
        document.add(new StringField("brandName", "華為", Field.Store.YES));


        //3. Create the analyzer. Use the same analyzer the index was built with (IKAnalyzer),
        //   so that the updated name field is tokenized consistently with the other documents.
        Analyzer analyzer = new IKAnalyzer();
        //4. Create the Directory object, which points at the location of the index
        Directory  dir = FSDirectory.open(Paths.get("E:\\dir"));
        //5. Create the IndexWriterConfig object, which specifies the analyzer used for tokenization
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        //6. Create the IndexWriter, passing the output location and the config object
        IndexWriter indexWriter = new IndexWriter(dir, config);


        //Update. First argument: the term selecting the documents to replace; second argument: the new document
        indexWriter.updateDocument(new Term("id", "100000003145"), document);

        //8. Release resources
        indexWriter.close();
    }

    /**
     * Test deleting by condition
     * @throws Exception
     */
    @Test
    public void deleteIndexTest() throws Exception {
        //3. Create the analyzer. StandardAnalyzer works well for English but splits Chinese into single characters, treating each character as a word.
        Analyzer analyzer = new StandardAnalyzer();
        //4. Create the Directory object, which points at the location of the index
        Directory  dir = FSDirectory.open(Paths.get("E:\\dir"));
        //5. Create the IndexWriterConfig object, which specifies the analyzer used for tokenization
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        //6. Create the IndexWriter, passing the output location and the config object
        IndexWriter indexWriter = new IndexWriter(dir, config);


        //Test deleting by condition
        //indexWriter.deleteDocuments(new Term("id", "100000003145"));

        //Test deleting everything
        indexWriter.deleteAll();

        //8. Release resources
        indexWriter.close();
    }


    /**
     * Test tuning the speed of index creation
     * @throws Exception
     */
    @Test
    public void createIndexTest2() throws Exception {
        //1. Collect the data
        SkuDao skuDao = new SkuDaoImpl();
        List<Sku> skuList = skuDao.querySkuList();

        //Document collection
        List<Document> docList = new ArrayList<>();

        for (Sku sku : skuList) {
            //2. Create the Document object
            Document document = new Document();
            document.add(new StringField("id", sku.getId(), Field.Store.YES));
            document.add(new TextField("name", sku.getName(), Field.Store.YES));
            document.add(new IntPoint("price", sku.getPrice()));
            document.add(new StoredField("price", sku.getPrice()));
            document.add(new StoredField("image", sku.getImage()));
            document.add(new StringField("categoryName", sku.getCategoryName(), Field.Store.YES));
            document.add(new StringField("brandName", sku.getBrandName(), Field.Store.YES));

            //Add the Document to the collection
            docList.add(document);
        }

        long start = System.currentTimeMillis();

        //3. Create the analyzer. StandardAnalyzer works well for English but splits Chinese into single characters, treating each character as a word.
        Analyzer analyzer = new StandardAnalyzer();
        //4. Create the Directory object, which points at the location of the index
        Directory  dir = FSDirectory.open(Paths.get("E:\\dir"));
        //5. Create the IndexWriterConfig object, which specifies the analyzer used for tokenization
        /**
         * Without tuning, indexing just under 1,000,000 records took 7725 ms
         *
         */
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        //Sets how many documents are buffered in memory before they are flushed to disk as a batch
        //A larger value uses more memory but speeds up writing to disk
        //config.setMaxBufferedDocs(500000);
        //6. Create the IndexWriter, passing the output location and the config object
        IndexWriter indexWriter = new IndexWriter(dir, config);
        //forceMerge(maxNumSegments) merges the index down to at most that many segments.
        //Fewer segments usually make searching faster, but merging is expensive; if used, it belongs after the documents are added
        //indexWriter.forceMerge(1000000);
        //7. Write the documents to the index
        for (Document doc : docList) {
            indexWriter.addDocument(doc);
        }
        //8. Release resources
        indexWriter.close();
        long end = System.currentTimeMillis();
        System.out.println("=====Elapsed time:==========" + (end - start) + "ms");
    }

}

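Create a folder named dir on drive E: to serve as the index directory. After the code runs successfully, the dir folder contains the index files, as shown:

Seeing these files means the index was created successfully!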
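The listing above builds and maintains the index but never queries it. As a rough sketch of the search side, the example below indexes three made-up products with the same id, name, and price field layout, then combines a keyword query on name with a price range filter and prints how long the search took. It assumes lucene-core and lucene-queryparser 7.7.2 on the classpath. To stay runnable without the IK jar or E:\dir it uses StandardAnalyzer and a temporary directory, which differs from the project setup. SearchSketch, doc, and search are illustrative names, not part of the project.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

class SearchSketch {

    // Builds a document with the same id/name/price fields as the indexing code.
    static Document doc(String id, String name, int price) {
        Document d = new Document();
        d.add(new StringField("id", id, Field.Store.YES));
        d.add(new TextField("name", name, Field.Store.YES));
        d.add(new IntPoint("price", price));    // indexed for range queries
        d.add(new StoredField("price", price)); // stored for display
        return d;
    }

    // Returns the ids of documents whose name matches the keywords and whose price lies in [min, max].
    static List<String> search(Directory dir, Analyzer analyzer, String keywords, int min, int max) throws Exception {
        Query nameQuery = new QueryParser("name", analyzer).parse(keywords);
        Query priceQuery = IntPoint.newRangeQuery("price", min, max);
        Query query = new BooleanQuery.Builder()
                .add(nameQuery, BooleanClause.Occur.MUST)     // contributes to the score
                .add(priceQuery, BooleanClause.Occur.FILTER)  // restricts hits, no scoring
                .build();
        List<String> ids = new ArrayList<>();
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            long start = System.currentTimeMillis();
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                ids.add(searcher.doc(hit.doc).get("id"));
            }
            System.out.println("search took " + (System.currentTimeMillis() - start) + " ms");
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        try (Directory dir = FSDirectory.open(Files.createTempDirectory("lucene-sketch"))) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                writer.addDocument(doc("1", "huawei mate phone", 4999));
                writer.addDocument(doc("2", "huawei laptop", 6999));
                writer.addDocument(doc("3", "xiaomi phone", 1999));
            }
            // "huawei" matches documents 1 and 2; the price filter keeps only document 1
            System.out.println(search(dir, analyzer, "huawei", 0, 5000)); // prints [1] after the timing line
        }
    }
}
```

Against the real index, the same analyzer that was used for indexing should be passed to QueryParser, so that query text is split into the same terms as the indexed names.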

4.3.2 Dependencies in the pom file

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>cn.linghu</groupId>
    <artifactId>luceneDemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <skipTests>true</skipTests>
    </properties>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.4.RELEASE</version>
    </parent>

    <dependencies>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>7.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>7.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>7.7.2</version>
        </dependency>

        <!-- Testing -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <!-- MySQL database driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.48</version>
        </dependency>

        <!-- IK Chinese analyzer -->
      <!--  <dependency>
            <groupId>org.wltea.ik-analyzer</groupId>
            <artifactId>ik-analyzer</artifactId>
            <version>8.1.0</version>
        </dependency>-->

        <!-- Web starter dependency -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- Thymeleaf -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
        <!-- JSON conversion tool -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.51</version>
        </dependency>
    </dependencies>

</project>

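A project summary will follow later. Stay tuned, and thanks for your support!

Before running the project, create a dir folder on drive E:
Configure the database connection information, user name, and password
Add or import the required jar packages (for example the IK analyzer jar, whose Maven dependency is commented out in the pom above)

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/272209.html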
