Lucene Indexing


IndexWriter

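IndexWriter is the class responsible for indexing in Lucene. Use this class to index the contents of documents.
Note: this class cannot be used to look up what was indexed. That is done later by the class responsible for searching.
The index is stored as files, in Lucene's search storage data structure called segments. This structure is described in detail later.
The Lucene version 3.0 code from the book is converted here to version 8.6.2.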

Constructor

    private IndexWriter indexWriter;
    public Indexer(String indexDir) throws IOException {
        // Changed: open(new File) -> open(Path)
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        indexWriter = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    }

    public void close() throws IOException{
        indexWriter.close();
    }

Initializes the IndexWriter and configures it with a StandardAnalyzer.

Indexing process method

    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        // listFiles() returns null when dataDir does not exist or is not a directory
        if (files == null) {
            throw new IOException("Cannot list directory: " + dataDir);
        }
        for (File file : files) {
            if (!file.isDirectory() && !file.isHidden() && file.exists() && file.canRead() && (filter == null || filter.accept(file))) {
                indexFile(file);
            }
        }
        // Changed: indexWriter.numDocs() -> indexWriter.getDocStats().numDocs
        return indexWriter.getDocStats().numDocs;
    }

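Loops over the files in the specified folder and indexes each one that passes the checks. The return value is the number of documents in the index, read from getDocStats().numDocs, so it also counts documents that were already in the index before this run.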
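The same file selection can also be written with java.nio.file, which reports a missing or invalid directory with an exception instead of a null array. The following is a minimal sketch, not part of the original Indexer: the class name IndexableFiles and the helper list are hypothetical, and the suffix check stands in for the FileFilter.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IndexableFiles {
    // Hypothetical helper: returns the regular, readable, non-hidden files
    // directly inside dir whose lower-cased names end with the given suffix.
    static List<Path> list(Path dir, String suffix) throws IOException {
        // Files.list throws if dir is missing or is not a directory.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(Files::isReadable)
                    .filter(p -> !isHidden(p))
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Files.isHidden declares IOException; treat an unreadable entry as hidden.
    private static boolean isHidden(Path p) {
        try {
            return Files.isHidden(p);
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index-demo");
        Files.createFile(dir.resolve("a.txt"));
        Files.createFile(dir.resolve("b.md"));
        Files.createDirectory(dir.resolve("sub.txt")); // a directory, so it is skipped
        System.out.println(list(dir, ".txt"));
    }
}
```

Each returned Path could then be handed to indexFile via path.toFile().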

Field setup and indexing method

    public void indexFile(File file) throws Exception {
        System.out.println("Indexing " + file.getCanonicalPath());
        Document doc = getDocument(file);
        indexWriter.addDocument(doc);
    }
    public Document getDocument(File file) throws Exception {
        Document doc = new Document();
        // Changed: new Field() -> new TextField()
        // Changed: Field.Index.NOT_ANALYZED -> removed
        doc.add(new TextField("contentsFile", new FileReader(file)));
        doc.add(new StringField("contentsString", FileUtils.readFileToString(file, StandardCharsets.UTF_8), Field.Store.YES));
        doc.add(new StringField("filename", file.getName(), Field.Store.YES));

        return doc;
    }

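A field created with new TextField("field", new FileReader("filename")) is not stored (un-stored).
In other words, indexing this field makes it searchable, but its value cannot be returned for display in search results.
To create a field whose value can be returned, pass a String value together with Field.Store.YES, for example new StringField("field", "file contents", Field.Store.YES). The TextField(String, String, Field.Store) constructor accepts a store option as well.
The method creates a Document, sets the file contents and the file name as indexed fields, and adds the Document to the IndexWriter for indexing.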
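The difference between a stored and an un-stored field becomes visible when a hit is read back. The sketch below assumes lucene-core 8.x on the classpath. The class name StoredVsUnstored, the in-memory ByteBuffersDirectory, and the sample text are choices made for this illustration, not part of the original Indexer.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

import java.io.StringReader;

public class StoredVsUnstored {
    // Indexes one document, searches it by content, and returns
    // { value of "contentsFile", value of "filename" } read from the hit.
    static String[] lookup() throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index, for the demo only
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Reader-based TextField: tokenized and indexed, never stored.
            doc.add(new TextField("contentsFile", new StringReader("hello lucene")));
            // StringField with Store.YES: indexed as one term and stored.
            doc.add(new StringField("filename", "test1.txt", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("contentsFile", "lucene")), 1);
            Document found = searcher.doc(hits.scoreDocs[0].doc);
            return new String[] { found.get("contentsFile"), found.get("filename") };
        }
    }

    public static void main(String[] args) throws Exception {
        String[] values = lookup();
        System.out.println("contentsFile = " + values[0]); // not stored, so null
        System.out.println("filename     = " + values[1]);
    }
}
```

The document is found through contentsFile, yet only filename comes back with a value, which matches the searchable-but-not-displayable behavior described above.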

Field.Store

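Store.YES: Stores the original value of the field in the index. Use it for content that must be shown, such as search results.

Store.NO: The value is not stored. Combined with an indexed field type, the field can still be searched even though the original text is not kept.

Store.COMPRESS: Compresses the stored value. It was used for large text or binary content in older Lucene releases. It is not available in version 8.6.2, where Field.Store has only YES and NO.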

Field

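StringField: Indexed but not tokenized. The entire string becomes a single token (like the keyword type in ES).

TextField: Indexed and tokenized. Term vectors are not generated.

StoredField: The value is stored, so IndexSearcher.doc(int) and IndexReader.document() can return the field and its value.
It is mainly used for numeric values. (TextField and StringField can store text themselves.)
However, a StoredField alone does not support range search or sorting. To get those features, add a companion field to the same document: a point field such as IntPoint for range queries, and a doc-values field such as NumericDocValuesField for sorting.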
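The difference between the StringField and TextField descriptions above can be pictured with a toy model. The sketch below is not Lucene code: keywordTerms and tokenizedTerms are hypothetical helpers, and the splitting rule only roughly approximates what an analyzer such as StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FieldTermsToy {
    // Keyword-style indexing (the StringField idea): the whole value is one term.
    static List<String> keywordTerms(String value) {
        return Collections.singletonList(value);
    }

    // Tokenized indexing (the TextField idea), approximated by lowercasing
    // and splitting on anything that is not a letter or a digit.
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String value = "Lucene Index Writer";
        System.out.println(keywordTerms(value));   // one term: the full string
        System.out.println(tokenizedTerms(value)); // three lower-cased terms
    }
}
```

In this model a lookup for the term "lucene" matches the tokenized field but not the keyword field, which only matches the exact string "Lucene Index Writer".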

FileFilter Override

    private static class TextFilesFilter implements FileFilter{
        @Override
        public boolean accept(File file) {
            return file.getName().toLowerCase().endsWith(".txt");
        }
    }

Only files with the .txt extension are selected for indexing.
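Because FileFilter has a single abstract method, the same rule can also be written as a lambda. A small sketch follows. The class name TxtFilterDemo and the constant TXT_ONLY are hypothetical names for this illustration.

```java
import java.io.File;
import java.io.FileFilter;

public class TxtFilterDemo {
    // Same rule as TextFilesFilter, written as a lambda.
    static final FileFilter TXT_ONLY =
            file -> file.getName().toLowerCase().endsWith(".txt");

    public static void main(String[] args) {
        // accept() only inspects the name here, so the files need not exist.
        System.out.println(TXT_ONLY.accept(new File("notes.txt")));
        System.out.println(TXT_ONLY.accept(new File("REPORT.TXT")));
        System.out.println(TXT_ONLY.accept(new File("notes.txt.bak")));
    }
}
```

The lower-casing step is what makes REPORT.TXT pass, while notes.txt.bak is rejected because the name does not end with .txt.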

Main method

    public static void main(String[] args) throws Exception{
        String indexDir = "/data/test/index_data";       // index files are created in this directory
        String dataDir = "/data/test";                   // files in this directory are indexed

        Indexer indexer = new Indexer(indexDir);
        int count = 0;
        try{
            count = indexer.index(dataDir, new TextFilesFilter());
        }catch (Exception e){
            e.printStackTrace();
        }finally {
            indexer.close();
        }
        System.out.println("Indexing num " + count);
    }

Runs the indexing process with TextFilesFilter so that only files with the .txt extension are indexed, and prints the document count returned by index().

Output

Task :Indexer.main()
Indexing /System/Volumes/Data/data/test/test1.txt
Indexing /System/Volumes/Data/data/test/test2.txt
Indexing num 2

Generated result files
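One way to inspect what the run produced is to list the index directory. The sketch below uses only the JDK. The class name ListIndexFiles is hypothetical, and the default path is the indexDir value from the main method. Lucene generally writes a segments_N file and a write.lock file next to the per-segment files, but the exact file names depend on the version and settings, so no sample listing is shown here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListIndexFiles {
    // Returns the names of the entries directly inside indexDir, sorted.
    static List<String> fileNames(Path indexDir) throws IOException {
        try (Stream<Path> entries = Files.list(indexDir)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the index directory as the first argument, or fall back to
        // the indexDir used by Indexer.main().
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "/data/test/index_data");
        fileNames(indexDir).forEach(System.out::println);
    }
}
```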

Reference

More information on this topic (Lucene indexing) can be found at https://velog.io/@mertyn88/lucene-색인
