Skip to main content

Lucene, sample JAVA code to Index a file folder


Please find below the Lucene sample code to index the files inside a folder. This code will index ( or create fields for ) the file path, file title, modified date and contents of the file.

This java code is expecting the index path ( where the index files will be created ) and file folder path as program arguments like  "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH]" .

The logic of the code is to iterate through each file in the folder and call the method indexDoc(), where the above said fields are created and added to a Document object. This means that for each file there will be a document object and these document objects will be added to IndexWriter.

Please find below the screen shot of the indexd file folder :



import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
 public static void main(String[] args) {
  String usage = "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH] \n\n"
   + "This indexes the documents in DOCS_PATH, creating a Lucene index in"
   + "INDEX_PATH that can be searched with SearchFiles";
  String indexPath = "index";
  String docsPath = null;
  for (int i = 0; i < args.length; i++) {
   if ("-index".equals(args[i])) {
    indexPath = args[i + 1];
    i++;
   } else if ("-docs".equals(args[i])) {
    docsPath = args[i + 1];
    i++;
   }
  }
  if (docsPath == null) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }
  final File docDir = new File(docsPath);
  if (!docDir.exists() || !docDir.canRead()) {
   System.out.println("Document directory "
   + docDir.getAbsolutePath()
   + "does not exist or is not readable, please check the path");
   System.exit(1);
  }
  Date start = new Date();
  try {
   System.out.println("Indexing to directory '" + indexPath + "'...");
   Directory dir = FSDirectory.open(new File(indexPath));

   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,analyzer);
   iwc.setOpenMode(OpenMode.CREATE);
   IndexWriter writer = new IndexWriter(dir, iwc);
   findFilesAndIndex(writer, docDir);

   writer.close();
   Date end = new Date();
   System.out.println(end.getTime() - start.getTime()+ " total milliseconds");
  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }
 }

 static void findFilesAndIndex(IndexWriter writer, File file) throws IOException {
  FileInputStream fis = null;
  try{
  if (file.canRead()) {
   if (file.isDirectory()) {
   String[] files = file.list();
   if (files != null) {
    for (int i = 0; i < files.length; i++) {
    findFilesAndIndex(writer, new File(file, files[i]));
    }
   }
   } else {
    fis = new FileInputStream(file);
    indexDoc(writer, file,fis);
   }
  }
  }catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }finally {
   if(fis != null){
    fis.close();
   }
  }
 }

 static void indexDoc(IndexWriter writer, File file,FileInputStream fis) throws IOException {
  Document doc = new Document();
  Field pathField = new Field("path", file.getPath(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(pathField);

  Field titleField = new Field("title", file.getName(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(titleField);

  NumericField modifiedField = new NumericField("modified");
  modifiedField.setLongValue(file.lastModified());
  doc.add(modifiedField);

  doc.add(new Field("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

  System.out.println("adding " + file);
  writer.addDocument(doc);
 }
}

Comments

  1. Exact code i was looking for, awesome logic, thanks for the share.Sample Documents

    ReplyDelete

Post a Comment

Popular posts from this blog

How to convert your Blogger Blog to PDF ?

You can use a website called "blogbooker" @  http://www.blogbooker.com/blogger.php   to convert your Blogger Blog to a PDF . Please find the steps below : 1. Save your blog as an xml using Blogger Settings - Other - Export Blog option 2. Go to the website " http://www.blogbooker.com/blogger.php " and select this XML , give your blog address and select the options like date range, page size, font, ... 3. Click the  "Create Your BlogBook" button to view and save your blog as PDF

ATG Product Catalog schema ER diagram

Check out the O rder schema ER-Diagram @   http://tips4ufromsony.blogspot.in/2012/02/atg-order-schema-er-diagram.html Check out the User Profile  schema ER-Diagram @ http://tips4ufromsony.blogspot.in/2012/03/atg-user-profile-schema-er-diagram.html If you would like to know the relationship between different Product Catalog tables, please find below screen shots of  Product Catalog schema ER Diagrams.

SOAP UI faster start up

If you feel like your SOAP UI is starting up very slowly, check whether this is due to any start up web page call. You can check this @ Preferences - UI Settings - Show Startup Page ==> Here you can deselect this option to improve the start-up time.

ATG License Files and Oracle Software Delivery Cloud

Oracle no longer generates license keys that are specific to your IP address(es). Oracle now provides generic license files that enable you to fully utilize all of the features for which you are licensed. Please find the ATG License files for different ATG versions @ http://www.oracle.com/us/support/licensecodes/atg/index. @ Oracle Software Delivery Cloud , you can find downloads for all licensable Oracle products –> https://edelivery.oracle.com/ Please find below a screen shot for ATG products download :

ATG User Profile schema ER diagram

Check out the Product Catalog  schema ER-Diagram @  http://tips4ufromsony.blogspot.in/2012/01/atg-product-catalog-schema-er-diagram.html Check out the O rder schema ER-Diagram @   http://tips4ufromsony.blogspot.in/2012/02/atg-order-schema-er-diagram.html If you would like to know the relationship between different User Profile schema tables, please find below screen shot of  Profile schema ER Diagrams.