Skip to main content

Lucene, sample JAVA code to Index a file folder


Please find below the Lucene sample code to index the files inside a folder. This code will index ( or create fields for ) the file path, file title, modified date and contents of the file.

This java code is expecting the index path ( where the index files will be created ) and file folder path as program arguments like  "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH]" .

The logic of the code is to iterate through each file in the folder and call the method indexDoc(), where the above said fields are created and added to a Document object. This means that for each file there will be a document object and these document objects will be added to IndexWriter.

Please find below the screen shot of the indexd file folder :



import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
 public static void main(String[] args) {
  String usage = "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH] \n\n"
   + "This indexes the documents in DOCS_PATH, creating a Lucene index in"
   + "INDEX_PATH that can be searched with SearchFiles";
  String indexPath = "index";
  String docsPath = null;
  for (int i = 0; i < args.length; i++) {
   if ("-index".equals(args[i])) {
    indexPath = args[i + 1];
    i++;
   } else if ("-docs".equals(args[i])) {
    docsPath = args[i + 1];
    i++;
   }
  }
  if (docsPath == null) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }
  final File docDir = new File(docsPath);
  if (!docDir.exists() || !docDir.canRead()) {
   System.out.println("Document directory "
   + docDir.getAbsolutePath()
   + "does not exist or is not readable, please check the path");
   System.exit(1);
  }
  Date start = new Date();
  try {
   System.out.println("Indexing to directory '" + indexPath + "'...");
   Directory dir = FSDirectory.open(new File(indexPath));

   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,analyzer);
   iwc.setOpenMode(OpenMode.CREATE);
   IndexWriter writer = new IndexWriter(dir, iwc);
   findFilesAndIndex(writer, docDir);

   writer.close();
   Date end = new Date();
   System.out.println(end.getTime() - start.getTime()+ " total milliseconds");
  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }
 }

 static void findFilesAndIndex(IndexWriter writer, File file) throws IOException {
  FileInputStream fis = null;
  try{
  if (file.canRead()) {
   if (file.isDirectory()) {
   String[] files = file.list();
   if (files != null) {
    for (int i = 0; i < files.length; i++) {
    findFilesAndIndex(writer, new File(file, files[i]));
    }
   }
   } else {
    fis = new FileInputStream(file);
    indexDoc(writer, file,fis);
   }
  }
  }catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }finally {
   if(fis != null){
    fis.close();
   }
  }
 }

 static void indexDoc(IndexWriter writer, File file,FileInputStream fis) throws IOException {
  Document doc = new Document();
  Field pathField = new Field("path", file.getPath(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(pathField);

  Field titleField = new Field("title", file.getName(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(titleField);

  NumericField modifiedField = new NumericField("modified");
  modifiedField.setLongValue(file.lastModified());
  doc.add(modifiedField);

  doc.add(new Field("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

  System.out.println("adding " + file);
  writer.addDocument(doc);
 }
}

Comments

  1. Exact code i was looking for, awesome logic, thanks for the share.Sample Documents

    ReplyDelete

Post a Comment

Popular posts from this blog

ATG : Oracle started new discussion forum for ATG

Oracle has started a new ATG discussion forum on oracle discussion forums.  It has a main ATG section and is divided into technical and business categories. You can access the url   http://forums.oracle.com/forums/category.jspa?categoryID=503 .  After Oracle acquired ATG, this was much expected and we can hope this forum might give us a better chance to discuss our ATG doubts and more people will come and discuss about ATG. Find the ATG docs @   http://www.oracle.com/technetwork/indexes/documentation/atgwebcommerce-393465.html

Eclipse plug-in to create Class and Sequence diagrams

ModelGoon is an Eclipse plug-in avaiable for UML diagram generation from Java code. It can be used to generate Package Dependencies Diagram, Class Diagram, Interaction Diagram and Sequence Diagram. You coud get it from http://marketplace.eclipse.org/content/modelgoon-uml4java Read more about it and see some vedios about how to create the class and sequence diagram @ http://www.modelgoon.org/?tag=eclipse-plugin Find some snapshots below which gives an idea about the diagram generation.

Google Chrome shortcut keys

If you are a Google Chromey guy, please find below the list of shortcut keys for some of the most used features  :-) Find more shortcut keys @  http://www.google.com/support/chrome/bin/static.py?page=guide.cs&guide=25799&topic=28650

MPC way of subtitle download

How can we easily download the subtitle of a video ? If you are using the Media Player Classic, there is an easy option to download subtitle (if you are already connected to internet). Go to File --> Subtitle Databse --> Download. Now a list of subtitles will be listed including the language and you can choose the one and also can replace the existing one (if any). You have the option to save this subtitle (@  File --> Save Subtitle). Please find below some snaps:

ATG Search related repositories

Other than the ProductCatalog repository, ATG search uses the SearchConfigurationRepository, SearchAdminRepository and RefinementRepository. Search Administration requires access to all these 3 repositories. The client application must have access to the routing repository.  Please find below the details of each repository.    1. SearchConfigurationRepository :               #  \atg\search\routing\repository\SearchConfigurationRepository.               #  It stores information about search engines, index structure and deployment information.               #  refers to all “ rout_* "  tables.    2. SearchAdminRepository :                #  \atg\searchadmin\SearchAdminRepository.               #  It stores information about search adminis...