
Lucene: sample Java code to index a file folder


The Lucene sample code below indexes the files inside a folder. For each file it creates index fields for the file path, the title (the file name), the last-modified date and the contents of the file.

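The program takes the index path (where the index files will be created) and the path of the folder to index as program arguments: "java IndexFiles [-index INDEX_PATH] [-docs DOCS_PATH]". If -index is omitted, the index is written to a directory named "index"; -docs is required, and the program prints the usage message and exits without it.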
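One way to parse these two flags so that a flag with no value is reported instead of being read past the end of the argument array is sketched below. This is only an illustration: the class name IndexArgs and its parse method are hypothetical and not part of the original program, and it assumes the same default index directory ("index") as the listing.

```java
public class IndexArgs {
    public final String indexPath;
    public final String docsPath; // null when -docs was not supplied

    private IndexArgs(String indexPath, String docsPath) {
        this.indexPath = indexPath;
        this.docsPath = docsPath;
    }

    /** Parses [-index INDEX_PATH] [-docs DOCS_PATH]; unknown arguments are ignored. */
    public static IndexArgs parse(String[] args) {
        String indexPath = "index"; // default index directory, as in the listing
        String docsPath = null;
        for (int i = 0; i < args.length; i++) {
            boolean isIndex = "-index".equals(args[i]);
            boolean isDocs = "-docs".equals(args[i]);
            if (!isIndex && !isDocs) {
                continue;
            }
            if (i + 1 >= args.length) {
                // A flag was given as the last argument, with nothing after it.
                throw new IllegalArgumentException("Missing value after " + args[i]);
            }
            String value = args[++i]; // consume the flag's value
            if (isIndex) {
                indexPath = value;
            } else {
                docsPath = value;
            }
        }
        return new IndexArgs(indexPath, docsPath);
    }

    public static void main(String[] args) {
        IndexArgs parsed = parse(args);
        System.out.println("index=" + parsed.indexPath + ", docs=" + parsed.docsPath);
    }
}
```

The caller still has to check docsPath for null and print the usage message, as the main method of the listing does.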

The code walks the folder recursively and calls the method indexDoc() for each file, where the fields listed above are created and added to a Document object. Each file therefore gets its own Document object, and every Document is added to the index through the IndexWriter.

[Screenshot: the indexed file folder]



import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
 public static void main(String[] args) {
  String usage = "java IndexFiles [-index INDEX_PATH] [-docs DOCS_PATH]\n\n"
   + "This indexes the documents in DOCS_PATH, creating a Lucene index in "
   + "INDEX_PATH that can be searched with SearchFiles";
  String indexPath = "index";
  String docsPath = null;
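  for (int i = 0; i < args.length; i++) {
   // A flag counts only when a value follows it, so args[i + 1]
   // can never run past the end of the array.
   if ("-index".equals(args[i]) && i + 1 < args.length) {
    indexPath = args[i + 1];
    i++;
   } else if ("-docs".equals(args[i]) && i + 1 < args.length) {
    docsPath = args[i + 1];
    i++;
   }
  }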
  if (docsPath == null) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }
  final File docDir = new File(docsPath);
  if (!docDir.exists() || !docDir.canRead()) {
   System.out.println("Document directory '"
   + docDir.getAbsolutePath()
   + "' does not exist or is not readable, please check the path");
   System.exit(1);
  }
  Date start = new Date();
  try {
   System.out.println("Indexing to directory '" + indexPath + "'...");
   Directory dir = FSDirectory.open(new File(indexPath));

   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,analyzer);
   iwc.setOpenMode(OpenMode.CREATE);
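   IndexWriter writer = new IndexWriter(dir, iwc);
   try {
    findFilesAndIndex(writer, docDir);
   } finally {
    // Always close the writer so pending changes are committed
    // and the index write lock is released.
    writer.close();
   }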
   Date end = new Date();
   System.out.println(end.getTime() - start.getTime()+ " total milliseconds");
  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }
 }

 static void findFilesAndIndex(IndexWriter writer, File file) throws IOException {
  FileInputStream fis = null;
  try {
   if (file.canRead()) {
    if (file.isDirectory()) {
     // Recurse into each entry; list() returns null if an I/O error occurs.
     String[] files = file.list();
     if (files != null) {
      for (int i = 0; i < files.length; i++) {
       findFilesAndIndex(writer, new File(file, files[i]));
      }
     }
    } else {
     fis = new FileInputStream(file);
     indexDoc(writer, file, fis);
    }
   }
  } catch (IOException e) {
   // Report the failure and continue, so one bad file does not abort the run.
   System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
  } finally {
   if (fis != null) {
    fis.close();
   }
  }
 }

 static void indexDoc(IndexWriter writer, File file,FileInputStream fis) throws IOException {
  Document doc = new Document();
  Field pathField = new Field("path", file.getPath(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(pathField);

  Field titleField = new Field("title", file.getName(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  titleField.setOmitTermFreqAndPositions(true);
  doc.add(titleField);

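  // Indexed as a long (milliseconds since the epoch) so it can be used in
  // numeric range queries; this constructor does not store the value.
  NumericField modifiedField = new NumericField("modified");
  modifiedField.setLongValue(file.lastModified());
  doc.add(modifiedField);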

  // The contents are read as UTF-8, tokenized and indexed, but not stored in the index.
  doc.add(new Field("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

  System.out.println("adding " + file);
  writer.addDocument(doc);
 }
}
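The recursive walk in findFilesAndIndex() can be tried on its own, without Lucene, to check which files a run would visit before building an index. The sketch below mirrors that traversal (skip unreadable entries, recurse into directories, handle regular files) but only collects the files into a list. The class name FileWalkSketch and the method collectFiles are hypothetical names chosen for this illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FileWalkSketch {

    /** Returns every readable regular file under root, visiting directories depth-first. */
    public static List<File> collectFiles(File root) {
        List<File> found = new ArrayList<File>();
        collect(root, found);
        return found;
    }

    private static void collect(File file, List<File> found) {
        if (!file.canRead()) {
            return; // missing or unreadable entries are skipped
        }
        if (file.isDirectory()) {
            File[] children = file.listFiles(); // null if an I/O error occurs
            if (children != null) {
                for (File child : children) {
                    collect(child, found);
                }
            }
        } else {
            found.add(file); // the indexer would call indexDoc() at this point
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File f : collectFiles(root)) {
            System.out.println("would index " + f.getPath());
        }
    }
}
```

Running it with the same folder that is passed to -docs prints one line per file that the indexer would hand to indexDoc().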

