Skip to main content

Lucene, sample JAVA code to Index a file folder


Please find below the Lucene sample code to index the files inside a folder. This code will index ( or create fields for ) the file path, file title, modified date and contents of the file.

This java code is expecting the index path ( where the index files will be created ) and file folder path as program arguments like  "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH]" .

The logic of the code is to iterate through each file in the folder and call the method indexDoc(), where the above said fields are created and added to a Document object. This means that for each file there will be a document object and these document objects will be added to IndexWriter.

Please find below the screen shot of the indexd file folder :



import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
 public static void main(String[] args) {
  String usage = "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH] \n\n"
   + "This indexes the documents in DOCS_PATH, creating a Lucene index in"
   + "INDEX_PATH that can be searched with SearchFiles";
  String indexPath = "index";
  String docsPath = null;
  for (int i = 0; i < args.length; i++) {
   if ("-index".equals(args[i])) {
    indexPath = args[i + 1];
    i++;
   } else if ("-docs".equals(args[i])) {
    docsPath = args[i + 1];
    i++;
   }
  }
  if (docsPath == null) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }
  final File docDir = new File(docsPath);
  if (!docDir.exists() || !docDir.canRead()) {
   System.out.println("Document directory "
   + docDir.getAbsolutePath()
   + "does not exist or is not readable, please check the path");
   System.exit(1);
  }
  Date start = new Date();
  try {
   System.out.println("Indexing to directory '" + indexPath + "'...");
   Directory dir = FSDirectory.open(new File(indexPath));

   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,analyzer);
   iwc.setOpenMode(OpenMode.CREATE);
   IndexWriter writer = new IndexWriter(dir, iwc);
   findFilesAndIndex(writer, docDir);

   writer.close();
   Date end = new Date();
   System.out.println(end.getTime() - start.getTime()+ " total milliseconds");
  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }
 }

 static void findFilesAndIndex(IndexWriter writer, File file) throws IOException {
  FileInputStream fis = null;
  try{
  if (file.canRead()) {
   if (file.isDirectory()) {
   String[] files = file.list();
   if (files != null) {
    for (int i = 0; i < files.length; i++) {
    findFilesAndIndex(writer, new File(file, files[i]));
    }
   }
   } else {
    fis = new FileInputStream(file);
    indexDoc(writer, file,fis);
   }
  }
  }catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }finally {
   if(fis != null){
    fis.close();
   }
  }
 }

 static void indexDoc(IndexWriter writer, File file,FileInputStream fis) throws IOException {
  Document doc = new Document();
  Field pathField = new Field("path", file.getPath(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(pathField);

  Field titleField = new Field("title", file.getName(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(titleField);

  NumericField modifiedField = new NumericField("modified");
  modifiedField.setLongValue(file.lastModified());
  doc.add(modifiedField);

  doc.add(new Field("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

  System.out.println("adding " + file);
  writer.addDocument(doc);
 }
}

Comments

  1. Exact code i was looking for, awesome logic, thanks for the share.Sample Documents

    ReplyDelete

Post a Comment

Popular posts from this blog

Google Chrome shortcut keys

If you are a Google Chromey guy, please find below the list of shortcut keys for some of the most used features  :-) Find more shortcut keys @  http://www.google.com/support/chrome/bin/static.py?page=guide.cs&guide=25799&topic=28650

ATG Search - search engine tuning settings

In this blog, I am going to list the best tuning settings for ATG Search engine. The AESoapConfig.xml, AESoapWaspConfig.xml  and AEConfig.xml are the xmls referred below and you can find it @  <ATG_DIR>\<Searchx.x>\SearchEngine\<operating_system>\bin\ folder. (1)  Make sure that the AESoapConfig.xml's rwTimeout is less than or equal to routing's readTimeoutMs. You could find the routing's readTimeoutMs @ atg\search\routing\SearchEngineService component.               rwTimeout is the  length of time in seconds to wait before a read or write operation times out on an active connection. The number can be decreased to improve performance. However, a value that is too low could cause slow connections to be prematurely closed. (2)  Adjust the number of engine threads to match the number of CPUs available to the engine. Note that the minimal value for maxThreads and maxSpar...

ATG User Profile schema ER diagram

Check out the Product Catalog  schema ER-Diagram @  http://tips4ufromsony.blogspot.in/2012/01/atg-product-catalog-schema-er-diagram.html Check out the O rder schema ER-Diagram @   http://tips4ufromsony.blogspot.in/2012/02/atg-order-schema-er-diagram.html If you would like to know the relationship between different User Profile schema tables, please find below screen shot of  Profile schema ER Diagrams.  

ATG - how to create and deploy a new atg module

ATG products are packaged as a number of separate application modules. Application modules exist in the ATG installation as a set of directories defined by a manifest file. To create a new module, follow the below steps : Create a module directory within your ATG installation.  Create a META-INF directory within the module directory. Note that this directory must be named META-INF.  Create a manifest file named MANIFEST.MF and include it in the META-INF directory for the module. The manifest contains the meta-data describing the module. A module located at <ATG2007.1dir>/MyModule is named MyModule and a module located at <ATG2007.1dir>/CustomModules/MyModule is named CustomModules.MyModule. Within the subdirectory that holds the module, any number of files may reside in any desired order. These files are the module resources (EAR files for J2EE applications, WAR files for web applications, EJB-JAR files for Enterprise JavaBeans, JAR files of Java class...

ATG : Oracle started new discussion forum for ATG

Oracle has started a new ATG discussion forum on oracle discussion forums.  It has a main ATG section and is divided into technical and business categories. You can access the url   http://forums.oracle.com/forums/category.jspa?categoryID=503 .  After Oracle acquired ATG, this was much expected and we can hope this forum might give us a better chance to discuss our ATG doubts and more people will come and discuss about ATG. Find the ATG docs @   http://www.oracle.com/technetwork/indexes/documentation/atgwebcommerce-393465.html