Skip to main content

Lucene, sample JAVA code to Index a file folder


Please find below the Lucene sample code to index the files inside a folder. This code will index ( or create fields for ) the file path, file title, modified date and contents of the file.

This java code is expecting the index path ( where the index files will be created ) and file folder path as program arguments like  "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH]" .

The logic of the code is to iterate through each file in the folder and call the method indexDoc(), where the above said fields are created and added to a Document object. This means that for each file there will be a document object and these document objects will be added to IndexWriter.

Please find below the screen shot of the indexd file folder :



import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
 public static void main(String[] args) {
  String usage = "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH] \n\n"
   + "This indexes the documents in DOCS_PATH, creating a Lucene index in"
   + "INDEX_PATH that can be searched with SearchFiles";
  String indexPath = "index";
  String docsPath = null;
  for (int i = 0; i < args.length; i++) {
   if ("-index".equals(args[i])) {
    indexPath = args[i + 1];
    i++;
   } else if ("-docs".equals(args[i])) {
    docsPath = args[i + 1];
    i++;
   }
  }
  if (docsPath == null) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }
  final File docDir = new File(docsPath);
  if (!docDir.exists() || !docDir.canRead()) {
   System.out.println("Document directory "
   + docDir.getAbsolutePath()
   + "does not exist or is not readable, please check the path");
   System.exit(1);
  }
  Date start = new Date();
  try {
   System.out.println("Indexing to directory '" + indexPath + "'...");
   Directory dir = FSDirectory.open(new File(indexPath));

   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,analyzer);
   iwc.setOpenMode(OpenMode.CREATE);
   IndexWriter writer = new IndexWriter(dir, iwc);
   findFilesAndIndex(writer, docDir);

   writer.close();
   Date end = new Date();
   System.out.println(end.getTime() - start.getTime()+ " total milliseconds");
  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }
 }

 static void findFilesAndIndex(IndexWriter writer, File file) throws IOException {
  FileInputStream fis = null;
  try{
  if (file.canRead()) {
   if (file.isDirectory()) {
   String[] files = file.list();
   if (files != null) {
    for (int i = 0; i < files.length; i++) {
    findFilesAndIndex(writer, new File(file, files[i]));
    }
   }
   } else {
    fis = new FileInputStream(file);
    indexDoc(writer, file,fis);
   }
  }
  }catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }finally {
   if(fis != null){
    fis.close();
   }
  }
 }

 static void indexDoc(IndexWriter writer, File file,FileInputStream fis) throws IOException {
  Document doc = new Document();
  Field pathField = new Field("path", file.getPath(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(pathField);

  Field titleField = new Field("title", file.getName(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(titleField);

  NumericField modifiedField = new NumericField("modified");
  modifiedField.setLongValue(file.lastModified());
  doc.add(modifiedField);

  doc.add(new Field("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

  System.out.println("adding " + file);
  writer.addDocument(doc);
 }
}

Comments

  1. Exact code i was looking for, awesome logic, thanks for the share.Sample Documents

    ReplyDelete

Post a Comment

Popular posts from this blog

ATG - more about Forms and Form Handlers

An ATG form is defined by the dsp:form tag, which typically encloses DSP tags that specify form elements, such as dsp:input that provide direct access to Nucleus component properties. Find below a sample dsp:form tag.    <dsp:form action="/testPages/showPersonProperties.jsp" method="post" target="_top">      <p>Name: <dsp:input bean="/samples/Person.name" type="text"/>      <p>Age: <dsp:input bean="/samples/Person.age" type="text" value="30"/>      <p><dsp:input type="submit" bean="/samples/Person.submit"/> value="Click to submit"/>    </dsp:form>   When the user submits the form, the /samples/Person.name property is set to the value entered in the input field.Unlike standard HTML, which requires the name attribute for most input tags; the name attribute is optional for DSP form element tags. If an input tag omits the n...

ATG CA - BCC home screen : how to add a new link

          Activity source is the property which controls the links on the left nav on the BCC home screen. All activity sources are registered with the ActivityManager component at /atg/bizui/activity/ActivityManager . When rendering the BCC home page, the ActivityManager cycles through all the registered ActivitySource components and displays left navigation links for each of them on the BCC home page. For example if I want to add a new link "My New Link" , below screen shots exaplins how this can be done 1. Add  activityManager.properties to specify the activityresources. In this  activityManager, I specified one MyActivitySource. 2. Add  MyActivitySource.properties  to specify the name of the link and the other details . Here it refers to a bundle properties file.  3. Add  the bundle properties file  to specify the name of the link.  4. Now you could see the new link...

ATG Search architectural flow : Search and Index

I would like to explain the high level ATG Search implementation architecture ( for an online store) through the above diagram. In this diagram 1.x denotes the search functionality and 2.x denotes the indexing functionality. I have given JBoss as the application server. Physical Boxes and Application Servers in the diagram ( as recommended by ATG )  : Estore ( Commerce ) Box --> The box with the estore/site ear (with the site JSPs and Java codes). Search Engine Box --> The box with the search engine application running. Indexing Engine Box --> The box with the indexing engine application running. CA (Content Administration) Box --> The box with the ATG CA ear ( where we could take CA -BCC - Search Administration and configure the search projects) . Search Indexer Box --> The box with the ATG Search Index ear ( to fetch the index data from repository). Note that the engine performing indexing will need access ...

ATG Search - search engine tuning settings

In this blog, I am going to list the best tuning settings for ATG Search engine. The AESoapConfig.xml, AESoapWaspConfig.xml  and AEConfig.xml are the xmls referred below and you can find it @  <ATG_DIR>\<Searchx.x>\SearchEngine\<operating_system>\bin\ folder. (1)  Make sure that the AESoapConfig.xml's rwTimeout is less than or equal to routing's readTimeoutMs. You could find the routing's readTimeoutMs @ atg\search\routing\SearchEngineService component.               rwTimeout is the  length of time in seconds to wait before a read or write operation times out on an active connection. The number can be decreased to improve performance. However, a value that is too low could cause slow connections to be prematurely closed. (2)  Adjust the number of engine threads to match the number of CPUs available to the engine. Note that the minimal value for maxThreads and maxSpar...

ATG Search Indexing - overview of different steps in search indexing

Read more about the search indexing behind the scene steps @  http://tips4ufromsony.blogspot.in/2011/12/atg-search-indexing-behind-scene-steps.html ATG Search prepares searchable content by indexing the products specified in the XML definition file (/atg/commerce/search/ProductCatalogOutputConfig). Generally there are two types of indexing 1.  Full Indexing  --> all data taken for indexing 2.  Incremental Indexing --> only changed data will be taken for indexing When full indexing is triggered, following happens:    1. The out of box component BulkLoader will call IndexedItemsGroup.getGroupMembers() to load the products to the XHTL document. It prevents uncategorized products from getting indexed. The definition file format begins with a top-level item as a product and includes the properties of parent category and childskus. For each product, the set of Variant Producers configured in ProductCatalogOutputConfig is execute...