Skip to main content

Lucene, sample JAVA code to Search an indexed file folder


Please find below the Lucene sample JAVA code to search the files inside a folder. This code will search the indexed folder for a search query in an indexed field.

This java code is expecting the index path ( where the index files were created ) , field which need to be searched and the query need be searched as program arguments like  "java SearchFiles [-index dir] [-field f] [-query string]" .


import java.io.File;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchFiles {

public static void main(String[] args){
try{
String usage = "Usage: SearchFiles [-index dir] [-field f] [-query string] \n\n";
String index = "index";
String field = "contents";
String queryString = null;
int hitsPerPage = 100;
for (int i = 0; i < args.length; i++) {
if ("-index".equals(args[i])) {
index = args[i + 1];
i++;
} else if ("-field".equals(args[i])) {
field = args[i + 1];
i++;
} else if ("-query".equals(args[i])) {
queryString = args[i + 1];
i++;
}
}
new SearchFiles().searchFiles(index,field,queryString,hitsPerPage);
}catch (Exception e) {
e.printStackTrace();
}
}

public ArrayList<String> searchFiles(String index,String field,
String queryString,int hitsPerPage) throws Exception {

ArrayList<String> returnStringList = new ArrayList<String>();

IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File(index)));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
QueryParser parser = new QueryParser(Version.LUCENE_31, field, analyzer);
Query query = parser.parse(queryString);
int numberOfPages = 5;

TopDocs results = searcher.search(query, numberOfPages * hitsPerPage);

ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
System.out.println(numTotalHits + " total matching documents");
returnStringList.add(numTotalHits + " total matching documents");

int start = 0;
int end = Math.min(numTotalHits, hitsPerPage);
for (int i = start; i < end; i++) {
Document doc = searcher.doc(hits[i].doc);
String path = doc.get("path");
if (path != null) {
System.out.println((i + 1) + ". " + path);
returnStringList.add((i + 1) + ". " + path);
String title = doc.get("title");
if (title != null) {
System.out.println("   Title: " + doc.get("title"));
returnStringList.add("   Title: " + doc.get("title"));
}
}
}
searcher.close();
return returnStringList;
}
}

Comments

Popular posts from this blog

Mozilla FireFox - how to add security certificate exception urls

If you visit a web site with a secure connection(https) and if the website's security certificate has some problem like the security certificate presented by the website was not issued by a trusted certificate authority  or  the security certificate has expired or is not yet valid, you will get an error page ( like below in IE ) with an option to continue to this website. When this non secure page is loading, in Internet Explorer, you will get an option to specify whether you need to download the page content files like JS, CSS, Images,... But you will not get such an option in FireFox/Chrome and could see only the text data in this new non secure page. If you are doing some local development with FireFox/Chrome and have such a situation, you might want to override this security restriction. FireFox provides some exception url list in secure certificates menu. Go to Options - Advanced - Encryption - View Certificates and click the exception list and...

ATG Search - search engine tuning settings

In this blog, I am going to list the best tuning settings for ATG Search engine. The AESoapConfig.xml, AESoapWaspConfig.xml  and AEConfig.xml are the xmls referred below and you can find it @  <ATG_DIR>\<Searchx.x>\SearchEngine\<operating_system>\bin\ folder. (1)  Make sure that the AESoapConfig.xml's rwTimeout is less than or equal to routing's readTimeoutMs. You could find the routing's readTimeoutMs @ atg\search\routing\SearchEngineService component.               rwTimeout is the  length of time in seconds to wait before a read or write operation times out on an active connection. The number can be decreased to improve performance. However, a value that is too low could cause slow connections to be prematurely closed. (2)  Adjust the number of engine threads to match the number of CPUs available to the engine. Note that the minimal value for maxThreads and maxSpar...

ATG Search architectural flow : Search and Index

I would like to explain the high level ATG Search implementation architecture ( for an online store) through the above diagram. In this diagram 1.x denotes the search functionality and 2.x denotes the indexing functionality. I have given JBoss as the application server. Physical Boxes and Application Servers in the diagram ( as recommended by ATG )  : Estore ( Commerce ) Box --> The box with the estore/site ear (with the site JSPs and Java codes). Search Engine Box --> The box with the search engine application running. Indexing Engine Box --> The box with the indexing engine application running. CA (Content Administration) Box --> The box with the ATG CA ear ( where we could take CA -BCC - Search Administration and configure the search projects) . Search Indexer Box --> The box with the ATG Search Index ear ( to fetch the index data from repository). Note that the engine performing indexing will need access ...

Good features of Eclipse 3.6 (Eclipse Helios) JDT

Read the Eclipse Galileo features @  http://tips4ufromsony.blogspot.com/2011/10/good-features-of-eclipse35-eclipse.html New options in Open Resource dialog : The Open Resource dialog supports three new features: • Path patterns: If the pattern contains a /, the part before the last / is used to match a path in the workspace: • Relative paths: For example, "./T" matches all files starting with T in the folder of the active editor or selection: • Closer items on top: If the pattern matches many files with the same name, the files that are closer to the currently edited or selected resource are shown on top of the matching items list. MarketPlace :  Searching and adding new plugins for Eclipse have always been a challenge. The Eclipse Marketplace makes this much easier – it allows you to not only search a central location of all Eclipse plugins, but also allows you to find the most recent and the most popular plugins. Fix multiple proble...

GC Log Analyzer from IBM

This blog is about the GC( garbage collection) log analyzer from IBM. If you have a GC log and you want to analyze the file,  this IBM tool will help you with some graphical analyzer and some recommendations. You can download it from the following URL : https://www.ibm.com/support/pages/ibm-pattern-modeling-and-analysis-tool-java-garbage-collector-pmat Please find below some screenshots and details that might help you. 1. Screenshot 1: 2. Screenshot 2 : 3. Screenshot 3 : 4. Screenshot 4: