Skip to main content

Apache Lucene Search Engine’s Features


Apache Lucene is a high-performance, full featured text search engine library written entirely in Java. It is part of Apache Jakarta Project. Lucene was originally written by Doug Cutting in Java. While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene is Doug Cutting’s wife’s middle name !

Features

1. Scalable, High-Performance Indexing

  • Over 95GB/hour on modern hardware
  • Small RAM requirements — only 1MB heap
  • Incremental indexing as fast as batch indexing
  • Index size roughly 20-30% the size of text indexed


2. Powerful, Accurate and Efficient Search Algorithms

  • Ranked searching — best results returned first
  • Sorting by any field
  • Multiple-index searching with merged results
  • Allows simultaneous update and searching


3. Flexible Queries

  • Phrase queries –>  like “star wars” –> search for the full word star wars.
  • Wildcard queries  –> like star* or  sta?  –> search for a single character or multi character replacements for the search words
  • Fuzzy queries  –> like star~0.8  –> search for the similar words with some weightage
  • Proximity queries  –> like  ”star wars”~10 –> search for a “star” and “wars” within 10 words of each other in a document
  • Range queries  –>  like {star-stun}  –>  search for documents in between star and stun. Exclusive queries are denoted by curly brackets
  • Fielded searching   –>  fields like  title, author, contents
  • Date-range searching   –> like [2006-2007]  –>  search for documents with field value in between 2006 and 2007. Inclusive queries are denoted by square brackets
  • Boolean Operators  –>  like star AND wars . The OR operator is the default conjunction operator.
  • Boosting a Term –>  like star^4  wars –> make documents with term star more relevant
  • + Operator  –>  like +star wars –>  search for documents that must contain “star” and may contain “wars”
  • - Operator  –>  like star -wars –>  search for documents that contain “star” and not contains “wars”
  • Grouping –>  like (star AND wars) OR website –>  using parentheses to group clauses to form sub queries
  • Escape special character –>  The current list special characters are   + – && || ! ( ) { } [ ] ^ ” ~ * ? : \  . To escape these character use the \ before the character.


4. Cross-Platform Solution

  • Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
  • 100%-pure Java
  • Implementations in other programming languages available that are index-compatible


At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.
Index  --> sequence of documents ( Directory)
Document  -->  sequence of fields
Field  --> named sequence of terms
Term  --> a text string (e.g., a word)
Terms:
A search query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases. A Single Term is a single word such as "test" or "hello". A Phrase is a group of words surrounded by double quotes such as "hello dolly". Multiple terms can be combined together with Boolean operators to form a more complex query.

Fields:
When performing a search you can either specify a field, or use the default field. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.

Comments

Popular posts from this blog

ATG Search and Search engine activity log

We could use the SearchEngineActivity log files to get the request/response to the search engine from a commerce instance. This folder is located in each commerce instance or the instances from which the call to the search engine is done. The SearchEngineActivity log file folder can be configured @ SearchEngineService component ( /dyn/admin/nucleus/atg/search/routing/SearchEngineService). To get the log files for the search engine calls, you need to specify the SearchEngineService.dumpingRequests as true. Then you need to specify the engineActivityPath as the folder in which you need the SearchEngineActivity logs. Below you could find my SearchEngineActivity log folder. Each successful call to the search engine from the commerce instance will create 5 files in the SearchEngineActivity folder : namely  request, response, search engineinfo, stack trace and response row . Each file name start with a specific sequence. You ...

ATG Product Catalog schema ER diagram

Check out the O rder schema ER-Diagram @   http://tips4ufromsony.blogspot.in/2012/02/atg-order-schema-er-diagram.html Check out the User Profile  schema ER-Diagram @ http://tips4ufromsony.blogspot.in/2012/03/atg-user-profile-schema-er-diagram.html If you would like to know the relationship between different Product Catalog tables, please find below screen shots of  Product Catalog schema ER Diagrams.

Search engine shutdown call from the Estore instnace

 When the estore or commerce instances are restarted the routing system service can send shut down requests to stale engines, this can be caused because of any of the following. 1. Some other instance marked the engine as stopped in the DB (This can be caused because the machine could not reach the host running the engine) 2. The eStore instance is using a different search schema that has data about the search engine host and is marked as stopped in that DB. To avoid this overrride \atg\search\routing\RoutingSystemService.properties by setting cleanUpStrayEng=false.

Jsp and CSS size limits that web developers need to aware

Here I am listing some erroneous cases that might occur in your web development phase, due to some size restrictions. JSP file size limit : You might get some run time exceptions that the JSP file size limit exceeds. Please find below the reason : In JVM the size of a single JAVA method is limited to 64kb. When the jsp file is converted to Servlet, if the jspservice method's size exceeds the 64kb limit, this exception will occur. Keep in mind that this exception depends on the implementation of the JSP translator, means the same JSP code may give an exception in Tomcat and may run successfully in Weblogic due to the the difference in the logic to built the Servlet methods from JSP. The best way to omit this issue is by using dynamic include.For example, if you are using                  <%@ include file="sample.jsp" %> (static include),  replace this to               ...

JBoss - know more about the JBoss directory structure

Fundamentally, the JBoss architecture consists of the JMX MBean server, the microkernel, and a set of pluggable component services - the MBeans. The JBoss Application Server ships with three different server configurations. Within the <JBoss_Home>/server directory, you will find three subdirectories: minimal, default and all. The default configuration is the one used if you don’t specify another one when starting up the server. If you want to know which services are configured in each of these instances, look at the jboss-service.xml file in the <JBoss_Home>/server/<instance-name>/conf/ directory and also the configuration files in the <JBoss_Home>/server/<instance-name>/deploy directory. JBoss 4.0 features an embedded Apache Tomcat 5.5 servlet container. conf --> The conf directory contains the jboss-service.xml bootstrap descriptor file for a given server configuration. This has the jboss-log4j.xml file which configures the Apach...