Apache Lucene Search Engine’s Features


Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It was originally written by Doug Cutting, joined the Apache Jakarta Project, and later became a top-level Apache project. While suitable for any application that requires full-text indexing and searching, Lucene is widely recognized for its use in Internet search engines and in local, single-site search. The name Lucene is the middle name of Doug Cutting's wife.

Features

1. Scalable, High-Performance Indexing

  • Over 95GB/hour on modern hardware
  • Small RAM requirements — only 1MB heap
  • Incremental indexing as fast as batch indexing
  • Index size roughly 20-30% the size of text indexed


2. Powerful, Accurate and Efficient Search Algorithms

  • Ranked searching — best results returned first
  • Sorting by any field
  • Multiple-index searching with merged results
  • Allows simultaneous update and searching


3. Flexible Queries

  • Phrase queries --> like "star wars" --> search for the exact phrase star wars
  • Wildcard queries --> like sta? or star* --> ? matches exactly one character and * matches zero or more characters
  • Fuzzy queries --> like star~0.8 --> search for terms spelled similarly to star; the value after ~ (here 0.8, on a 0 to 1 scale) sets how close a match must be
  • Proximity queries --> like "star wars"~10 --> search for "star" and "wars" within 10 words of each other in a document
  • Range queries --> like {star TO stun} --> search for documents whose field value lies between star and stun, excluding both bounds. Exclusive ranges are denoted by curly brackets
  • Fielded searching --> like title:star --> search within a named field such as title, author or contents
  • Date-range searching --> like [2006 TO 2007] --> search for documents whose field value lies between 2006 and 2007, including both bounds. Inclusive ranges are denoted by square brackets
  • Boolean operators --> like star AND wars --> search for documents that contain both terms. OR is the default operator when none is written between two terms
  • Boosting a term --> like star^4 wars --> make documents containing the term star more relevant
  • + operator --> like +star wars --> search for documents that must contain "star" and may contain "wars"
  • - operator --> like star -wars --> search for documents that contain "star" but do not contain "wars"
  • Grouping --> like (star AND wars) OR website --> use parentheses to group clauses into sub-queries
  • Escaping special characters --> the current special characters are + - && || ! ( ) { } [ ] ^ " ~ * ? : \ . To escape one of these characters, put a \ before it
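
The escaping rule in the last bullet can be sketched in plain Java. This is only an illustration based on the character list above: QuerySyntaxEscaper is a hypothetical helper name, not a Lucene class, and it escapes every & and | individually so that && and || are covered as well. Lucene's QueryParser offers its own static escape method, which is the better choice in a real application.

```java
// Minimal sketch: put a backslash before each query-syntax special character.
// QuerySyntaxEscaper is a hypothetical helper, not part of Lucene.
final class QuerySyntaxEscaper {
    // Characters taken from the list above; '&' and '|' are escaped
    // one at a time, which also covers the && and || operators.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\'); // prefix the special character with a backslash
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("(1+1):2")); // prints \(1\+1\)\:2
    }
}
```

Escaping user input this way makes the text match literally instead of being interpreted as query operators.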


4. Cross-Platform Solution

  • Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
  • 100%-pure Java
  • Implementations in other programming languages available that are index-compatible


At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.

Index --> sequence of documents (Directory)
Document --> sequence of fields
Field --> named sequence of terms
Term --> a text string (e.g., a word)
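
To make the Index / Document / Field / Term hierarchy concrete, here is a toy in-memory inverted index in plain Java. It is a simplified sketch, not how Lucene is implemented: ToyIndex, addDocument and search are hypothetical names, a document is just a map of field name to text, and terms are produced by lower-casing and splitting on non-word characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model: an index is a sequence of documents, a document is a set of
// named fields, and each field is broken into terms.
final class ToyIndex {
    // Maps "field:term" to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Adds a document given as field name -> field text and returns its id.
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String[] tokens = field.getValue().toLowerCase(Locale.ROOT).split("\\W+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // split can yield a leading empty token
                }
                String key = field.getKey() + ":" + token;
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of documents whose given field contains the term.
    Set<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.<Integer>emptySet());
    }
}
```

Adding two documents with the titles "Star Wars" and "Star Trek" and then calling search("title", "star") would return both document ids, while search("title", "wars") would return only the first.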
Terms:
A search query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases. A Single Term is a single word such as "test" or "hello". A Phrase is a group of words surrounded by double quotes such as "hello dolly". Multiple terms can be combined together with Boolean operators to form a more complex query.
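
The split into single terms, phrases and operators can be illustrated with a small tokenizer in plain Java. This is a rough sketch under simplifying assumptions, not Lucene's query parser: QueryTermSplitter is a hypothetical name, only AND, OR and NOT are treated as operators, and fields, escapes and grouping are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: classify the pieces of a query as single terms, phrases or operators.
final class QueryTermSplitter {
    // Either a double-quoted phrase or a run of non-whitespace characters.
    private static final Pattern PIECE = Pattern.compile("\"([^\"]*)\"|(\\S+)");
    private static final List<String> OPERATORS = Arrays.asList("AND", "OR", "NOT");

    static List<String> split(String query) {
        List<String> pieces = new ArrayList<>();
        Matcher m = PIECE.matcher(query);
        while (m.find()) {
            if (m.group(1) != null) {
                pieces.add("PHRASE(" + m.group(1) + ")");
            } else if (OPERATORS.contains(m.group(2))) {
                pieces.add("OPERATOR(" + m.group(2) + ")");
            } else {
                pieces.add("TERM(" + m.group(2) + ")");
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        // prints [TERM(test), OPERATOR(AND), PHRASE(hello dolly)]
        System.out.println(split("test AND \"hello dolly\""));
    }
}
```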

Fields:
When performing a search you can either specify a field, or use the default field. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
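
The field-name-then-colon rule can be sketched as follows. FieldedTerm and its parse method are hypothetical names used only for illustration, and the sketch assumes a single unquoted clause with no escaped colons; when no field name is given, the caller-supplied default field is used.

```java
// Sketch: resolve a clause such as title:star to a (field, text) pair,
// falling back to a default field when no field name is given.
final class FieldedTerm {
    final String field;
    final String text;

    FieldedTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    static FieldedTerm parse(String clause, String defaultField) {
        int colon = clause.indexOf(':');
        if (colon <= 0) {
            // No field name before a colon: search the default field.
            return new FieldedTerm(defaultField, clause);
        }
        return new FieldedTerm(clause.substring(0, colon), clause.substring(colon + 1));
    }

    public static void main(String[] args) {
        FieldedTerm explicit = parse("title:star", "contents");
        FieldedTerm implicit = parse("wars", "contents");
        System.out.println(explicit.field + " -> " + explicit.text); // prints title -> star
        System.out.println(implicit.field + " -> " + implicit.text); // prints contents -> wars
    }
}
```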
