
Apache Lucene Search Engine’s Features


Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It was originally written by Doug Cutting and started out under the Apache Jakarta Project; it is now a top-level Apache project. While suitable for any application that requires full-text indexing and searching, Lucene is widely recognized for its use in Internet search engines and in local, single-site search. The name Lucene is Doug Cutting’s wife’s middle name!

Features

1. Scalable, High-Performance Indexing

  • Over 95GB/hour on modern hardware
  • Small RAM requirements — only 1MB heap
  • Incremental indexing as fast as batch indexing
  • Index size roughly 20-30% the size of text indexed


2. Powerful, Accurate and Efficient Search Algorithms

  • Ranked searching — best results returned first
  • Sorting by any field
  • Multiple-index searching with merged results
  • Allows simultaneous update and searching


3. Flexible Queries

  • Phrase queries  -->  like "star wars"  -->  search for the exact phrase star wars
  • Wildcard queries  -->  like star* or sta?  -->  ? stands in for exactly one character and * for zero or more characters
  • Fuzzy queries  -->  like star~0.8  -->  search for terms spelled similarly to star; the optional number after ~ tunes how close a match must be (its exact meaning depends on the Lucene version)
  • Proximity queries  -->  like "star wars"~10  -->  search for "star" and "wars" within 10 words of each other in a document
  • Range queries  -->  like {star TO stun}  -->  search for documents whose field value lies between star and stun. Exclusive ranges are denoted by curly brackets
  • Fielded searching  -->  fields like title, author, contents, for example title:star
  • Date-range searching  -->  like [2006 TO 2007]  -->  search for documents whose field value lies between 2006 and 2007. Inclusive ranges are denoted by square brackets
  • Boolean Operators  -->  like star AND wars. The OR operator is the default conjunction operator, so star wars is read as star OR wars.
  • Boosting a Term  -->  like star^4 wars  -->  make documents containing the term star more relevant
  • + Operator  -->  like +star wars  -->  search for documents that must contain "star" and may contain "wars"
  • - Operator  -->  like star -wars  -->  search for documents that contain "star" but not "wars"
  • Grouping  -->  like (star AND wars) OR website  -->  use parentheses to group clauses into sub-queries
  • Escape special characters  -->  The current list of special characters is  + - && || ! ( ) { } [ ] ^ " ~ * ? : \  . To escape one of these characters, put a \ before it.
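
As a rough illustration of the escaping rule in the last bullet, here is a minimal, dependency-free Java sketch. The class name QueryEscapeSketch and its escape method are made up for this post and are not part of Lucene; Lucene's classic query parser has its own QueryParser.escape helper, which you would normally reach for instead.

```java
final class QueryEscapeSketch {
    // Single-character specials from the list above. The two-character
    // operators && and || are covered because every & and | is escaped.
    private static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\";

    /** Returns the input with a backslash placed before each special character. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() * 2);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIALS.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints \(1\+1\)\:2
        System.out.println(escape("(1+1):2"));
    }
}
```

Escaping matters when the search text comes from a user: without it, something like (1+1):2 would be read as grouping, a required term and a field name instead of plain text.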


4. Cross-Platform Solution

  • Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
  • 100%-pure Java
  • Implementations in other programming languages available that are index-compatible


At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.

Index  -->  sequence of documents (Directory)
Document  -->  sequence of fields
Field  -->  named sequence of terms
Term  -->  a text string (e.g., a word)

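
To make the Index / Document / Field / Term hierarchy a bit more concrete, here is a toy Java sketch that uses only the standard library. Everything in it (TinyIndexSketch, addDocument, search, the lowercase-and-split analysis) is a hypothetical stand-in and not Lucene's API. It also scans the documents one by one, whereas Lucene builds an inverted index that maps each term to the documents containing it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TinyIndexSketch {
    // Index: a sequence of documents.
    // Document: field name -> sequence of terms.
    private final List<Map<String, List<String>>> documents = new ArrayList<>();

    /** Adds a document given as field name -> raw text and returns its position. */
    int addDocument(Map<String, String> fields) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            doc.put(field.getKey(), toTerms(field.getValue()));
        }
        documents.add(doc);
        return documents.size() - 1;
    }

    /** Returns the positions of documents whose given field contains the term. */
    List<Integer> search(String field, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            List<String> terms = documents.get(i).get(field);
            if (terms != null && terms.contains(wanted)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Naive analysis (an assumption of this sketch): lowercase the text and
    // split it on anything that is not a letter or a digit.
    private static List<String> toTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

Because a document here is nothing more than named text fields, it does not matter whether the text originally came from a PDF, an HTML page or a Word file.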
Terms:
A search query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases. A Single Term is a single word such as "test" or "hello". A Phrase is a group of words surrounded by double quotes such as "hello dolly". Multiple terms can be combined together with Boolean operators to form a more complex query.
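
The split into single terms, phrases and operators can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Lucene's query parser is implemented; the class TermSplitterSketch and the TERM[...], PHRASE[...] and OPERATOR[...] labels are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TermSplitterSketch {
    // Either a double-quoted phrase (group 1) or a run of non-space characters (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    /** Labels each piece of the query as a phrase, an operator or a single term. */
    static List<String> split(String query) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            if (m.group(1) != null) {
                parts.add("PHRASE[" + m.group(1) + "]");
            } else {
                String word = m.group(2);
                boolean operator = word.equals("AND") || word.equals("OR") || word.equals("NOT");
                parts.add((operator ? "OPERATOR[" : "TERM[") + word + "]");
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [TERM[test], OPERATOR[AND], PHRASE[hello dolly]]
        System.out.println(split("test AND \"hello dolly\""));
    }
}
```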

Fields:
When performing a search you can either specify a field, or use the default field. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
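
Here is a small Java sketch of how a clause such as title:star could be resolved to a field, falling back to the default field when no field name is given. The class FieldedTermSketch and the default field name contents are assumptions made for this example, and the sketch deliberately ignores the harder cases a real parser has to handle (several colons, boosts, ranges and so on).

```java
final class FieldedTermSketch {
    final String field;
    final String text;

    private FieldedTermSketch(String field, String text) {
        this.field = field;
        this.text = text;
    }

    /** Splits "field:term" at the first colon, or uses the default field if there is none. */
    static FieldedTermSketch parse(String clause, String defaultField) {
        // A clause that starts with a quote is a bare phrase, so colons inside it are ignored.
        int colon = clause.startsWith("\"") ? -1 : clause.indexOf(':');
        // An escaped colon (\:) is part of the term, not a field separator.
        boolean escaped = colon > 0 && clause.charAt(colon - 1) == '\\';
        if (colon > 0 && !escaped && colon < clause.length() - 1) {
            return new FieldedTermSketch(clause.substring(0, colon), clause.substring(colon + 1));
        }
        return new FieldedTermSketch(defaultField, clause);
    }

    public static void main(String[] args) {
        FieldedTermSketch a = parse("title:star", "contents");
        FieldedTermSketch b = parse("star", "contents");
        System.out.println(a.field + " / " + a.text); // prints title / star
        System.out.println(b.field + " / " + b.text); // prints contents / star
    }
}
```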
