Apache Lucene Search Engine’s Features

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It was originally written by Doug Cutting and began its Apache life as part of the Apache Jakarta Project; it has since become a top-level Apache project. While suitable for any application that requires full-text indexing and searching, Lucene is widely recognized for its use in Internet search engines and in local, single-site search. The name Lucene is the middle name of Doug Cutting’s wife!

Features

1. Scalable, High-Performance Indexing

Over 95GB/hour on modern hardware
Small RAM requirements — only 1MB heap
Incremental indexing as fast as batch indexing
Index size roughly 20-30% the size of text indexed

2. Powerful, Accurate and Efficient Search Algorithms

Ranked searching — best results returned first
Sorting by any field
Multiple-index searching with merged results
Allows simultaneous update and searching

3. Flexible Queries

Phrase queries --> like "star wars" --> search for the exact phrase star wars
Wildcard queries --> like star* or sta? --> * matches any run of characters (including none) and ? matches exactly one character
Fuzzy queries --> like star~0.8 --> search for terms spelled similarly to star; the optional value after ~ tunes how close the match must be
Proximity queries --> like "star wars"~10 --> search for "star" and "wars" within 10 words of each other in a document
Range queries --> like {star TO stun} --> search for documents whose field value lies between star and stun, excluding both endpoints. Exclusive ranges are denoted by curly brackets
Fielded searching --> like title:star --> restrict a term to a named field such as title, author, or contents
Date-range searching --> like [2006 TO 2007] --> search for documents whose field value lies between 2006 and 2007, including both endpoints. Inclusive ranges are denoted by square brackets
Boolean Operators --> like star AND wars. OR is the default conjunction operator, so star wars is treated as star OR wars
Boosting a Term --> like star^4 wars --> make documents containing the term star more relevant
+ Operator --> like +star wars --> search for documents that must contain "star" and may contain "wars"
- Operator --> like star -wars --> search for documents that contain "star" but do not contain "wars"
Grouping --> like (star AND wars) OR website --> use parentheses to group clauses into sub-queries
Escape special characters --> The current list of special characters is + - && || ! ( ) { } [ ] ^ " ~ * ? : \ . To escape one of these characters, put a \ before it.
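
The escaping rule for the special characters listed above can be sketched in a few lines of plain Java. Lucene's classic query parser ships its own static escape method for this purpose; the class below is only a standalone stand-in with a hypothetical name (QuerySyntaxEscaper) so the rule can be tried without Lucene on the classpath. It assumes that escaping every single & and | is an acceptable way to cover && and ||.

```java
// Standalone sketch (not Lucene code): backslash-escapes query-syntax characters.
final class QuerySyntaxEscaper {

    // Single characters taken from the list above. The two-character operators
    // && and || are covered because each & and | is escaped on its own.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    private QuerySyntaxEscaper() {
    }

    // Returns a copy of input with a backslash placed before every special character.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

For example, QuerySyntaxEscaper.escape("(1+1):2") returns \(1\+1\)\:2, which the query parser would then treat as literal text rather than as grouping, a required-term operator, and a field separator.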

4. Cross-Platform Solution

Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
100%-pure Java
Implementations in other programming languages available that are index-compatible

At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.

Index --> sequence of documents (Directory)

Document --> sequence of fields

Field --> named sequence of terms

Term --> a text string (e.g., a word)
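
To make the Index, Document, Field, Term hierarchy concrete, here is a toy in-memory model written with plain Java collections. It is not Lucene's API or its on-disk format: the class name ToyIndex, the lowercase-and-split tokenization, and the "field:term" key layout are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the hierarchy above (illustrative only, not Lucene code).
final class ToyIndex {

    // Index --> sequence of documents; each document maps field name to field text.
    private final List<Map<String, String>> documents = new ArrayList<>();

    // Inverted index: "field:term" --> ids of the documents containing that term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds a document, splits each field into lowercase terms, and returns the new document id.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String text = field.getValue().toLowerCase(Locale.ROOT);
            for (String term : text.split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                            .add(docId);
                }
            }
        }
        return docId;
    }

    // Returns the ids of documents whose given field contains the given term.
    Set<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.emptySet());
    }
}
```

Adding one document with title "Star Wars" and another with title "Star Trek" and then calling search("title", "star") returns both document ids, while search("title", "wars") returns only the first. Keying the postings by field and term together is what lets the same word be searched independently in title, author, or contents.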

Terms:

A search query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases. A Single Term is a single word such as "test" or "hello". A Phrase is a group of words surrounded by double quotes such as "hello dolly". Multiple terms can be combined together with Boolean operators to form a more complex query.

Fields:

When performing a search you can either specify a field, or use the default field. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
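
The field-or-default rule can be sketched as a tiny helper that splits a single clause at its first colon. This is a simplified illustration, not how Lucene's query parser is implemented: the class name FieldedTerm is hypothetical, and the sketch deliberately ignores quoted phrases and escaped colons.

```java
// Minimal sketch (not Lucene code): resolves which field a single query clause targets.
final class FieldedTerm {
    final String field;
    final String term;

    private FieldedTerm(String field, String term) {
        this.field = field;
        this.term = term;
    }

    // Splits "field:term" at the first colon; a clause without a field name
    // falls back to defaultField. Quoted phrases and \: escapes are not handled.
    static FieldedTerm parse(String clause, String defaultField) {
        int colon = clause.indexOf(':');
        if (colon <= 0) {
            return new FieldedTerm(defaultField, clause);
        }
        return new FieldedTerm(clause.substring(0, colon), clause.substring(colon + 1));
    }
}
```

With "contents" as the default field, FieldedTerm.parse("title:star", "contents") targets the title field, while FieldedTerm.parse("star", "contents") targets contents.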

Tips from Sony Thomas
