Skip to main content

Apache Lucene Search Engine’s Features


Apache Lucene is a high-performance, full featured text search engine library written entirely in Java. It is part of Apache Jakarta Project. Lucene was originally written by Doug Cutting in Java. While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene is Doug Cutting’s wife’s middle name !

Features

1. Scalable, High-Performance Indexing

  • Over 95GB/hour on modern hardware
  • Small RAM requirements — only 1MB heap
  • Incremental indexing as fast as batch indexing
  • Index size roughly 20-30% the size of text indexed


2. Powerful, Accurate and Efficient Search Algorithms

  • Ranked searching — best results returned first
  • Sorting by any field
  • Multiple-index searching with merged results
  • Allows simultaneous update and searching


3. Flexible Queries

  • Phrase queries –>  like “star wars” –> search for the full word star wars.
  • Wildcard queries  –> like star* or  sta?  –> search for a single character or multi character replacements for the search words
  • Fuzzy queries  –> like star~0.8  –> search for the similar words with some weightage
  • Proximity queries  –> like  ”star wars”~10 –> search for a “star” and “wars” within 10 words of each other in a document
  • Range queries  –>  like {star-stun}  –>  search for documents in between star and stun. Exclusive queries are denoted by curly brackets
  • Fielded searching   –>  fields like  title, author, contents
  • Date-range searching   –> like [2006-2007]  –>  search for documents with field value in between 2006 and 2007. Inclusive queries are denoted by square brackets
  • Boolean Operators  –>  like star AND wars . The OR operator is the default conjunction operator.
  • Boosting a Term –>  like star^4  wars –> make documents with term star more relevant
  • + Operator  –>  like +star wars –>  search for documents that must contain “star” and may contain “wars”
  • - Operator  –>  like star -wars –>  search for documents that contain “star” and not contains “wars”
  • Grouping –>  like (star AND wars) OR website –>  using parentheses to group clauses to form sub queries
  • Escape special character –>  The current list special characters are   + – && || ! ( ) { } [ ] ^ ” ~ * ? : \  . To escape these character use the \ before the character.


4. Cross-Platform Solution

  • Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
  • 100%-pure Java
  • Implementations in other programming languages available that are index-compatible


At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.
Index  --> sequence of documents ( Directory)
Document  -->  sequence of fields
Field  --> named sequence of terms
Term  --> a text string (e.g., a word)
Terms:
A search query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases. A Single Term is a single word such as "test" or "hello". A Phrase is a group of words surrounded by double quotes such as "hello dolly". Multiple terms can be combined together with Boolean operators to form a more complex query.

Fields:
When performing a search you can either specify a field, or use the default field. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.

Comments

Popular posts from this blog

Google Chrome shortcut keys

If you are a Google Chromey guy, please find below the list of shortcut keys for some of the most used features  :-) Find more shortcut keys @  http://www.google.com/support/chrome/bin/static.py?page=guide.cs&guide=25799&topic=28650

Tax Credit Statement ( Form 26AS )

Read more about Tax filing @  http://tips4ufromsony.blogspot.com/2011/07/income-tax-process-and-e-filing.html  . Form 26AS is a consolidated tax statement issued under Rule 31 AB of Income Tax Rules to PAN holders. This statement, with respect to a financial year, will include details of: Tax deducted at source (TDS). Tax collected at source (TCS). Advance tax/self assessment tax/regular assessment tax etc., deposited in the bank by the taxpayers (PAN holders). Paid refund received during the financial year. Form 26AS will be prepared only with respect to Financial Year 05-06 onwards. To view the Form26AS , log-in to https://incometaxindiaefiling.gov.in and click on ‘View Tax Credit Statement (From 26AS)’  in ‘My Account’. Read more about Form 26AS  @ http://www.incometaxindia.gov.in/26ASTaxCreditStatement.asp http://www.tin-nsdl.com/form26as.asp

ATG - more about Forms and Form Handlers

An ATG form is defined by the dsp:form tag, which typically encloses DSP tags that specify form elements, such as dsp:input that provide direct access to Nucleus component properties. Find below a sample dsp:form tag.    <dsp:form action="/testPages/showPersonProperties.jsp" method="post" target="_top">      <p>Name: <dsp:input bean="/samples/Person.name" type="text"/>      <p>Age: <dsp:input bean="/samples/Person.age" type="text" value="30"/>      <p><dsp:input type="submit" bean="/samples/Person.submit"/> value="Click to submit"/>    </dsp:form>   When the user submits the form, the /samples/Person.name property is set to the value entered in the input field.Unlike standard HTML, which requires the name attribute for most input tags; the name attribute is optional for DSP form element tags. If an input tag omits the n...

JBoss - how to take thread dumps

Following are the different options in JBoss to take the thread dump. 1. Using JBoss jmx-console :    Got to http://localhost:8080/jmx-console  and search for the mbean " serverInfo " and click on that link. Click the  invoke mbean operation under listThreaddump() method. This will give you the current snapshot of threads which are running in your JBoss jvm. 2. By using twiddle :   In the commnad promt go to <JBOSS-HOME>/bin and run the command  twiddle.bat invoke "jboss.system:type=ServerInfo" listThreadDump > threadDump.html  and you could find the threadDump.html with the JBoss thread dump. 3. Using "Interrupt" signal :  Use kill -3 <process-id> to generate thread dump. You will find the thread dumps in server logs.

ATG Search architectural flow : Search and Index

I would like to explain the high level ATG Search implementation architecture ( for an online store) through the above diagram. In this diagram 1.x denotes the search functionality and 2.x denotes the indexing functionality. I have given JBoss as the application server. Physical Boxes and Application Servers in the diagram ( as recommended by ATG )  : Estore ( Commerce ) Box --> The box with the estore/site ear (with the site JSPs and Java codes). Search Engine Box --> The box with the search engine application running. Indexing Engine Box --> The box with the indexing engine application running. CA (Content Administration) Box --> The box with the ATG CA ear ( where we could take CA -BCC - Search Administration and configure the search projects) . Search Indexer Box --> The box with the ATG Search Index ear ( to fetch the index data from repository). Note that the engine performing indexing will need access ...