Lucene

Apache Lucene is a free/open source information retrieval software library, originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache License.

Lucene has been ported to other programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP.

History

Lucene was originally written by Doug Cutting. It was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of open source Java products in September 2001 and became its own top-level Apache project in February 2005. Until recently, it included a number of sub-projects, such as Lucene Java, Droids, Lucene.Net, Lucy, Mahout, Solr, Nutch, Open Relevance Project, PyLucene and Tika. Solr has been merged into the Lucene project itself, and Mahout, Nutch and Tika have been moved to be independent top-level projects.

Features and common use

While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching.

At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDF, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.
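The document-and-fields idea can be illustrated with a toy inverted index. The sketch below is a minimal illustration in plain Python and is not Lucene's API: the class name TinyIndex, its methods, and the whitespace tokenization are all assumptions made for the example. It shows only that once text has been extracted into named fields, the index no longer needs to know which file format the text came from.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index over documents made of named text fields.

    Hypothetical illustration only; this is not Lucene's API.
    """

    def __init__(self):
        # (field, term) -> set of ids of documents with that term in that field
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, doc):
        """Index a document given as a dict of field name -> text."""
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, text in doc.items():
            # Assumed tokenization: lowercase, split on whitespace
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return sorted ids of documents whose field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))


index = TinyIndex()
index.add({"title": "Lucene in brief", "body": "text extracted from a PDF"})
index.add({"title": "Search notes", "body": "text extracted from an HTML page"})

print(index.search("body", "pdf"))    # [0]
print(index.search("body", "text"))   # [0, 1]
print(index.search("title", "pdf"))   # []
```

Both documents are indexed the same way even though their text came from different source formats; the format only matters to whatever code produced the field text.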

Lucene-based projects

Lucene itself is just an indexing and search library and does not contain crawling or HTML parsing functionality. However, several projects extend Lucene's capability:
  • Apache Nutch – provides web crawling and HTML parsing
  • Apache Solr – an enterprise search server
  • ElasticSearch – an enterprise search server
  • Compass – a Java Search Engine Framework
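Because HTML parsing sits outside the library, an application has to turn markup into plain text before handing it to an indexer. The following is a minimal sketch of that pre-processing step using only Python's standard html.parser module; the names TextExtractor and html_to_text are hypothetical, and the code is not taken from Nutch or any of the projects listed above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Hello <b>search</b> world</p></body></html>"
)
print(html_to_text(page))  # Demo Hello search world
```

The resulting string is what an application would pass on as a field value for indexing; a real crawler also has to handle fetching, encodings, and malformed markup, which this sketch leaves out.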

Users

For a list of companies that use Lucene (rather than extend it), see Lucene's PoweredBy page. For example, Twitter uses Lucene for its real-time search.

See also

  • Lucene.net
  • Hadoop
  • Hibernate search
  • Xapian
  • Sphinx (search engine)
  • LGTE
  • Information extraction
  • Text mining


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 