Nutch is an effort to build an open source web search engine based on Lucene Java for the search and index component.
Features
Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
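The plug-in idea for media-type parsing can be illustrated with a small registry that maps a media type to a parser function. This is only a sketch of the general pattern, written in Python for brevity. It is not Nutch's actual plug-in API, and the names `PARSERS`, `register_parser` and `parse_document` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical registry: media type -> parser function (not Nutch's real API).
PARSERS = {}


def register_parser(media_type):
    """Decorator that registers a parser plug-in for one media type."""
    def decorator(func):
        PARSERS[media_type] = func
        return func
    return decorator


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


@register_parser("text/plain")
def parse_plain(data):
    return data.decode("utf-8")


@register_parser("text/html")
def parse_html(data):
    collector = _TextCollector()
    collector.feed(data.decode("utf-8"))
    collector.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(" ".join(collector.chunks).split())


def parse_document(media_type, data):
    """Dispatch to whichever plug-in is registered for the media type."""
    try:
        parser = PARSERS[media_type]
    except KeyError:
        raise ValueError(f"no parser plug-in for {media_type}")
    return parser(data)
```

Supporting a new media type then means registering one more function, without touching the dispatch code.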
The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
History
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the multimachine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities have since been spun out into their own subproject, called Hadoop.
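As a rough illustration of why MapReduce suits indexing, the sketch below builds an inverted index (term to list of pages) in a map, group and reduce style. It runs in a single process and uses neither Nutch nor Hadoop. The function names and the sample pages are assumptions made for the example.

```python
from collections import defaultdict


def map_page(url, text):
    """Map step: emit one (term, url) pair per distinct term on a page."""
    for term in set(text.lower().split()):
        yield term, url


def reduce_term(term, urls):
    """Reduce step: merge all URLs emitted for one term into a sorted list."""
    return term, sorted(urls)


def run_mapreduce(pages):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for url, text in pages.items():
        for term, value in map_page(url, text):
            groups[term].append(value)
    return dict(reduce_term(term, urls) for term, urls in groups.items())


if __name__ == "__main__":
    sample_pages = {
        "a.html": "open source search",
        "b.html": "web search",
    }
    print(run_mapreduce(sample_pages))
```

Because each call to `map_page` looks at a single page and each call to `reduce_term` looks at a single term, a real framework can spread both steps across many machines.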
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.
Advantages
Some of the advantages of Nutch when compared to a simple fetcher:
- highly scalable and relatively feature-rich crawler
- politeness features, such as obeying robots.txt rules
- robust and scalable - you can run Nutch on a cluster of 100 machines
- quality - you can bias the crawling to fetch "important" pages first
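Two of the points above, politeness and fetching "important" pages first, can be sketched with Python's standard library. The example filters candidate URLs through robots.txt rules and then orders the rest by a score. The scores, the agent name `example-bot` and the function `crawl_order` are hypothetical, and how Nutch itself scores pages is not shown here.

```python
import heapq
from urllib.robotparser import RobotFileParser

# Example rules; a real crawler would download each site's own robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl_order(scored_urls, robots_txt, agent="example-bot"):
    """Return the allowed URLs, highest score first.

    scored_urls is an iterable of (url, score) pairs; the scores are
    assumed to come from some importance estimate computed elsewhere.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [
        (-score, url)
        for url, score in scored_urls
        if rules.can_fetch(agent, url)  # politeness: skip disallowed paths
    ]
    heapq.heapify(heap)

    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order
```

A URL under `/private/` is dropped however high its score, because the robots.txt check runs before any ordering.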
Scalability
IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. It found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.
The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.
Related projects
- Hadoop - Java framework that supports distributed applications running on large clusters
- nutchWAX - Uses Nutch to search a web archive
- Sixearch - An unstructured peer network application, which provides a complementary way for users to actively and collaboratively share their own document collections.
Search engines built with Nutch
- Creative Commons Search - launched 2004, Nutch implementation replaced 2006
- DiscoverEd - Open educational resources search prototype developed by Creative Commons
- Krugle
- mozDex
- Wikia Search - launched 2008, closed down 2009
- search2.net
See also
- Faceted Search
- Information Extraction
- Enterprise Search
External links