Nutch

Overview
Nutch is an effort to build an open source web search engine based on Lucene Java for the search and index component.
Features


Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
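
As a rough illustration of what a plug-in point for media-type parsing can look like, the sketch below maps a media type to a parser implementation. All names here (ParserPluginSketch, ContentParser, ParserRegistry) are hypothetical and chosen for this example; they are not Nutch's actual extension API, and the HTML handling is a toy.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a plug-in registry; these names are NOT Nutch's real API.
class ParserPluginSketch {

    /** A plug-in that turns raw content of one media type into plain text. */
    interface ContentParser {
        String parse(String rawContent);
    }

    /** Maps a media type such as "text/html" to the plug-in that handles it. */
    static final class ParserRegistry {
        private final Map<String, ContentParser> parsers = new HashMap<>();

        void register(String mediaType, ContentParser parser) {
            parsers.put(mediaType.toLowerCase(Locale.ROOT), parser);
        }

        /** Returns the parsed text, or empty if no plug-in handles the media type. */
        Optional<String> parse(String mediaType, String rawContent) {
            ContentParser parser = parsers.get(mediaType.toLowerCase(Locale.ROOT));
            return parser == null ? Optional.empty() : Optional.of(parser.parse(rawContent));
        }
    }

    /** Builds a registry with two toy plug-ins. */
    static ParserRegistry defaultRegistry() {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", raw -> raw.trim());
        // Naive tag stripping, for illustration only; not a real HTML parser.
        registry.register("text/html",
                raw -> raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim());
        return registry;
    }

    public static void main(String[] args) {
        ParserRegistry registry = defaultRegistry();
        System.out.println(registry.parse("text/html", "<p>Hello <b>Nutch</b></p>"));
        System.out.println(registry.parse("application/pdf", "%PDF-1.4"));
    }
}
```

Supporting a new media type then only requires registering another ContentParser; the code that calls the registry does not change.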

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History


Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multimachine processing needs of the crawl and index tasks, the Nutch project has also implemented a MapReduce facility and a distributed file system. The two facilities have been spun out into their own subproject, called Hadoop.
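
For readers unfamiliar with the MapReduce model mentioned above, the following is a minimal single-process sketch of its two phases, applied to counting inbound links per page. It is an illustration only, assuming a small in-memory link graph; the class MapReduceSketch and its methods are hypothetical and are not Hadoop or Nutch code. A real MapReduce facility runs these phases across many machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the map and reduce phases; not Hadoop or Nutch code.
class MapReduceSketch {

    /** Map phase: emit a (target, 1) pair for every outgoing link of every page. */
    static List<Map.Entry<String, Integer>> map(Map<String, List<String>> outlinksByPage) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (List<String> outlinks : outlinksByPage.values()) {
            for (String target : outlinks) {
                pairs.add(new AbstractMap.SimpleEntry<>(target, 1));
            }
        }
        return pairs;
    }

    /** Reduce phase: group the pairs by key and sum the values for each key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> inlinkCounts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            inlinkCounts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return inlinkCounts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", List.of("b", "c"));
        links.put("b", List.of("c"));
        System.out.println(reduce(map(links)));
    }
}
```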

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been considered an independent, top-level project of the Apache Software Foundation.

Advantages


Some of the advantages of Nutch, when compared to a simple fetcher:
  • a highly scalable and relatively feature-rich crawler
  • politeness features, such as obeying robots.txt rules
  • robust and scalable - you can run Nutch on a cluster of 100 machines
  • quality - you can bias the crawling to fetch “important” pages first
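
Two of the points above, politeness and biased crawling, can be sketched in a few lines. The example below is a simplified illustration, not Nutch's implementation: it assumes robots.txt has already been reduced to a list of Disallow path prefixes (ignoring User-agent groups, Allow lines and wildcards), and it assumes each candidate page already has an importance score. The names CrawlFrontierSketch, Candidate, isAllowed and fetchOrder are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; class and method names are illustrative, not Nutch's API.
class CrawlFrontierSketch {

    /** A URL path waiting to be fetched, with an importance score (higher is fetched sooner). */
    static final class Candidate {
        final String path;
        final double score;

        Candidate(String path, double score) {
            this.path = path;
            this.score = score;
        }
    }

    /** True unless the path starts with one of the Disallow prefixes taken from robots.txt. */
    static boolean isAllowed(String path, List<String> disallowedPrefixes) {
        for (String prefix : disallowedPrefixes) {
            // An empty Disallow value places no restriction, so it is skipped.
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    /** Returns the allowed paths in fetch order: highest score first. */
    static List<String> fetchOrder(List<Candidate> candidates, List<String> disallowedPrefixes) {
        PriorityQueue<Candidate> queue = new PriorityQueue<>(
                Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        for (Candidate c : candidates) {
            if (isAllowed(c.path, disallowedPrefixes)) {
                queue.add(c);
            }
        }
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().path);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("/a", 0.2),
                new Candidate("/private/x", 0.9),
                new Candidate("/b", 0.7));
        System.out.println(fetchOrder(candidates, List.of("/private/")));
    }
}
```

With the sample data in main, /private/x is skipped despite its high score, and /b is fetched before /a.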

Scalability


IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.

The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.

Related projects

  • Hadoop - Java framework that supports distributed applications running on large clusters
  • nutchWAX - Uses Nutch to search a web archive
  • Sixearch - An unstructured peer network application, which provides a complementary way for users to actively and collaboratively share their own document collections.

Search engines built with Nutch

  • Creative Commons Search - launched 2004, Nutch implementation replaced 2006
  • DiscoverEd - Open educational resources search prototype developed by Creative Commons
  • Krugle
  • mozDex
  • Wikia Search - launched 2008, closed down 2009
  • search2.net

See also


  • Faceted Search
  • Information Extraction
  • Enterprise Search
