MapReduce
Encyclopedia
MapReduce is a software framework
Software framework
In computer programming, a software framework is an abstraction in which software providing generic functionality can be selectively changed by user code, thus providing application specific software...

 introduced by Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...

 in 2004 to support distributed computing
Distributed computing
Distributed computing is a field of computer science that studies distributed systems. A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal...

 on large data set
Data set
A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random numbers. Each...

s on clusters
Cluster (computing)
A computer cluster is a group of linked computers, working together closely thus in many respects forming a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks...

 of computers. Parts of the framework are patented
Software patent
Software patent does not have a universally accepted definition. One definition suggested by the Foundation for a Free Information Infrastructure is that a software patent is a "patent on any performance of a computer realised by means of a computer program".In 2005, the European Patent Office...

 in some countries.

The framework is inspired by the map
Map (higher-order function)
In many programming languages, map is the name of a higher-order function that applies a given function to each element of a list, returning a list of results. They are examples of both catamorphisms and anamorphisms...

 and reduce
Fold (higher-order function)
In functional programming, fold – also known variously as reduce, accumulate, compress, or inject – are a family of higher-order functions that analyze a recursive data structure and recombine through use of a given combining operation the results of recursively processing its...

 functions commonly used in functional programming
Functional programming
In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state...

, although their purpose in the MapReduce framework is not the same as their original forms.

MapReduce libraries have been written in C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

, C#, Erlang, Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

, LabVIEW
LabVIEW
LabVIEW is a system design platform and development environment for a visual programming language from National Instruments. LabVIEW provides engineers and scientists with the tools needed to create and deploy measurement and control systems.The graphical language is named "G"...

, OCaml
Objective Caml
OCaml , originally known as Objective Caml, is the main implementation of the Caml programming language, created by Xavier Leroy, Jérôme Vouillon, Damien Doligez, Didier Rémy and others in 1996...

, Perl, Python
Python (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

, PHP, Ruby
Ruby (programming language)
Ruby is a dynamic, reflective, general-purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. Ruby originated in Japan during the mid-1990s and was first developed and designed by Yukihiro "Matz" Matsumoto...

, F#, R
R (programming language)
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software, and R is widely used for statistical software development and data analysis....

 and other programming languages.

Overview

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...

 (structured).

"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree
Tree (data structure)
In computer science, a tree is a widely-used data structure that emulates a hierarchical tree structure with a set of linked nodes.Mathematically, it is an ordered directed tree, more specifically an arborescence: an acyclic connected graph where each node has zero or more children nodes and at...

 structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase - provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm
Server farm
A server farm or server cluster is a collection of computer servers usually maintained by an enterprise to accomplish server needs far beyond the capability of one machine. Server farms often have backup servers, which can take over the function of primary servers in the event of a primary server...

 can use MapReduce to sort a petabyte
Petabyte
A petabyte is a unit of information equal to one quadrillion bytes, or 1000 terabytes. The unit symbol for the petabyte is PB...

 of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.

Logical view

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain
Data domain
In data management and database analysis, a data domain refers to all the unique values which a data element may contain. The rule for determining the domain boundary may be as simple as a data type with an enumerated list of values....

, and returns a list of pairs in a different domain:

Map(k1,v1)list(k2,v2)

The Map function is applied in parallel to every item in the input dataset. This produces a list of (k2,v2) pairs for each call.
After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, thus creating one group for each one of the different generated keys.

The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list (v2))list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.

Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the typical functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.

It is necessary but not sufficient to have implementations of the map and reduce abstractions in order to implement MapReduce. Distributed implementations of MapReduce require a means of connecting the processes performing the Map and Reduce phases. This may be a distributed file system
Distributed file system
Network file system may refer to:* A distributed file system, which is accessed over a computer network* Network File System , a specific brand of distributed file system...

. Other options are possible, such as direct streaming from mappers to reducers, or for the mapping processors to serve up their results to reducers that query them.

Example

The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:

void map(String name, String document):
// name: document name
// document: document contents
for each word w in document:
EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
// word: a word
// partialCounts: a list of aggregated partial counts
int sum = 0;
for each pc in partialCounts:
sum += ParseInt(pc);
Emit(word, AsString(sum));


Here, each document is split into words, and each word is counted initially with a "1" value by the Map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to Reduce, thus this function just needs to sum all of its input values to find the total appearances of that word.

Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
  • an input reader
  • a Map function
  • a partition function
  • a compare function
  • a Reduce function
  • an output writer

Input reader

The input reader divides the input into appropriate size 'splits' (in practice typically 16 MB to 128 MB) and the framework assigns one split to each Map function. The input reader reads data from stable storage (typically a distributed file system
Distributed file system
Network file system may refer to:* A distributed file system, which is accessed over a computer network* Network File System , a specific brand of distributed file system...

) and generates key/value pairs.

A common example will read a directory full of text files and return each line as a record.

Map function

Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

If the application is doing a word count, the map function would break the line into words and output a key/value pair for each word. Each output pair would contain the word as the key and "1" as the value.

Partition function

Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reduce.

A typical default is to hash
Hash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...

 the key and modulo
Modulo operation
In computing, the modulo operation finds the remainder of division of one number by another.Given two positive numbers, and , a modulo n can be thought of as the remainder, on division of a by n...

 the number of reducers. It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load balancing
Load balancing
Load balancing or load distribution may refer to:*Load balancing , balancing a workload amongst multiple computer devices*Load balancing , the storing of excess electrical power by power stations during low demand periods, for release as demand rises*Weight distribution, the apportioning of weight...

 purposes, otherwise the MapReduce operation can be held up waiting for slow reducers to finish.

Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation time depending on network bandwidth, CPU speeds, data produced and time taken by map and reduce computations.

Comparison function

The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function.

Reduce function

The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and output 0 or more values.

In the word count example, the Reduce function takes the input values, sums them and generates a single output of the word and the final sum.

Output writer

The Output Writer writes the output of the Reduce to stable storage, usually a distributed file system
Distributed file system
Network file system may refer to:* A distributed file system, which is accessed over a computer network* Network File System , a specific brand of distributed file system...

.

Distribution and reliability

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System
Google File System
Google File System is a proprietary distributed file system developed by Google Inc. for its own use. It is designed to provide efficient, reliable access to data using large clusters of commodity hardware...

) records the node as dead and sends out the node's assigned work to other nodes. Individual operations use atomic
Atomicity
In database systems, atomicity is one of the ACID transaction properties. In an atomic transaction, a series of database operations either all occur, or nothing occurs...

 operations for naming file outputs as a check to ensure that there are not parallel conflicting threads running. When files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).

The reduce operations operate much the same way. Because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or in the same rack as the node holding the data being operated on. This property is desirable as it conserves bandwidth across the backbone network of the datacenter.

Implementations are not necessarily highly-reliable. For example, in Hadoop
Hadoop
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data...

 the NameNode is a single point of failure
Single point of failure
A single point of failure is a part of a system that, if it fails, will stop the entire system from working. They are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.-Overview:Systems can be made...

 for the distributed filesystem.

Uses

MapReduce is useful in a wide range of applications including: distributed grep
Grep
grep is a command-line text-search utility originally written for Unix. The name comes from the ed command g/re/p...

, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index
Inverted index
In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents...

 construction, document clustering, machine learning
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

, and statistical machine translation
Statistical machine translation
Statistical machine translation is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora...

. Moreover, the MapReduce model has been adapted to several computing environments like multi-core and many-core systems, desktop grids, volunteer computing environments, dynamic cloud environments, and mobile environments.

At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web
World Wide Web
The World Wide Web is a system of interlinked hypertext documents accessed via the Internet...

. It replaced the old ad hoc programs that updated the index and ran the various analyses.

MapReduce's stable inputs and outputs are usually stored in a distributed file system
Distributed file system
Network file system may refer to:* A distributed file system, which is accessed over a computer network* Network File System , a specific brand of distributed file system...

. The transient data is usually stored on local disk and fetched remotely by the reducers.

Criticism

David DeWitt
David DeWitt
David J. DeWitt is the John P. Morgridge Professor of at the University of Wisconsin–Madison. Professor DeWitt received a B.A. degree from Colgate University in 1970, and a Ph.D. from the University of Michigan in 1976...

 and Michael Stonebraker
Michael Stonebraker
Michael Ralph Stonebraker is a computer scientist specializing in database research.Through a series of academic prototypes and commercial startups, Stonebraker's research and products are central to many relational database systems on the market today...

, experts in parallel database
Parallel database
A parallel database system seeks to improve performance through parallelization of various operations, such as loading data, building indexes and evaluating queries. Although data may be stored in a distributed fashion, the distribution is governed solely by performance considerations...

s and shared-nothing architectures, have been critical of the breadth of problems that MapReduce can be used for. They called its interface too low-level and questioned whether it really represents the paradigm shift
Paradigm shift
A Paradigm shift is, according to Thomas Kuhn in his influential book The Structure of Scientific Revolutions , a change in the basic assumptions, or paradigms, within the ruling theory of science...

 its proponents have claimed it is. They challenged the MapReduce proponents' claims of novelty, citing Teradata
Teradata
Teradata Corporation is a vendor specializing in data warehousing and analytic applications. Its products are commonly used by companies to manage data warehouses for analytics and business intelligence purposes. Teradata was formerly a division of NCR Corporation, with the spinoff from NCR on...

 as an example of prior art
Prior art
Prior art , in most systems of patent law, constitutes all information that has been made available to the public in any form before a given date that might be relevant to a patent's claims of originality...

 that has existed for over two decades. They also compared MapReduce programmers to Codasyl
CODASYL
CODASYL is an acronym for "Conference on Data Systems Languages". This was a consortium formed in 1959 to guide the development of a standard programming language that could be used on many computers...

 programmers, noting both are "writing in a low-level language
Low-level programming language
In computer science, a low-level programming language is a programming language that provides little or no abstraction from a computer's instruction set architecture. Generally this refers to either machine code or assembly language...

 performing low-level record manipulation." MapReduce's use of input files and lack of schema
Logical schema
A Logical Schema is a data model of a specific problem domain expressed in terms of a particular data management technology. Without being specific to a particular database management product, it is in terms of either relational tables and columns, object-oriented classes, or XML tags...

 support prevents the performance improvements enabled by common database system features such as B-tree
B-tree
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children...

s and hash partitioning
Partition (database)
A partition is a division of a logical database or its constituting elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons....

, though projects such as Pig (or PigLatin)
Pig (programming language)
Pigis a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS...

, Sawzall
Sawzall (programming language)
Sawzall is a procedural domain-specific programming language, used by Google to process large numbers of individual log records. Sawzall was first described in 2003, and the szl runtime was open-sourced in August 2010...

, Apache Hive
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix...

, HBase
HBase
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS , providing BigTable-like capabilities for Hadoop...

 and BigTable
BigTable
BigTable is a compressed, high performance, and proprietary database system built on Google File System , Chubby Lock Service, SSTable and a few other Google technologies; it is currently not distributed nor is it used outside of Google, although Google offers access to it as part of their Google...

 are addressing some of these problems.

Another article, by Greg Jorgensen, rejects these views. Jorgensen asserts that DeWitt and Stonebraker's entire analysis is groundless as MapReduce was never designed nor intended to be used as a database.

DeWitt and Stonebraker have subsequently published a detailed benchmark study in 2009 comparing performance of Hadoop's
Hadoop
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data...

 MapReduce and RDBMS approaches on several specific problems. They concluded that databases offer real advantages for many kinds of data use, especially on complex processing or where the data is used across an enterprise, but that MapReduce may be easier for users to adopt for simple or one-time processing tasks. They have published the data and code used in their study to allow other researchers to do comparable studies.

Google has been granted a patent on MapReduce. However, there have been claims that this patent should not have been granted because MapReduce is too similar to existing products. For example, map and reduce functionality can be very easily implemented in Oracle's
Oracle database
The Oracle Database is an object-relational database management system produced and marketed by Oracle Corporation....

 PL/SQL
PL/SQL
PL/SQL is Oracle Corporation's procedural extension language for SQL and the Oracle relational database...

 database oriented language.

Conferences and users groups


See also

  • Hadoop
    Hadoop
    Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data...

    , Apache
    Apache Software Foundation
    The Apache Software Foundation is a non-profit corporation to support Apache software projects, including the Apache HTTP Server. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.The Apache Software Foundation is a decentralized community of developers...

    's free and open source implementation of MapReduce.

External links

Papers
  • "A Hierarchical Framework for Cross-Domain MapReduce Execution" — paper by Yuan Luo, Zhenhua Guo, Yiming Sun, Beth Plale, Judy Qiu; from Indiana University
    Indiana University
    Indiana University is a multi-campus public university system in the state of Indiana, United States. Indiana University has a combined student body of more than 100,000 students, including approximately 42,000 students enrolled at the Indiana University Bloomington campus and approximately 37,000...

     and Wilfred Li; from University of California, San Diego
    University of California, San Diego
    The University of California, San Diego, commonly known as UCSD or UC San Diego, is a public research university located in the La Jolla neighborhood of San Diego, California, United States...

  • "Interpreting the Data: Parallel Analysis with Sawzall" — paper by Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan; from Google Labs
    Google Labs
    Google Labs was a page created by Google to demonstrate and test new Google projects. Google calls Google Labs,Google also uses an invitation-only phase for trusted testers to test projects including Gmail, Google Calendar and Google Wave and many of these have their own "labs" webpages for...

  • "Evaluating MapReduce for Multi-core and Multiprocessor Systems" — paper by Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis; from Stanford University
    Stanford University
    The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is a private research university on an campus located near Palo Alto, California. It is situated in the northwestern Santa Clara Valley on the San Francisco Peninsula, approximately northwest of San...

  • "Why MapReduce Matters to SQL Data Warehousing" — analysis related to the August, 2008 introduction of MapReduce/SQL integration by Aster Data Systems
    Aster Data Systems
    Aster Data Systems is a data management and analysis software company headquartered in San Carlos, California. It was founded in 2005 and acquired in 2011.-Products:...

     and Greenplum
    Greenplum
    Greenplum is a database software company in San Mateo, California, specializing in enterprise data cloud solutions for large-scale data warehousing and analytics...

  • "MapReduce for the Cell B.E. Architecture" — paper by Marc de Kruijf and Karthikeyan Sankaralingam; from University of Wisconsin–Madison
    University of Wisconsin–Madison
    The University of Wisconsin–Madison is a public research university located in Madison, Wisconsin, United States. Founded in 1848, UW–Madison is the flagship campus of the University of Wisconsin System. It became a land-grant institution in 1866...

  • "Mars: A MapReduce Framework on Graphics Processors" — paper by Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, Tuyong Wang; from Hong Kong University of Science and Technology
    Hong Kong University of Science and Technology
    The Hong Kong University of Science and Technology is a public university located in Hong Kong. Established in 1991 under Hong Kong Law Chapter 1141 , it is one of the nine universities in Hong Kong.Professor Tony F. Chan is the president of HKUST...

    ; published in Proc. PACT 2008. It presents the design and implementation of MapReduce on graphics processors.
  • "A Peer-to-Peer Framework for Supporting MapReduce Applications in Dynamic Cloud Environments" — paper by Fabrizio Marozzo, Domenico Talia, Paolo Trunfio; from University of Calabria
    University of Calabria
    The University of Calabria is a state-run university in Italy.Located in Arcavacata di Rende, a suburb of Cosenza, the university was founded in 1972...

    ; published in Cloud Computing: Principles, Systems and Applications, N. Antonopoulos, L. Gillam (Editors), chapt. 7, pp. 113–125, Springer, 2010, ISBN: 978-1-84996-240-7.
  • "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters" — paper by Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker; from Yahoo and UCLA; published in Proc. of ACM SIGMOD, pp. 1029–1040, 2007. (This paper shows how to extend MapReduce for relational data processing.)
  • FLuX: the Fault-tolerant, Load Balancing eXchange operator from UC Berkeley  provides an integration of partitioned parallelism with process pairs. This results in a more pipelined approach than Google's MapReduce with instantaneous failover, but with additional implementation cost.
  • "A New Computation Model for Rack-Based Computing" — paper by Foto N. Afrati; Jeffrey D. Ullman; from Stanford University
    Stanford University
    The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is a private research university on an campus located near Palo Alto, California. It is situated in the northwestern Santa Clara Valley on the San Francisco Peninsula, approximately northwest of San...

    ; Not published as of Nov 2009. This paper is an attempt to develop a general model in which one can compare algorithms for computing in an environment similar to what map-reduce expects.
  • FPMR: MapReduce framework on FPGA -- paper by Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ningyi Xu, Huazhong Yang (2010), in FPGA '10, Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays.
  • "Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore with Tiling" -- paper by Rong Chen, Haibo Chen and Binyu Zang from Fudan University
    Fudan University
    Fudan University , located in Shanghai, is one of the oldest and most selective universities in China, and is a member of the C9 League. Its institutional predecessor was founded in 1905, shortly before the end of China's imperial Qing dynasty...

    ; published in Proc. PACT 2010. It presents the Tiled-MapReduce programming model which optimizes resource usages of MapReduce applications on multicore environment using tiling strategy.
  • "Scheduling divisible MapReduce computations " -- paper by Joanna Berlińska from Adam Mickiewicz University and Maciej Drozdowski from Poznan University of Technology
    Poznan University of Technology
    Poznań University of Technology, PUT is a university located in Poznań, Poland. Poznań University of Technology is known as one of the best technical universities in Poland...

    ; Journal of Parallel and Distributed Computing 71 (2011) 450-459, doi:10.1016/j.jpdc.2010.12.004. It presents scheduling and performance model of MapReduce.
  • "Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing" -- paper by D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke from TU Berlin published in Proc. of ACM SoCC 2010. The paper introduces the PACT programming model, a generalization of MapReduce, developed in the Stratosphere research project.
  • "MapReduce and PACT - Comparing Data Parallel Programming Models" -- paper by A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, and D. Warneke from TU Berlin published in Proc. of BTW 2011.

Books

Educational courses

  • Cluster Computing and MapReduce course from Google Code University contains video lectures and related course materials from a series of lectures that was taught to Google software engineering interns during the Summer of 2007.
  • MapReduce in a Week course from Google Code University contains a comprehensive introduction to MapReduce including lectures, reading material, and programming assignments.
  • MapReduce course, taught by engineers of Google
    Google
    Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...

     Boston, part of 2008 Independent Activities Period at MIT.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK