Boost.MapReduce
Note: This library is not yet part of the Boost Library and is still under development and review.
Copyright © 2009 Craig Henderson
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt
or copy at http://www.boost.org/LICENSE_1_0.txt)
Motivation
MapReduce is a programming model and distributed processing platform implementation for generating and processing large data sets using clusters of computers. Pioneered by Google and first presented in 2004, the MapReduce programming model has gained significant momentum in commercial, research and open-source projects since, and Google have updated and republished their seminal paper in 2008.
The scalability achieved by using MapReduce to spread data processing across a large number of CPUs, whether on a single server or on multiple machines, is an attractive proposition. The Boost.MapReduce library is a MapReduce implementation across a plurality of CPU cores rather than machines. The library is implemented as a set of C++ class templates and is header-only. It does, however, depend upon other Boost libraries, such as Boost.System, Boost.Filesystem and Boost.Thread.
Other Implementations
The Google MapReduce framework is written in C++ and is not publicly available. Hadoop is an Apache project implementation of MapReduce, originally developed as an infrastructure for the Nutch Java Search Engine project. Hadoop is written in Java, with interfaces to a number of programming languages including C++ and Python. The system includes a distributed file system, HDFS (Hadoop Distributed File System), which is highly fault-tolerant and designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
Phoenix is a shared-memory implementation of MapReduce. Phoenix can be used to program multi-core chips as well as shared-memory multiprocessors (SMPs and ccNUMAs) and is available from the original authors for the Sun Solaris operating system. A port to the Linux operating system is also available. The Phoenix source code is distributed under a BSD license and the copyright is held by Stanford University.
Phoenix runs on a single computer and implements MapReduce across a plurality of CPU cores rather than across machines as in the Google and Hadoop implementations. This single-machine restriction simplifies the architecture significantly. In place of the distributed file system, Phoenix uses a shared-memory model for storing both the data to be processed and the results. Each Map or Reduce task runs on a CPU core, and the Phoenix runtime is responsible for consolidating results and for load balancing (allocating data to Map and Reduce tasks). The complexities of network communication and fault tolerance across machines do not arise for the Phoenix framework on a single server.
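The shared-memory model described above can be sketched in a few lines of standard C++. This is only an illustration of the idea, not the Phoenix or Boost.MapReduce API: the names counts_t, map_chunk and word_count are hypothetical, the input is assumed to be a vector of text lines already held in memory, and the reduce phase is run serially on the calling thread for brevity. Each worker thread writes to its own slot of a shared vector, so no locking is required.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical type for this sketch: word -> number of occurrences.
using counts_t = std::map<std::string, std::size_t>;

// Map phase for one chunk: count the words in lines [first, last).
counts_t map_chunk(const std::vector<std::string>& lines,
                   std::size_t first, std::size_t last)
{
    counts_t local;
    for (std::size_t i = first; i < last; ++i) {
        std::istringstream words(lines[i]);
        std::string word;
        while (words >> word)
            ++local[word];
    }
    return local;
}

// Runs the map phase on num_workers threads, then reduces the
// per-thread results into a single map on the calling thread.
counts_t word_count(const std::vector<std::string>& lines,
                    std::size_t num_workers)
{
    assert(num_workers > 0);

    // One intermediate slot per worker, held in shared memory.
    // Each thread writes only to its own slot.
    std::vector<counts_t> intermediate(num_workers);
    std::vector<std::thread> workers;

    // Simple static load balancing: equal-sized chunks of lines.
    const std::size_t chunk = (lines.size() + num_workers - 1) / num_workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        const std::size_t first = std::min(w * chunk, lines.size());
        const std::size_t last  = std::min(first + chunk, lines.size());
        workers.emplace_back([&lines, &intermediate, w, first, last] {
            intermediate[w] = map_chunk(lines, first, last);
        });
    }
    for (auto& t : workers)
        t.join();

    // Reduce phase: sum the counts for each key.
    counts_t result;
    for (const auto& partial : intermediate)
        for (const auto& kv : partial)
            result[kv.first] += kv.second;
    return result;
}
```

Calling word_count(lines, std::thread::hardware_concurrency()) would use one worker per reported core, provided that value is non-zero. With gcc, thread support typically has to be enabled at build time with -pthread.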
Change History
- 29 Aug 2009
- DOWNLOAD v0.4
- Breaking change: map_task and reduce_task must now implement a function call operator rather than the static methods map and reduce. The functor signatures are the same as those of the previous static methods.
- Fixed iteration in the case where each result key has multiple values
- 8 Aug 2009
- DOWNLOAD v0.3
- Added in_memory intermediate handler for processing smaller datasets that can fit into main memory, avoiding the overhead of disk-based temporary storage.
- Revised map_task and reduce_task to provide required typedefs through template parameters
- Improved library interface
- Provided separate Test Program and Example Application
- Updated documentation
- 26 Jul 2009
- Added parametrised file_handler on the datasource.
- Added memory-mapped file support as an alternative to std::ifstream
- Added examples directory with wordcount example
- Removed test directory
- Code clean-up
- 23 Jul 2009
- Added to Boost Sandbox (subversion)
- 21 Jul 2009
- DOWNLOAD v0.2
- Moved the library into the boost namespace.
- Created PartitionFn template parameter on intermediates::local_disk to enable customisation of the partitioning of data into result files.
- Use of BOOST_THROW_EXCEPTION in place of throw.
- Rationalised and completed include guards
- Support for gcc 4.3.3 on Ubuntu Linux
The latest updates can be found in the Boost Sandbox
- 19 Jul 2009
- DOWNLOAD v0.1
- Initial public release on Boost Vault
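The v0.4 breaking change listed in the change history above replaces static map and reduce methods with a function call operator. The difference can be illustrated with the following sketch. Everything in it is hypothetical: toy_runtime, emit_words, the two task types and the parameter lists are invented for illustration and are not the library's actual signatures, which this page does not give.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a runtime: it only collects emitted
// key/value pairs. Not part of Boost.MapReduce.
struct toy_runtime
{
    std::vector<std::pair<std::string, unsigned>> emitted;

    void emit(const std::string& key, unsigned value)
    {
        emitted.emplace_back(key, value);
    }
};

// Shared map logic: emit (word, 1) for each whitespace-separated word.
inline void emit_words(toy_runtime& runtime, const std::string& line)
{
    std::istringstream words(line);
    std::string word;
    while (words >> word)
        runtime.emit(word, 1);
}

// Pre-v0.4 style (assumed shape): a static member function named map.
struct static_map_task
{
    static void map(toy_runtime& runtime, const std::string& line)
    {
        emit_words(runtime, line);
    }
};

// v0.4 style (assumed shape): a function call operator taking the
// same parameters as the former static method.
struct functor_map_task
{
    void operator()(toy_runtime& runtime, const std::string& line) const
    {
        emit_words(runtime, line);
    }
};

// How a caller would invoke each style.
template<typename MapTask>
void run_static(toy_runtime& runtime, const std::string& line)
{
    MapTask::map(runtime, line);   // no task object is created
}

template<typename MapTask>
void run_functor(toy_runtime& runtime, const std::string& line)
{
    MapTask task;                  // a task object is created...
    task(runtime, line);           // ...and called like a function
}
```

One general consequence of the functor form is that a task is an object and can therefore carry per-task state in data members, which a static method cannot do without resorting to static data.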
References
- Title
- MapReduce: Simplified Data Processing on Large Clusters
- Author(s)
- Jeffrey Dean and Sanjay Ghemawat
- Appeared in
- OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
- URL
- http://labs.google.com/papers/mapreduce.html
- Title
- MapReduce: Simplified Data Processing on Large Clusters
- Author(s)
- Jeffrey Dean and Sanjay Ghemawat
- Appeared in
- Communications of the ACM 51(1) January 2008
- URL
- http://portal.acm.org/citation.cfm?id=1327492
- Title
- Evaluating MapReduce for Multi-core and Multiprocessor Systems
- Author(s)
- Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., & Kozyrakis, C.
- Appeared in
- Proceedings of the 13th Intl. Symposium on High-Performance Computer Architecture (HPCA). Phoenix, AZ.
- URL
- http://mapreduce.stanford.edu/



