Boost.MapReduce
Note: This library is not part of the Boost C++ Libraries. It lived in the Boost repository for a number of years, where I developed it to its current state. There was little interest from the Boost community in adopting MapReduce as an official library, so I have moved it from their Subversion repository to here on GitHub. I hope this will make the project more accessible.
I will update the documentation to remove the Boost branding in due course. The Boost Software License remains appropriate for this project outside of Boost.
Copyright © 2009 Craig Henderson
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt
or copy at http://www.boost.org/LICENSE_1_0.txt)
Motivation
MapReduce is a programming model and distributed processing platform implementation for generating and processing large data sets using clusters of computers. Pioneered by Google and first presented in 2004, the MapReduce programming model has since gained significant momentum in commercial, research and open-source projects, and Google updated and republished their seminal paper in 2008.
The scalability achieved using MapReduce to implement data processing across a large number of CPUs, whether on a single server or multiple machines, is an attractive proposition. The Boost.MapReduce library is a MapReduce implementation across a plurality of CPU cores rather than machines. The library is implemented as a set of C++ class templates, and is a header-only library. It does, however, depend upon many other Boost libraries, such as Boost.System, Boost.Filesystem and Boost.Thread.
Other Implementations
The Google MapReduce framework is written in C++ and is not publicly available. Hadoop is an Apache project implementation of MapReduce, originally developed as an infrastructure for the Nutch Java Search Engine project. Hadoop is written in Java, with interfaces to a number of programming languages including C++ and Python. This system includes a distributed file system, HDFS (Hadoop Distributed File System), which is highly fault-tolerant and designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
Phoenix is a shared-memory implementation of MapReduce. Phoenix can be used to program multi-core chips as well as shared-memory multiprocessors (SMPs and ccNUMAs) and is available from the original authors for the Sun Solaris operating system. A port to the Linux operating system is also available. The Phoenix source code is distributed under a BSD license and the copyright is held by Stanford University.
Phoenix runs on a single computer and implements MapReduce across a plurality of CPU cores rather than machines as in the Google and Hadoop implementations. This single-machine restriction simplifies the architecture significantly. In place of the distributed file system, Phoenix uses a shared-memory model for storing the data to be processed and the results. Each Map or Reduce task runs on a CPU core, and the Phoenix runtime is responsible for consolidating results and load balancing (allocating data to Map and Reduce tasks). The complexities of network communication and fault tolerance do not arise for the Phoenix framework on a single server.
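The shared-memory approach described above can be illustrated with a small word count. This is a minimal sketch using only the C++ standard library; it is not the Phoenix runtime and not this library's API. The names counts, map_chunk, reduce_all and word_count are hypothetical, and the sketch assumes one thread per input chunk with a single reduce step run on the calling thread, which is simpler than the load balancing a real runtime performs.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical result type: word -> number of occurrences.
using counts = std::map<std::string, int>;

// Map task: count the words of one chunk into a table private to the task,
// so concurrently running map tasks never write to the same object.
void map_chunk(const std::string& chunk, counts& out) {
    std::istringstream words(chunk);
    std::string word;
    while (words >> word) ++out[word];
}

// Reduce step: consolidate the per-task tables into a single result.
counts reduce_all(const std::vector<counts>& partials) {
    counts total;
    for (const counts& partial : partials)
        for (const auto& kv : partial) total[kv.first] += kv.second;
    return total;
}

// Run one map task per chunk, each on its own thread, then reduce once
// every map task has finished. All data stays in shared memory.
counts word_count(const std::vector<std::string>& chunks) {
    std::vector<counts> partials(chunks.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        workers.emplace_back(map_chunk, std::cref(chunks[i]), std::ref(partials[i]));
    for (std::thread& worker : workers) worker.join();
    return reduce_all(partials);
}
```

Calling word_count on the three chunks "a b a", "b c" and "a" gives a count of 3 for a, 2 for b and 1 for c. Giving each map task its own output table avoids locking during the map phase; the only synchronisation point is the join before the reduce.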
Change History
- 29 April 2011
- Version 0.5
- Fixed links to downloads that have been removed from the Boost Vault
- Clean compilation with Microsoft Visual C++
- 29 Aug 2009
- Version 0.4
- Breaking change: map_task and reduce_task must now implement a function call operator rather than static methods map and reduce. The functor signatures are the same as the previous static methods.
- Fixed iteration in the case where each result key has multiple values
- 8 Aug 2009
- Version 0.3
- Added in_memory intermediate handler for processing smaller datasets that can fit into main storage and avoid the overhead of disk-based temporary storage.
- Revised map_task and reduce_task to provide required typedefs through template parameters
- Improved library interface
- Provided separate Test Program and Example Application
- Updated documentation
- 26 Jul 2009
- Added parametrised file_handler on the datasource.
- Added memory-mapped file support as an alternative to std::ifstream
- Added examples directory with wordcount example
- Removed test directory
- Code clean-up
- 23 Jul 2009
- Added to Boost Sandbox (subversion)
- 21 Jul 2009
- Version 0.2
- Moved the library into the boost namespace.
- Created PartitionFn template parameter on intermediates::local_disk to enable customisation of the partitioning of data into result files.
- Use of BOOST_THROW_EXCEPTION in place of throw.
- Rationalised and completed include guards
- Support for gcc 4.3.3 on Ubuntu Linux
The latest updates can be found on GitHub. I've packaged some snapshot versions into a Zip below; these are in a directory layout that conforms to the Boost sandbox. The GitHub sources are restructured into something more usable outside of a Boost library.
- 19 Jul 2009
- Version 0.1
- Initial public release on Boost Vault
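The Version 0.4 breaking change listed above replaces static map and reduce methods with a function call operator. The difference can be sketched as follows. This is an illustration only: the task types, the pairs alias, the run_static and run_functor helpers and the parameter lists are all hypothetical and are not the library's actual signatures. The sketch keeps the same parameter list in both styles, matching the note that the functor signatures are the same as the previous static methods.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical intermediate result type: (key, value) pairs.
using pairs = std::vector<std::pair<std::string, int>>;

// Static-method style: the runtime calls Task::map by name.
struct static_style_task {
    static void map(const std::string& line, pairs& out) {
        out.emplace_back(line, static_cast<int>(line.size()));
    }
};

// Function-object style: the runtime calls task(...) on an instance,
// so the task may carry per-instance state (here, a hypothetical offset).
struct functor_style_task {
    int offset;
    void operator()(const std::string& line, pairs& out) const {
        out.emplace_back(line, static_cast<int>(line.size()) + offset);
    }
};

// Hypothetical runtime hook for the static-method style.
template <typename Task>
pairs run_static(const std::vector<std::string>& lines) {
    pairs out;
    for (const std::string& line : lines) Task::map(line, out);
    return out;
}

// Hypothetical runtime hook for the function-object style.
template <typename Task>
pairs run_functor(const std::vector<std::string>& lines, Task task) {
    pairs out;
    for (const std::string& line : lines) task(line, out);
    return out;
}
```

One practical consequence of the function-object style, in general C++ terms, is that a task instance can hold configuration or state, which a static method cannot do without resorting to globals.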
References
- Title
- MapReduce: Simplified Data Processing on Large Clusters
- Author(s)
- Jeffrey Dean and Sanjay Ghemawat
- Appeared in
- OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
- URL
- http://labs.google.com/papers/mapreduce.html
- Title
- MapReduce: Simplified Data Processing on Large Clusters
- Author(s)
- Jeffrey Dean and Sanjay Ghemawat
- Appeared in
- Communications of the ACM 51(1) January 2008
- URL
- http://portal.acm.org/citation.cfm?id=1327492
- Title
- Evaluating MapReduce for Multi-core and Multiprocessor Systems
- Author(s)
- Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., & Kozyrakis, C.
- Appeared in
- Proceedings of the 13th Intl. Symposium on High-Performance Computer Architecture (HPCA). Phoenix, AZ.
- URL
- http://mapreduce.stanford.edu/



