Boost.MapReduce

Note: This library is not part of the Boost C++ Libraries. It lived in the Boost repository for a number of years, where I developed it to its current state. There was little interest from the Boost community in adopting MapReduce as an official library, so I have moved it from the Boost Subversion repository to GitHub. I hope this will make the project more accessible.

I will update the documentation to remove the Boost branding in due course. The Boost Software License remains appropriate for this project now that it exists outside of Boost.

Copyright © 2009 Craig Henderson

Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

Motivation

MapReduce is a programming model and a distributed processing platform for generating and processing large data sets using clusters of computers. Pioneered by Google and first presented in 2004, the MapReduce programming model has since gained significant momentum in commercial, research and open-source projects, and Google updated and republished their seminal paper in 2008.

The scalability achieved by using MapReduce to spread data processing across a large number of CPUs, whether on a single server or on multiple machines, is an attractive proposition. The Boost.MapReduce library implements MapReduce across multiple CPU cores rather than across machines. The library is implemented as a set of C++ class templates and is header-only. It does, however, depend upon many other Boost libraries, such as Boost.System, Boost.Filesystem and Boost.Thread.

Other Implementations

The Google MapReduce framework is written in C++ and is not publicly available. Hadoop is an Apache project implementation of MapReduce, originally developed as an infrastructure for the Nutch Java Search Engine project. Hadoop is written in Java, with interfaces to a number of programming languages including C++ and Python. It includes a distributed file system, HDFS (Hadoop Distributed File System), which is highly fault-tolerant and designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.

Phoenix is a shared-memory implementation of MapReduce. Phoenix can be used to program multi-core chips as well as shared-memory multiprocessors (SMPs and ccNUMAs) and is available from the original authors for the Sun Solaris operating system. A port to the Linux operating system is also available. The Phoenix source code is distributed under a BSD license and the copyright is held by Stanford University.

Phoenix runs on a single computer and implements MapReduce across multiple CPU cores rather than across machines, as the Google and Hadoop implementations do. This single-machine restriction simplifies the architecture significantly. In place of a distributed file system, Phoenix uses a shared-memory model to store both the data to be processed and the results. Each Map or Reduce task runs on a CPU core, and the Phoenix runtime is responsible for consolidating results and for load balancing (allocating data to Map and Reduce tasks). Running on a single server, the Phoenix framework does not need to deal with the complexities of network communication and fault tolerance.
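
The shared-memory arrangement described above can be illustrated with a small word count. This is only a minimal sketch using standard C++ threads; it is not the Phoenix runtime or the Boost.MapReduce interface, and every name in it (Counts, map_chunk, reduce_counts, word_count) is hypothetical. It assumes the input has already been split into chunks, allocates chunks to workers round-robin as a stand-in for load balancing, and performs the reduce step on the calling thread for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical names throughout; this is not the Phoenix or Boost.MapReduce API.
using Counts = std::map<std::string, std::size_t>;

// Map phase: count the words in one chunk of input.
Counts map_chunk(const std::string& chunk) {
    Counts local;
    std::istringstream words(chunk);
    std::string word;
    while (words >> word) ++local[word];
    return local;
}

// Reduce phase: fold the per-chunk results into a single table.
Counts reduce_counts(const std::vector<Counts>& partials) {
    Counts total;
    for (const Counts& partial : partials)
        for (const auto& entry : partial) total[entry.first] += entry.second;
    return total;
}

// Runtime: hand chunks to worker threads, then consolidate the results.
Counts word_count(const std::vector<std::string>& chunks, std::size_t num_workers) {
    assert(num_workers > 0);
    // One result slot per chunk, held in memory shared by all workers.
    std::vector<Counts> partials(chunks.size());
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&chunks, &partials, w, num_workers] {
            // Worker w takes chunks w, w + n, w + 2n, ... so no two
            // workers ever write to the same slot.
            for (std::size_t i = w; i < chunks.size(); i += num_workers)
                partials[i] = map_chunk(chunks[i]);
        });
    }
    for (std::thread& worker : workers) worker.join();
    return reduce_counts(partials);
}
```

Calling word_count on the chunks "a b a", "b c" and "a" with two workers gives a count of 3 for "a", 2 for "b" and 1 for "c". Because each worker writes only to its own slots, the map phase needs no locking.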

Change History

The latest updates can be found on GitHub. I have packaged some snapshot versions as Zip archives below; these use a directory layout that conforms to the Boost sandbox. The GitHub sources have been restructured into something more usable outside of Boost.

29 April 2011
Version 0.5
  • Fixed links to downloads that have been removed from the Boost Vault
  • Clean compilation with Microsoft Visual C++
29 Aug 2009
Version 0.4
  • Breaking change: map_task and reduce_task must now implement a function call operator rather than the static methods map and reduce. The functor signatures are the same as those of the previous static methods.
  • Fixed iteration in the case where each result key has multiple values
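The difference between the two task styles mentioned in the Version 0.4 notes can be sketched in plain C++. The types and signatures below are hypothetical stand-ins chosen for illustration, not the library's actual map_task interface. The sketch only shows that the parameter list stays the same while the way the runtime invokes the task changes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these names and signatures are not the library's.
using KeyValue = std::pair<std::string, std::size_t>;

// Older style: the task exposes a static member function.
struct static_length_task {
    static void map(const std::string& line, std::vector<KeyValue>& out) {
        out.emplace_back(line, line.size());
    }
};

// Newer style: the task is a function object with the same parameter list.
struct functor_length_task {
    void operator()(const std::string& line, std::vector<KeyValue>& out) const {
        out.emplace_back(line, line.size());
    }
};

// A runtime written against the static style names the method explicitly.
template <typename MapTask>
std::vector<KeyValue> run_static(const std::vector<std::string>& lines) {
    std::vector<KeyValue> out;
    for (const std::string& line : lines) MapTask::map(line, out);
    return out;
}

// A runtime written against the functor style calls the task object itself.
template <typename MapTask>
std::vector<KeyValue> run_functor(const std::vector<std::string>& lines,
                                  MapTask task = MapTask()) {
    std::vector<KeyValue> out;
    for (const std::string& line : lines) task(line, out);
    return out;
}

// Both styles produce identical output for the same input.
void check_equivalence(const std::vector<std::string>& lines) {
    assert(run_static<static_length_task>(lines) ==
           run_functor<functor_length_task>(lines));
}
```

One practical consequence of the functor style is that a task object can carry state, which a purely static interface cannot do without globals.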
8 Aug 2009
Version 0.3
  • Added in_memory intermediate handler for processing smaller datasets that fit into main memory, avoiding the overhead of disk-based temporary storage.
  • Revised map_task and reduce_task to provide required type defs through template parameters
  • Improved library interface
  • Provided separate Test Program and Example Application
  • Updated documentation
26 Jul 2009
  • Added parametrised file_handler on the datasource.
  • Added memory-mapped file support as an alternative to std::ifstream
  • Added examples directory with wordcount example
  • Removed test directory
  • Code clean-up
23 Jul 2009
Added to Boost Sandbox (Subversion)
21 Jul 2009
Version 0.2
  • Moved the library into the boost namespace.
  • Created PartitionFn template parameter on intermediates::local_disk to enable customisation of the partitioning of data into result files.
  • Use of BOOST_THROW_EXCEPTION in place of throw.
  • Rationalised and completed include guards
  • Support for gcc 4.3.3 on Ubuntu Linux
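The idea behind a partitioning policy supplied as a template parameter, as in the PartitionFn note above, can be sketched as follows. This is an illustration under assumptions, not the real intermediates::local_disk class: partitioned_store, hash_partitioner and first_char_partitioner are hypothetical names, the policy signature is invented, and in-memory vectors stand in for the result files on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical default policy: hash the key and wrap to the partition count.
struct hash_partitioner {
    std::size_t operator()(const std::string& key, std::size_t partitions) const {
        return std::hash<std::string>()(key) % partitions;
    }
};

// Hypothetical custom policy: group keys by their first character.
struct first_char_partitioner {
    std::size_t operator()(const std::string& key, std::size_t partitions) const {
        if (key.empty()) return 0;
        return static_cast<unsigned char>(key[0]) % partitions;
    }
};

// Toy intermediate store; each bucket stands in for one result file.
template <typename PartitionFn = hash_partitioner>
class partitioned_store {
public:
    explicit partitioned_store(std::size_t partitions) : buckets_(partitions) {
        assert(partitions > 0);
    }

    // The policy alone decides which bucket receives the key/value pair.
    void insert(const std::string& key, int value) {
        buckets_[partition_(key, buckets_.size())].emplace_back(key, value);
    }

    const std::vector<std::pair<std::string, int>>& bucket(std::size_t index) const {
        return buckets_.at(index);
    }

private:
    PartitionFn partition_;
    std::vector<std::vector<std::pair<std::string, int>>> buckets_;
};
```

With partitioned_store<first_char_partitioner> and four partitions, keys beginning with the same letter always land in the same bucket. Swapping the template argument changes the data layout without touching the store itself.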
19 Jul 2009
Version 0.1
Initial public release on Boost Vault

References

Title
MapReduce: Simplified Data Processing on Large Clusters
Author(s)
Jeffrey Dean and Sanjay Ghemawat
Appeared in
OSDI'04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, CA, December 2004.
URL
http://labs.google.com/papers/mapreduce.html
Title
MapReduce: Simplified Data Processing on Large Clusters
Author(s)
Jeffrey Dean and Sanjay Ghemawat
Appeared in
Communications of the ACM 51(1) January 2008
URL
http://portal.acm.org/citation.cfm?id=1327492
Title
Evaluating MapReduce for Multi-core and Multiprocessor Systems
Author(s)
Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., & Kozyrakis, C.
Appeared in
Proceedings of the 13th Intl. Symposium on High-Performance Computer Architecture (HPCA). Phoenix, AZ.
URL
http://mapreduce.stanford.edu/