Boost C++ Libraries

Boost.MapReduce Tutorial

Note: This library is not yet part of the Boost libraries and is still under development and review.

This tutorial introduces the concepts and framework for MapReduce programming using the Boost.MapReduce library. Note that it is NOT a tutorial on the MapReduce programming idiom itself. Maybe that will follow one day...

Principles

As a library user, you specify a map function object that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function object that merges all intermediate values associated with the same intermediate key. These function objects are called MapTask and ReduceTask respectively.

map (k1,v1) --> list(k2,v2)
reduce (k2,list(v2)) --> list(v2)
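
To make the two signatures above concrete, here is a minimal in-memory sketch of the same data flow using only the standard library. It is not the Boost.MapReduce API: the names map_fn, reduce_fn and word_count are hypothetical and exist only for this illustration, with k1 taken to be a document name, v1 its text, k2 a word and v2 a count.

```cpp
#include <cassert>
#include <map>
#include <numeric>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map (k1,v1) --> list(k2,v2): k1 is a document name, v1 its text;
// every whitespace-separated word becomes an intermediate (word, 1) pair.
std::vector<std::pair<std::string, unsigned> >
map_fn(std::string const & /*name*/, std::string const &text)
{
    std::vector<std::pair<std::string, unsigned> > out;
    std::istringstream words(text);
    std::string word;
    while (words >> word)
        out.push_back(std::make_pair(word, 1u));
    return out;
}

// reduce (k2,list(v2)) --> list(v2): sums the counts collected for one word.
std::vector<unsigned>
reduce_fn(std::string const & /*word*/, std::vector<unsigned> const &counts)
{
    return std::vector<unsigned>(1, std::accumulate(counts.begin(), counts.end(), 0u));
}

// Drives both phases: map every document, group by k2, reduce each group.
std::map<std::string, unsigned>
word_count(std::vector<std::pair<std::string, std::string> > const &documents)
{
    std::map<std::string, std::vector<unsigned> > groups;
    for (std::size_t d = 0; d < documents.size(); ++d) {
        std::vector<std::pair<std::string, unsigned> > const pairs =
            map_fn(documents[d].first, documents[d].second);
        for (std::size_t i = 0; i < pairs.size(); ++i)
            groups[pairs[i].first].push_back(pairs[i].second);
    }

    std::map<std::string, unsigned> result;
    for (std::map<std::string, std::vector<unsigned> >::const_iterator g = groups.begin();
         g != groups.end(); ++g) {
        std::vector<unsigned> const reduced = reduce_fn(g->first, g->second);
        assert(reduced.size() == 1);   // this reducer emits exactly one value
        result[g->first] = reduced.front();
    }
    return result;
}
```

The grouping step in word_count is the part a MapReduce framework normally performs on the user's behalf; only map_fn and reduce_fn correspond to user-supplied code.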

MapReduce Job

A single instance of execution in MapReduce is called a Job, and is implemented by boost::mapreduce::job. The simplest definition of a MapReduce Job type just specifies the user-defined MapTask and ReduceTask:

typedef
mapreduce::job<
  wordcount::map_task,
  wordcount::reduce_task>
job;

The library's job class template provides for more configuration than this, though, through additional template parameters that have defaults:

template<typename MapTask,
         typename ReduceTask,
         typename Combiner=null_combiner,
         typename Datasource=datasource::directory_iterator<MapTask>,
         typename IntermediateStore=intermediates::local_disk<MapTask> >
class job;
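
The effect of the defaulted parameters can be sketched with placeholder types. Everything below is a stand-in written for this illustration: the policy types are empty structs, the nested typedef names (combiner_type and so on) are assumptions rather than members documented by the library, and only the shape of the template parameter list is taken from the declaration above.

```cpp
#include <cassert>
#include <type_traits>

// Placeholder policy types standing in for the library's defaults.
struct null_combiner {};
template<typename MapTask> struct directory_datasource {};
template<typename MapTask> struct local_disk_store {};

// Same parameter list shape as the job declaration above.
template<typename MapTask,
         typename ReduceTask,
         typename Combiner=null_combiner,
         typename Datasource=directory_datasource<MapTask>,
         typename IntermediateStore=local_disk_store<MapTask> >
class job
{
  public:
    // Hypothetical typedefs, exposed here only so the defaults can be inspected.
    typedef Combiner          combiner_type;
    typedef Datasource        datasource_type;
    typedef IntermediateStore intermediate_store_type;
};

struct my_map {};
struct my_reduce {};
struct my_combiner {};

// Two-argument form: all three policies take their defaults.
typedef job<my_map, my_reduce> default_job;

// Overriding only the combiner leaves the later defaults in place.
typedef job<my_map, my_reduce, my_combiner> combining_job;

bool default_job_uses_null_combiner()
{
    return std::is_same<default_job::combiner_type, null_combiner>::value;
}

bool combining_job_uses_custom_combiner()
{
    return std::is_same<combining_job::combiner_type, my_combiner>::value;
}

bool both_jobs_share_datasource()
{
    return std::is_same<default_job::datasource_type,
                        combining_job::datasource_type>::value;
}
```

Because template arguments are positional, replacing the Datasource or IntermediateStore in this scheme requires spelling out every parameter that precedes it.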

MapTask

Requirements of a MapTask function object are:

  • Provide the type definitions key_type and value_type for the Map Key (k1) and Map Value (v1)
  • Implement a function call operator, operator()

struct map_task : public boost::mapreduce::map_task<
                             std::string,                            // MapKey
                             std::pair<char const *, char const *> > // MapValue
{
    template<typename Runtime>
    void operator()(Runtime &runtime, std::string const &key, value_type const &value) const;
};

The MapTask functor derives from boost::mapreduce::map_task, which defines the types required by the Boost.MapReduce library.

The map function is implemented as a member function template whose first function parameter is of the template type Runtime. The library passes this parameter to be used as a callback for emitting intermediate key/value pairs. The other two parameters, key and value, are of the types given in the map_task template parameter list. Note that the const qualifiers on these parameters are optional, but recommended where possible.
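
The following sketch shows one way the body of such a map function might look for word counting, and how it can be exercised without the library. The base class map_task_base and the recording_runtime type are hypothetical stand-ins written for this example, and the callback name emit_intermediate is an assumption; check the library headers for the actual Runtime interface.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Stand-in for boost::mapreduce::map_task (assumption: for this
// illustration it only needs to publish key_type and value_type).
template<typename MapKey, typename MapValue>
struct map_task_base
{
    typedef MapKey   key_type;
    typedef MapValue value_type;
};

// Hypothetical runtime that records every emitted intermediate pair.
struct recording_runtime
{
    std::vector<std::pair<std::string, unsigned> > emitted;

    void emit_intermediate(std::string const &key, unsigned value)
    {
        emitted.push_back(std::make_pair(key, value));
    }
};

struct map_task : public map_task_base<
                             std::string,                            // MapKey
                             std::pair<char const *, char const *> > // MapValue
{
    // value is a [first, second) character range; emit (word, 1) per word.
    template<typename Runtime>
    void operator()(Runtime &runtime, std::string const & /*key*/, value_type const &value) const
    {
        assert(value.first <= value.second);
        char const *p = value.first;
        while (p != value.second) {
            // Skip separators, then collect one run of letters.
            while (p != value.second && !std::isalpha(static_cast<unsigned char>(*p)))
                ++p;
            char const *word_begin = p;
            while (p != value.second && std::isalpha(static_cast<unsigned char>(*p)))
                ++p;
            if (p != word_begin)
                runtime.emit_intermediate(std::string(word_begin, p), 1u);
        }
    }
};

// Runs the map task over one in-memory buffer and returns what it emitted.
std::vector<std::pair<std::string, unsigned> > run_map(char const *text)
{
    recording_runtime runtime;
    map_task task;
    task(runtime, "buffer", std::make_pair(text, text + std::strlen(text)));
    return runtime.emitted;
}
```

Because Runtime is a template parameter, the same functor works with any object offering a compatible emit callback, which is what makes it straightforward to test in isolation.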

ReduceTask

Requirements of a ReduceTask function object are:

  • Provide the type definitions key_type and value_type for the Reduce Key (k2) and Reduce Value (v2)
  • Implement a reducer function call operator, operator()

struct reduce_task : public boost::mapreduce::reduce_task<std::string, unsigned>
{
    template<typename Runtime, typename It>
    void operator()(Runtime &runtime, std::string const &key, It it, It const ite) const
    {
        // std::accumulate requires <numeric>; the 0u initial value keeps the sum unsigned
        runtime.emit(key, std::accumulate(it, ite, 0u));
    }
};

The ReduceTask functor derives from boost::mapreduce::reduce_task, which defines the types required by the Boost.MapReduce library.

The reduce function is implemented as a member function template whose first function parameter is of the template type Runtime, as with the map function above. The second template parameter, It, is the iterator type of the third and fourth function parameters, which delimit the range of values (v2) associated with the key.
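
As a sketch of how the iterator pair is consumed, the reducer can be driven by hand over a std::vector. The types reduce_task_base and collecting_runtime are hypothetical stand-ins for the library's base class and runtime, written only for this example; the emit call mirrors the one used in the reduce_task shown earlier.

```cpp
#include <cassert>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Stand-in for boost::mapreduce::reduce_task (assumption: for this
// illustration it only needs to publish key_type and value_type).
template<typename ReduceKey, typename ReduceValue>
struct reduce_task_base
{
    typedef ReduceKey   key_type;
    typedef ReduceValue value_type;
};

// Hypothetical runtime that stores the final value emitted for each key.
struct collecting_runtime
{
    std::map<std::string, unsigned> results;

    void emit(std::string const &key, unsigned value) { results[key] = value; }
};

struct reduce_task : public reduce_task_base<std::string, unsigned>
{
    // [it, ite) bounds the intermediate values gathered for key.
    template<typename Runtime, typename It>
    void operator()(Runtime &runtime, std::string const &key, It it, It const ite) const
    {
        // value_type() is unsigned zero, so the sum is carried out in unsigned.
        runtime.emit(key, std::accumulate(it, ite, value_type()));
    }
};

// Reduces one key's values by hand and returns the emitted total.
unsigned reduce_counts(std::string const &key, std::vector<unsigned> const &values)
{
    collecting_runtime runtime;
    reduce_task task;
    task(runtime, key, values.begin(), values.end());
    assert(runtime.results.count(key) == 1);
    return runtime.results[key];
}
```

Taking the iterator type as a template parameter means the reducer does not depend on how the intermediate values are stored; any pair of input iterators over values convertible to value_type will do.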

See the Word Count example for a detailed breakdown of a simple implementation.