Boost C++ Libraries

Boost.MapReduce platform notes

Note: This library is not yet part of the Boost Libraries and is still under development and review.

Microsoft Windows and MSVC 8 (2005)

This library has been developed and tested using Microsoft Visual C++ v8 (Visual Studio 2005). The code compiles cleanly and runs as both 32-bit and 64-bit processes on Windows XP 32-bit and Windows Server 2003 64-bit Edition.

STL

The STL implementation supplied with Microsoft Visual C++ v8 suffers from significant performance problems because it performs indiscriminate, fine-grained synchronisation locking. The MapReduce library is designed for high performance and partitions data so that each thread can process its own data independently of the others. The unnecessary locking overhead in the MSVC 8 STL negates some of that benefit.
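
The partitioning idea can be illustrated with a minimal sketch. It does not use the library's own classes: `count_partition` and `count_words` are hypothetical names invented for this example, and it assumes a C++11 compiler with `std::thread`, which postdates MSVC 8. Each thread writes only to its own container, so no locking is needed until the single-threaded merge at the end.

```cpp
#include <cstddef>
#include <functional>   // std::ref, std::cref
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

typedef std::map<std::string, unsigned> word_counts;

// Hypothetical helper, not part of the library: counts the words of one
// partition into a container that only the calling thread touches.
void count_partition(const std::string &text, word_counts &local)
{
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        ++local[word];
}

// Runs one thread per partition. Every thread owns exactly one element of
// 'locals', so the counting phase needs no locks.
word_counts count_words(const std::vector<std::string> &partitions)
{
    // Sized up front so the elements never move while threads use them.
    std::vector<word_counts> locals(partitions.size());
    std::vector<std::thread> workers;
    workers.reserve(partitions.size());

    for (std::size_t i = 0; i < partitions.size(); ++i)
        workers.emplace_back(count_partition,
                             std::cref(partitions[i]),
                             std::ref(locals[i]));

    for (std::size_t i = 0; i < workers.size(); ++i)
        workers[i].join();

    // Single-threaded merge once every worker has finished.
    word_counts total;
    for (std::size_t i = 0; i < locals.size(); ++i)
        for (word_counts::const_iterator it = locals[i].begin();
             it != locals[i].end(); ++it)
            total[it->first] += it->second;
    return total;
}
```

With this layout, any lock taken inside the containers themselves is pure overhead, because no container is ever shared between running threads.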

I therefore recommend using an alternative STL implementation to achieve maximum performance. I have tested the library with STLport 5.2.1, compiled without thread support:

STLport-5.2.1>configure msvc8 -p winxp -x --without-thread --with-dynamic-rtl

The difference in run time is significant. Running the Word Count example on a sample dataset of six plain text files totalling 90.8 MB (95,284,354 bytes), the STLport version ran in about 27% of the time taken using the MSVC STL (116 seconds against 434 seconds). The first listing below is the MSVC STL run and the second is the STLport run.

MapReduce Wordcount Application
2 CPU cores
class mapreduce::job<class wordcount::map_task,class wordcount::reduce_task,class wordcount::combiner,class mapreduce::datasource::directory_iterator<class wordcount::map_task>,class mapreduce::intermediates::local_disk<class wordcount::map_task,struct mapreduce::detail::file_sorter,struct mapreduce::detail::file_merger> >

Running CPU Parallel MapReduce...
CPU Parallel MapReduce Finished.

MapReduce statistics:
  MapReduce job runtime                     : 434 seconds, of which...
    Map phase runtime                       : 418 seconds
    Reduce phase runtime                    : 16 seconds

  Map:
    Total Map keys                          : 6
    Map keys processed                      : 6
    Map key processing errors               : 0
    Number of Map Tasks run (in parallel)   : 2
    Fastest Map key processed in            : 8 seconds
    Slowest Map key processed in            : 389 seconds
    Average time to process Map keys        : 81 seconds

  Reduce:
    Number of Reduce Tasks run (in parallel): 2
    Number of Result Files                  : 10
    Fastest Reduce key processed in         : 2 seconds
    Slowest Reduce key processed in         : 4 seconds
    Average time to process Reduce keys     : 5 seconds

MapReduce Wordcount Application
2 CPU cores
class mapreduce::job<class wordcount::map_task,class wordcount::reduce_task,class wordcount::combiner,class mapreduce::datasource::directory_iterator<class wordcount::map_task>,class mapreduce::intermediates::local_disk<class wordcount::map_task,struct mapreduce::detail::file_sorter,struct mapreduce::detail::file_merger> >

Running CPU Parallel MapReduce...
CPU Parallel MapReduce Finished.

MapReduce statistics:
  MapReduce job runtime                     : 116 seconds, of which...
    Map phase runtime                       : 114 seconds
    Reduce phase runtime                    : 2 seconds

  Map:
    Total Map keys                          : 6
    Map keys processed                      : 6
    Map key processing errors               : 0
    Number of Map Tasks run (in parallel)   : 2
    Fastest Map key processed in            : 1 seconds
    Slowest Map key processed in            : 112 seconds
    Average time to process Map keys        : 19 seconds

  Reduce:
    Number of Reduce Tasks run (in parallel): 2
    Number of Result Files                  : 10
    Fastest Reduce key processed in         : 0 seconds
    Slowest Reduce key processed in         : 1 seconds
    Average time to process Reduce keys     : 0 seconds
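
The comparison between the two runs can be checked with a few lines of arithmetic. The sketch below assumes that the 434-second listing is the MSVC STL run and the 116-second listing is the STLport run, which is consistent with the ratio quoted in the text. The function and constant names are made up for this illustration.

```cpp
#include <cstdio>

// Total job runtimes in seconds, copied from the two statistics listings.
const double msvc_stl_seconds = 434.0;   // assumed: MSVC 8 STL run
const double stlport_seconds  = 116.0;   // assumed: STLport 5.2.1 run

// Fraction of the baseline runtime that the faster run needed.
double runtime_fraction(double faster, double baseline)
{
    return faster / baseline;
}

// How many times faster the faster run was than the baseline.
double speedup(double faster, double baseline)
{
    return baseline / faster;
}

// Prints both figures for the two runs recorded above.
void print_comparison()
{
    std::printf("runtime fraction: %.1f%%\n",
                100.0 * runtime_fraction(stlport_seconds, msvc_stl_seconds));
    std::printf("speedup: %.2fx\n",
                speedup(stlport_seconds, msvc_stl_seconds));
}
```

Almost all of the saving comes from the map phase (418 seconds down to 114), which is where most of the run time is spent in both listings.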

GCC 3.4.4 under Cygwin

I have successfully compiled using GCC 3.4.4 under Cygwin, but do not have a full development environment with Boost et al. to run any tests.

$ g++ -Wall -c -I../../../.. -I/cygdrive/c/root/Development/Library/Boost/boost_1_39_0 wordcount.cpp

Some functions in the linux_os namespace are declared but not yet implemented. Any help implementing these for non-Windows platforms would be appreciated.

namespace linux_os {
    unsigned const  number_of_cpus(void);                            // !!! not implemented
    std::string    &get_temporary_filename(std::string &pathname);   // !!! not implemented
}   // namespace linux_os
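
As a starting point, the two declarations could be filled in along the following lines. This is only a sketch under stated assumptions, not the library's implementation: it assumes that `get_temporary_filename` is meant to store a unique temporary file name in `pathname` and return a reference to it, and that files belong under `/tmp` with a `mapreduce_` prefix, neither of which is specified by the library. It relies on `sysconf(_SC_NPROCESSORS_ONLN)`, which glibc provides but which is not strictly POSIX, and on POSIX `mkstemp`.

```cpp
#include <cstdlib>      // mkstemp (declared via <stdlib.h> on POSIX systems)
#include <stdexcept>
#include <string>
#include <vector>
#include <unistd.h>     // sysconf, close

namespace linux_os {

// Number of processors currently online; falls back to 1 if the query fails.
unsigned const number_of_cpus(void)
{
    long const count = ::sysconf(_SC_NPROCESSORS_ONLN);
    return count > 0 ? static_cast<unsigned>(count) : 1u;
}

// Assumed semantics: store a unique temporary file name in 'pathname' and
// return a reference to it. mkstemp creates the (empty) file, which avoids
// the race inherent in name-only functions such as tmpnam.
std::string &get_temporary_filename(std::string &pathname)
{
    // Assumption: the directory and prefix are arbitrary choices.
    std::string const pattern = "/tmp/mapreduce_XXXXXX";
    std::vector<char> buffer(pattern.begin(), pattern.end());
    buffer.push_back('\0');

    int const fd = ::mkstemp(&buffer[0]);
    if (fd == -1)
        throw std::runtime_error("mkstemp failed");
    ::close(fd);        // the caller reopens the file by name

    pathname.assign(&buffer[0]);
    return pathname;
}

}   // namespace linux_os
```

Because `mkstemp` leaves an empty file on disk, the caller remains responsible for deleting it once the intermediate data is no longer needed.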

GCC 4.3.3 on Ubuntu Linux 9.04

I have successfully compiled using GCC 4.3.3 on Ubuntu Linux 9.04 (32-bit), but do not yet have a full development environment with Boost et al. to run any tests.

$ g++ -Wall -c -I../../../.. -I/cygdrive/c/root/Development/Library/Boost/boost_1_39_0 wordcount.cpp