Boost.MapReduce platform notes
Note: This library is not yet part of the Boost Library and is still under development and review.
Microsoft Windows and MSVC 8 (2005)
This library has been developed and tested using Microsoft Visual C++ v8, aka Visual Studio 2005. The code compiles cleanly and runs as both 32-bit and 64-bit processes on Windows XP (32-bit) and Windows Server 2003 64-bit Edition.
STL
The STL implementation supplied with Microsoft Visual C++ v8 suffers from significant performance problems because it applies indiscriminate, fine-grained synchronisation locking. The MapReduce library is designed to be a high-performance library and partitions data so that each thread can process its share independently of the other threads. The unnecessary locking overhead in MSVC8's STL negates some of the high-performance benefits of the library.
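To illustrate the partitioning idea described above, here is a minimal sketch of a word count in which every worker thread owns its own container, so the counting itself needs no locks. It is not the library's implementation: it uses C++11 `std::thread` for brevity, and the names `counts_t`, `count_partition` and `parallel_wordcount` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

typedef std::map<std::string, unsigned> counts_t;

// Count the words of docs[first, last) into a map owned by a single worker.
void count_partition(std::vector<std::string> const &docs,
                     std::size_t first, std::size_t last, counts_t &local)
{
    for (std::size_t i = first; i < last; ++i) {
        std::istringstream in(docs[i]);
        std::string word;
        while (in >> word)
            ++local[word];
    }
}

// Split docs into contiguous chunks, count each chunk on its own thread,
// then merge the private maps on the calling thread.
counts_t parallel_wordcount(std::vector<std::string> const &docs, unsigned num_threads)
{
    assert(num_threads > 0);
    std::vector<counts_t> partial(num_threads);  // one private map per thread
    std::vector<std::thread> workers;
    std::size_t const chunk = (docs.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t const first = std::min(docs.size(), t * chunk);
        std::size_t const last = std::min(docs.size(), first + chunk);
        workers.emplace_back(count_partition, std::cref(docs), first, last,
                             std::ref(partial[t]));
    }
    for (std::size_t i = 0; i < workers.size(); ++i)
        workers[i].join();

    // Single-threaded merge: the only point where results are combined.
    counts_t total;
    for (std::size_t i = 0; i < partial.size(); ++i)
        for (counts_t::const_iterator it = partial[i].begin(); it != partial[i].end(); ++it)
            total[it->first] += it->second;
    return total;
}
```

Because no container is shared while the threads run, any locking that the standard library performs internally on each container operation is pure overhead in a design like this.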
I therefore recommend using an alternative STL implementation to achieve maximum performance. I have tested the library with STLPort 5.2.1, compiled without thread support:
STLport-5.2.1>configure msvc8 -p winxp -x --without-thread --with-dynamic-rtl
The difference in run time is significant. Using the Word Count example on a sample dataset of six plain text files totalling 90.8 MB (95,284,354 bytes), the STLPort version ran in 26% of the time taken using the MSVC STL. The statistics from the two runs follow; the first, slower run is the MSVC STL build and the second is the STLPort build.
MapReduce Wordcount Application
2 CPU cores
class mapreduce::job<class wordcount::map_task,class wordcount::reduce_task,clas
s wordcount::combiner,class mapreduce::datasource::directory_iterator<class word
count::map_task>,class mapreduce::intermediates::local_disk<class wordcount::map
_task,struct mapreduce::detail::file_sorter,struct mapreduce::detail::file_merge
r> >
Running CPU Parallel MapReduce...
CPU Parallel MapReduce Finished.
MapReduce statistics:
MapReduce job runtime : 434 seconds, of which...
Map phase runtime : 418 seconds
Reduce phase runtime : 16 seconds
Map:
Total Map keys : 6
Map keys processed : 6
Map key processing errors : 0
Number of Map Tasks run (in parallel) : 2
Fastest Map key processed in : 8 seconds
Slowest Map key processed in : 389 seconds
Average time to process Map keys : 81 seconds
Reduce:
Number of Reduce Tasks run (in parallel): 2
Number of Result Files : 10
Fastest Reduce key processed in : 2 seconds
Slowest Reduce key processed in : 4 seconds
Average time to process Reduce keys : 5 seconds
MapReduce Wordcount Application
2 CPU cores
class mapreduce::job<class wordcount::map_task,class wordcount::reduce_task,clas
s wordcount::combiner,class mapreduce::datasource::directory_iterator<class word
count::map_task>,class mapreduce::intermediates::local_disk<class wordcount::map
_task,struct mapreduce::detail::file_sorter,struct mapreduce::detail::file_merge
r> >
Running CPU Parallel MapReduce...
CPU Parallel MapReduce Finished.
MapReduce statistics:
MapReduce job runtime : 116 seconds, of which...
Map phase runtime : 114 seconds
Reduce phase runtime : 2 seconds
Map:
Total Map keys : 6
Map keys processed : 6
Map key processing errors : 0
Number of Map Tasks run (in parallel) : 2
Fastest Map key processed in : 1 seconds
Slowest Map key processed in : 112 seconds
Average time to process Map keys : 19 seconds
Reduce:
Number of Reduce Tasks run (in parallel): 2
Number of Result Files : 10
Fastest Reduce key processed in : 0 seconds
Slowest Reduce key processed in : 1 seconds
Average time to process Reduce keys : 0 seconds
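The 26% figure can be checked against the two listings above, assuming the 434-second run is the MSVC STL build and the 116-second run is the STLPort build. The helper below is a hypothetical illustration, not part of the library; it expresses one runtime as a whole-number percentage of another, truncating the fraction.

```cpp
#include <cassert>

// Runtime expressed as a truncated whole-number percentage of a baseline runtime.
inline unsigned percent_of(unsigned runtime_seconds, unsigned baseline_seconds)
{
    assert(baseline_seconds != 0);
    return runtime_seconds * 100u / baseline_seconds;
}
```

With the job runtimes from the listings, `percent_of(116, 434)` yields 26. The same calculation on the per-phase figures gives 27 for the map phase (114 of 418 seconds) and 12 for the reduce phase (2 of 16 seconds).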
gcc 3.4.4 under Cygwin
I have successfully compiled using GCC 3.4.4 under Cygwin, but do not have a full development environment with Boost et al. to run any tests.
$ g++ -Wall -c -I../../../.. -I/cygdrive/c/root/Development/Library/Boost/boost_1_39_0 wordcount.cpp
There are also some missing functions in the linux_os namespace which I have not implemented. Any help implementing these for non-Windows platforms is appreciated.
namespace linux_os {
unsigned const number_of_cpus(void); // !!! not implemented
std::string &get_temporary_filename(std::string &pathname); // !!! not implemented
} // namespace linux_os
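As a starting point for anyone wishing to help, the two declarations above could be implemented along the following lines. This is an untested-in-the-library sketch under stated assumptions, not the author's code: it lives in a hypothetical namespace `linux_os_sketch`, it assumes `sysconf(_SC_NPROCESSORS_ONLN)` is available (as it is on Linux with glibc), and it assumes the temporary file should be created in `$TMPDIR`, or `/tmp` if that is unset, with the hypothetical name prefix `mapreduce_`.

```cpp
#include <stdexcept>
#include <string>
#include <vector>
#include <stdlib.h>   // getenv, mkstemp
#include <unistd.h>   // sysconf, close

namespace linux_os_sketch {   // hypothetical namespace, mirrors linux_os

// Number of processors currently online; falls back to 1 if the query fails.
inline unsigned const number_of_cpus(void)
{
    long const n = ::sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? static_cast<unsigned>(n) : 1u;
}

// Creates an empty, uniquely named temporary file and stores its path in pathname.
// The file is created (not just named) so that no other process can claim the name.
inline std::string &get_temporary_filename(std::string &pathname)
{
    char const *dir = ::getenv("TMPDIR");
    std::string const tmpl =
        std::string((dir && *dir) ? dir : "/tmp") + "/mapreduce_XXXXXX";

    // mkstemp modifies its argument in place, so it needs a writable buffer.
    std::vector<char> buf(tmpl.begin(), tmpl.end());
    buf.push_back('\0');

    int const fd = ::mkstemp(&buf[0]);
    if (fd == -1)
        throw std::runtime_error("mkstemp failed");
    ::close(fd);

    pathname.assign(&buf[0]);
    return pathname;
}

} // namespace linux_os_sketch
```

The caller is responsible for deleting the file when it is no longer needed. Whether the library expects the file to exist already, or only a unique name, is not stated in these notes, so that behaviour is an assumption as well.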
gcc 4.3.3 on Ubuntu Linux 9.04
I have successfully compiled using GCC 4.3.3 on Ubuntu Linux 9.04 (32bit), but do not yet have a full development environment with Boost et al. to run any tests.
$ g++ -Wall -c -I../../../.. -I/cygdrive/c/root/Development/Library/Boost/boost_1_39_0 wordcount.cpp




