Boost.MapReduce Future Work
Note: This library is not yet part of the Boost libraries and is still under development and review.
This is the first release of the MapReduce library, and there are a few features that I'd still like to add:
- Improve support for other platforms. This will require help from the Boost development community.
- Add a PartioningFunction parameter to the local_disk intermediate handler to enable customisation of the partitioning of data into the final result files.
- Add a template parameter to the SortFn sort function to prevent expansion of duplicates if required. (For example, this expansion works against the combiner in wordcount, and eliminating it would improve performance considerably.)
- Extend the intermediates::local_disk<> policy class to compress the intermediate files, using the Boost.Iostreams zip/bzip2 compression libraries. This is a long-term item that will be very useful once the library is extended to support cross-machine MapReduce. Until then, its value is very limited.
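To illustrate the partitioning item above, here is a minimal sketch of what a user-supplied partitioning policy might look like. The names hash_partitioner and first_letter_partitioner, and the (key, partitions) call signature, are assumptions made for illustration only; they are not part of the library's current interface, and the eventual PartioningFunction parameter may take a different shape.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical policy: spread keys evenly over the result files by hashing.
// The call signature (key, number of partitions) is an assumption.
struct hash_partitioner
{
    template<typename Key>
    std::size_t operator()(Key const &key, std::size_t partitions) const
    {
        return std::hash<Key>()(key) % partitions;
    }
};

// Hypothetical policy: assign string keys to partitions by their first
// letter, so that each result file holds a contiguous alphabetical range.
// Assumes 'a'..'z' are contiguous in the execution character set (true for
// ASCII). Empty keys and keys not starting with a letter go to partition 0.
struct first_letter_partitioner
{
    std::size_t operator()(std::string const &key, std::size_t partitions) const
    {
        if (key.empty())
            return 0;

        int const c = std::tolower(static_cast<unsigned char>(key[0]));
        if (c < 'a' || c > 'z')
            return 0;

        // Scale the letter index (0..25) onto the range 0..partitions-1.
        return static_cast<std::size_t>(c - 'a') * partitions / 26;
    }
};
```

With four result files, first_letter_partitioner sends "apple" to partition 0 and "zebra" to partition 3, whereas hash_partitioner gives no ordering guarantee but generally balances the files more evenly. Making the policy a template parameter of the intermediate handler would let the user choose between such trade-offs at compile time.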
Multiple Machine Support
MapReduce was originally designed as a mechanism for working on large datasets across thousands of commodity servers. The current library works across multiple CPU cores on a single machine. Moving to multi-machine support is a big jump, so this is a long-term goal, but a goal nonetheless.
Distributed File System
To support MapReduce across multiple machines, some form of distributed file system (DFS) is required. I have begun development of one using Boost libraries (primarily Boost.Filesystem and Boost.Asio). The question is whether this really belongs within Boost as a C++ library, or whether it is really a runtime environment for MapReduce to sit atop. My feeling is that there is some value in having a scalable and resilient DFS, peerless and heterogeneous across all platforms, as a library that can be built into an application, but whether that is really the case remains to be seen.



