A primer on MapReduce

To first understand this very popular backend technology called MapReduce, let's take a look at Map and Reduce.

Terminology

The terms Map and Reduce are actually very popular higher-order functions used in functional programming.

Map

A map is used to apply a function on each element in some container and transform them.

For example, if I want to convert a list of numbers to strings in Javascript:

[0, 1, 2, 3].map(number => Integer.parseInt(number));

Reduce

Reduce is also known as folding or aggregation. The idea of reduce is to accumulate values, or merge them together.

For example, if I want to add a total of a list of numbers in Javascript:

const INITIAL_TOTAL = 0  
[0, 1, 2, 3].reduce((sum, number) => sum + number, INITIAL_TOTAL);

Distributed Processing

What's the point?

Distributed processing is the idea of doing some work (i.e. running an algorithm) in small parts across multiple machines.

For example, if you have to upload a 5 PB (petabyte) file into a storage service, but none of the servers have enough storage space for that, then what do you do? One tactic is to split the gigantic file into small chunks, and have multiple machines hold those chunks, rather than just having one single computer hold everything.

Google's paper on MapReduce

MapReduce is basically a distributed processing model for very large files. Hadoop and Amazon EMR are basically implementations of MapReduce. MapReduce allows the client to define the mapping functions and reducer functions. In addition, they can customize and tune the parameters for their workers to best fit custom use cases.

MapReduce works in two stages, Map and Reduce. They are quite familiar.

Distributed Workers and Master

Because we are dealing with very large files, we need to be extra careful about processing them. As mentioned above, a huge file will not fit into RAM, and a single machine will also not be able to hold the entire file in its hard disk either. Therefore, the entire file has to be split into pieces, scattered throughout multiple machines.

That's fine and dandy, but we also need a Master planner to keep track of which split is handled by which worker. Therefore, one worker out of the cluster of workers is elected as the master.

The master planner keeps track of which split belongs to which worker inside a meta-table that is housed inside its memory. The master host may go down sometimes but it's okay - it happens quite rarely and we can save the history of the master's actions somewhere and read from it, if we ever need to replace it. We can also just have a replica of the master, that is frequently in sync with the master - these just hold meta-information about the split/worker relationships, so it doesn't take up a huge amount of space.

Note: Distributed processing and distributed storage are two different things. MapReduce is expected to work with a distributed storage mechanism, such as GFS (Google File System), or HDFS for Hadoop.

Map Stage

The map stage is essentially synonymous with the functional programming map. A mapper function is used to take a key-value pair and transform them into the desired data format.

During the map stage, the master planner will keep track of which split (basically a piece of the file to process) is handled by which worker, using its own map meta-table. The results of the mapping are stored in the local disks of the workers, quietly waiting for the next stage.

Reduce Stage

The reduce stage, as explained earlier, is the stage where all the values are accumulated. In the context of key-value pairs, there will be multiple values with the same key. This is because the key-value pairs were mapped in smaller parts, across a cluster of machines. This means that all of the key-value pairs with the same unique key must be merged together, so that we have a key corresponding to a list of all values.

The reduce stage does this by reading off of the local disks of the workers. The master planner has its own separate meta-table for the reducer. When all keys are successfully reduced, or merged together, the output is now stored in the distributed file storage, which is configurable on behalf of the MapReduce adopter.

Implementation

In the code implementation of MapReduce, one caveat to be mindful of is that the Map and Reduce functions typically output into streams. Therefore, data is processed as strings, which means that properties associated with data types can be lost (for example, numbers and sorting). Some implementations allow you to process data as JSON, which can preserve types.

Another caveat is that you could often have multiple MapReduce steps. In the code snippet below, a second reducer is used to output sorted rating counts by movie ID.

Python example: MapReduce for counting IMDB movie ratings sorted by popularity:

from mrjob.job import MRJob  
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):  
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                  reducer=self.reducer_count_ratings),
            MRStep(reducer=self.reducer_sort_out)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield movieID, 1

    def reducer_count_ratings(self, key, values):
        # Zero padding allows number strings to be sorted by the MR streams
        yield str(sum(values)).zfill(5), key

    def reducer_sort_out(self, ratingCounts, movieIDs):
        for movie in movieIDs:
            yield movie, int(ratingCounts)

if __name__ == '__main__':  
    RatingsBreakdown.run()

Go example: Word Count

func Map(filename string, contents string) []mr.KeyValue {  
    // function to detect word separators.
    ff := func(r rune) bool { return !unicode.IsLetter(r) }

    // split contents into an array of words.
    words := strings.FieldsFunc(contents, ff)

    kva := []mr.KeyValue{}
    for _, w := range words {
        kv := mr.KeyValue{w, "1"}
        kva = append(kva, kv)
    }
    return kva
}

func Reduce(key string, values []string) string {  
    // return the number of occurrences of this word.
    return strconv.Itoa(len(values))
}

Practicality

In practice, code implementations of MapReduce have some concerns:

  • You have to write code for mappers and reducers
  • You have to debug and test the code, which can be difficult
  • You have to convert business logic and fit it into the realm of mappers and reducers, which can be difficult

For these reasons, there are frameworks such as Apache Pig (included with Hadoop) that allow you to define mappers and reducers without writing code.

A more popular alternative to use is Apache Spark, which is about 100x faster than Hadoop, while also allowing you to bypass the need to write explicit mappers and reducers. Spark achieves the performance benefits by storing most of the data in memory across a cluster of machines in distributed fashion, using a DAG scheduler.

With that said, developers are still able to explicitly define a mapper and reducer function with Hadoop or Spark if needed.

Conclusion

MapReduce is a programming model that allows you to process huge files by 1. transforming the input into small chunks across a cluster of workers and transforming them into a desirable state, and 2. accumulating the results by reducing them across a cluster of workers. MapReduce isn't a framework itself, but more so a concept. The most popular implementation of Google's MapReduce paper is Apache Hadoop. Although the code implementation of MapReduce is rather cryptic, it represents the underlying foundation of tools such as Hadoop or Spark.