Intro

In data processing, we often work with large amounts of data. There are a few ways to gather and process it: batching, where we aggregate a bounded collection of data (e.g., everything from the past hour); streaming, for data that needs to be processed in real time; and a unified approach, which hides the technical difference between batching and streaming so that you can use the same API for both.

Batching

Batch data processing is also known as offline processing. When we say batching, we typically mean grouping some data, running a job on it (e.g., MapReduce), and sending the results to some output (e.g., a database or data warehouse).
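To make the group, process, and output steps concrete, here is a minimal single-process sketch of the MapReduce pattern in plain Python. It is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_batch_job`) and the word-count job are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_batch_job(records):
    """Run map, shuffle, and reduce over a bounded collection of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    batch = ["error timeout", "ok", "error disk full"]
    print(run_batch_job(batch))
```

In a real cluster the map and reduce calls would run on many machines and the result would be written to a database or warehouse rather than printed.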

Batch processing is common at larger companies where huge amounts of data (up to petabytes) need to go through some transformation. Because the data is so large, (1) jobs can take a long time (hours or days), and (2) we need a framework that can execute the transformation logic at big data scale.

Input

Batch processing relies on bounded data. Traditionally, the input for batch processing is a set of files stored on disk. This is the case for MapReduce implementations such as Hadoop. These files may be produced by daily cron jobs or exported from copies of OLTP (Online Transaction Processing) databases, such as a SQL database for inventory or customer purchases.

The intuition behind using files as input is that large amounts of data fit more easily on disk than in memory. Disks are also durable (enough) compared to memory, which loses its contents when the host dies, for example from a hardware failure.

Since input files can be very large, they are often split into multiple files (or partitions) and stored in formats that support compression to reduce disk usage (e.g., Parquet, Avro). Simpler text formats such as CSV and TSV are often used too.
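As a rough illustration of partitioned, compressed input, the sketch below writes rows into several gzip-compressed CSV files and streams them back one partition at a time. It uses only the Python standard library; the `part-00000.csv.gz` naming scheme and the helper names are assumptions, and a real pipeline would more likely use a format such as Parquet or Avro.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def write_partitions(rows, out_dir, rows_per_partition):
    """Split rows into gzip-compressed CSV partition files; return their paths."""
    out_dir = Path(out_dir)
    paths = []
    for start in range(0, len(rows), rows_per_partition):
        path = out_dir / f"part-{len(paths):05d}.csv.gz"
        with gzip.open(path, "wt", newline="") as f:
            csv.writer(f).writerows(rows[start:start + rows_per_partition])
        paths.append(path)
    return paths


def read_partitions(paths):
    """Yield rows partition by partition, so only one file is open at a time."""
    for path in paths:
        with gzip.open(path, "rt", newline="") as f:
            yield from csv.reader(f)


if __name__ == "__main__":
    sample = [[str(i), f"item-{i}"] for i in range(5)]
    with tempfile.TemporaryDirectory() as work_dir:
        files = write_partitions(sample, work_dir, rows_per_partition=2)
        print([p.name for p in files])
        print(list(read_partitions(files)))
```

Because each partition is an independent file, separate workers could read different partitions in parallel.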

Output

The output of batch processing has a few popular use cases. One of them is bulk-importing the output into OLAP (Online Analytical Processing) databases, more commonly known as data warehouses. Data warehouses do not need to serve near-real-time traffic (e.g., a live website's database queries) and are tuned for analytics access patterns: generally large scans and read-only queries. Since data analysts rarely pull every column in a query, these databases are also read-optimized by storing data column by column, rather than storing each row with all of its columns together.

Data warehouses and transactional SQL databases are similar in that they usually share the same query interface (SQL). But data warehouses are optimized for scanning entire tables (which column-oriented storage narrows to only the columns a query needs) and are not meant to support real-time writes. Transactional databases, on the other hand, use row-oriented storage and are more general purpose (they can be tuned however you like).
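The difference between row-oriented and column-oriented layouts can be sketched with plain Python data structures. This is only a toy model of the idea, not how any particular warehouse stores data; the sample `orders` table and the function names are made up for illustration.

```python
# Row-oriented layout: each record keeps all of its columns together.
orders = [
    {"order_id": 1, "customer": "ann", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "ann", "amount": 7.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_store(rows):
    """Visit every row object just to read a single field from each."""
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns):
    """Read only the 'amount' column; the other columns are never touched."""
    return sum(columns["amount"])


if __name__ == "__main__":
    print(total_amount_row_store(orders))
    print(total_amount_column_store(to_columns(orders)))
```

Both functions return the same total, but the column layout lets an analytics query skip data it does not need, which is the access pattern warehouses are tuned for.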

Another popular use case is to supply this data to data scientists for machine learning use, such as for building a classification model.

MapReduce vs. alternatives

Nowadays, a popular setup for batch data processing is Hadoop with Apache Spark as the dataflow engine. Spark was developed as a more performant alternative to traditional Hadoop MapReduce, leveraging the memory (RAM) of distributed machines to compute workflows. In contrast, a traditional MapReduce workflow consists of a chain of MapReduce jobs, with each job writing its intermediate output to HDFS, only for it to become the input of the next job. Spark avoids writing these intermediate files to disk where it can, keeping them in memory instead.

Note: Spark does not come with its own distributed data storage. It can be used with Hadoop, reading workflow input from and writing output to HDFS, but it can also be used without Hadoop: for example, it can use S3 or Cassandra for data storage instead.
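The contrast between the two execution styles can be sketched in plain Python. This is a single-machine analogy, not Spark or Hadoop code: a local JSON file stands in for HDFS, and the two stage functions and the record fields (`level`, `service`) are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def stage_one(records):
    """First job: keep only error records."""
    return [r for r in records if r["level"] == "error"]


def stage_two(records):
    """Second job: count records per service."""
    counts = {}
    for r in records:
        counts[r["service"]] = counts.get(r["service"], 0) + 1
    return counts


def run_chained_on_disk(records, work_dir):
    """MapReduce style: stage one writes a file that stage two reads back."""
    intermediate = Path(work_dir) / "stage_one_output.json"
    intermediate.write_text(json.dumps(stage_one(records)))
    return stage_two(json.loads(intermediate.read_text()))


def run_chained_in_memory(records):
    """Spark style: the intermediate result is handed over in memory."""
    return stage_two(stage_one(records))


if __name__ == "__main__":
    logs = [
        {"level": "error", "service": "cart"},
        {"level": "info", "service": "cart"},
        {"level": "error", "service": "search"},
        {"level": "error", "service": "cart"},
    ]
    with tempfile.TemporaryDirectory() as work_dir:
        print(run_chained_on_disk(logs, work_dir))
    print(run_chained_in_memory(logs))
```

Both paths produce the same result; the on-disk version simply pays for an extra write and read between stages, which is the overhead that grows with every additional job in the chain.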

Real-world Examples

A great example of batching is gathering daily analytics from the server logs of a production, user-facing service. Gathering the logs may take a while, and transforming the data into what we want can also take a while, but that's fine. We care more about atomicity and availability here than about how long it takes to record all of our analytics in our database.
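A daily analytics job over server logs might look roughly like the sketch below. The log line format (`<ISO timestamp> <method> <path> <status>`) is an assumption made for this example, as is the `daily_request_counts` helper; real access logs vary.

```python
from collections import Counter
from datetime import datetime


def daily_request_counts(log_lines):
    """Count requests per (day, path) from whitespace-separated log lines.

    Assumed line format: '2024-01-01T10:00:00 GET /home 200'.
    """
    counts = Counter()
    for line in log_lines:
        timestamp, _method, path, _status = line.split()
        day = datetime.fromisoformat(timestamp).date().isoformat()
        counts[(day, path)] += 1
    return counts


if __name__ == "__main__":
    lines = [
        "2024-01-01T10:00:00 GET /home 200",
        "2024-01-01T23:59:59 GET /home 200",
        "2024-01-02T00:00:01 GET /cart 500",
    ]
    for (day, path), count in sorted(daily_request_counts(lines).items()):
        print(day, path, count)
```

The job only needs to run once all of a day's logs have been collected, so latency matters much less than producing a complete result.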

Streaming

Streaming is for data that we need to process in real time. For example, in a shopping app, we might want to notify users when an item comes back in stock, but only if they have added it to their wishlist.

Streaming works off of unbounded data, or an infinite data stream. Data is continuously flowing and continuously processed.
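The wishlist example can be sketched with Python generators, which model an unbounded stream well because they produce and process one event at a time. The `restock_events` source, the event shape, and the sample wishlists are all hypothetical; in practice events would arrive from a broker or socket.

```python
import itertools


def restock_events():
    """Hypothetical unbounded source: yields restock events forever."""
    items = ["lamp", "desk", "chair"]
    for i in itertools.count():
        yield {"item": items[i % len(items)], "in_stock": True}


def notifications(events, wishlists):
    """Lazily yield (user, item) for each restock event matching a wishlist."""
    for event in events:
        if not event["in_stock"]:
            continue
        for user, wished in wishlists.items():
            if event["item"] in wished:
                yield user, event["item"]


wishlists = {"ann": {"desk"}, "bob": {"lamp", "desk"}}

# The stream never ends, so take only the first three notifications.
first_three = list(itertools.islice(notifications(restock_events(), wishlists), 3))

if __name__ == "__main__":
    print(first_three)
```

Nothing is ever collected into a complete dataset: each event is handled as it arrives, and the consumer decides when to stop reading.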

Streaming frameworks come in many flavors, but Apache Flink and Apache Kafka (through Kafka Streams) are perhaps the most popular open-source options. Kafka itself has wider use cases, as it is primarily used as a pub-sub messaging broker.

Apache Spark is traditionally used for batching, but can also be used for streaming data with some caveats. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can then be pushed out to file systems, databases, and live dashboards.

One thing to note, however, is that Spark Streaming is more accurately a micro-batch data processing framework: it splits the incoming stream into small batches and processes each batch in turn.
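The micro-batch idea can be sketched as a small generator that chops a stream into short lists. This is an illustration of the concept only, not Spark's API, and it makes one simplifying assumption: batches are cut by item count, whereas Spark Streaming cuts them by time interval.

```python
import itertools


def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) iterable into lists of up to batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted (never happens for an endless stream)
        yield batch


if __name__ == "__main__":
    # Bounded input: the last batch may be smaller than batch_size.
    print(list(micro_batches(range(5), 2)))

    # Unbounded input: batches are produced lazily, one at a time.
    endless = micro_batches(itertools.count(), 3)
    print(next(endless), next(endless))
```

Each batch can then be handed to ordinary batch logic, which is why results arrive with a small delay rather than per event.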

Batch and Streaming (Unified)

It may be the case that we want to experiment with both batching and streaming, or at least be portable enough to switch from one data processing engine (e.g., Spark) to another without rewriting or refactoring a bunch of Spark code.

Apache Beam and Apache Flink both unify batching and streaming under the same APIs. Beam goes a step further as a high-level abstraction that supports multiple data processing engines: it has a runner for each engine, such as the Spark runner.
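The appeal of a unified API can be sketched in plain Python: the transformation is written once against any iterable, and the caller decides whether the source is bounded or unbounded. This is not Beam's or Flink's API; `keep_paid_events`, the event fields, and both sources are hypothetical.

```python
import itertools


def keep_paid_events(events):
    """One transformation, written once, for bounded or unbounded input."""
    for event in events:
        if event["amount"] > 0:
            yield event["user"], event["amount"]


# Bounded source: a finite batch that can be processed to completion.
bounded_source = [
    {"user": "ann", "amount": 5},
    {"user": "bob", "amount": 0},
]


def unbounded_source():
    """Hypothetical endless source of events."""
    for i in itertools.count(1):
        yield {"user": f"user-{i}", "amount": i % 2}


if __name__ == "__main__":
    # Batch mode: consume everything.
    print(list(keep_paid_events(bounded_source)))

    # Streaming mode: same function, but only a prefix can ever be consumed.
    print(list(itertools.islice(keep_paid_events(unbounded_source()), 2)))
```

Frameworks with a unified model apply the same principle at cluster scale, so the pipeline logic does not need to be rewritten when the input changes from files to a live stream.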

The benefits of Apache Beam are the portability and the modular design of the runners, allowing you to switch from one engine to another with ease. However, using Spark natively has its own set of benefits, such as higher performance (when properly tweaked).


Big Data Processing: Batching vs. Streaming