Big Data Cheat Sheet

Data Warehousing Software

Hadoop

Apache Hadoop is a framework for large-scale, distributed jobs that consists of these main components:

  • MapReduce: jobs are distributed into a group of mapper tasks and then reduced (combined) into a single output
  • HDFS: A distributed file system used by Hadoop, which is shared across the Hadoop cluster. It greatly resembles and was inspired from the directory based file system in a Linux/Unix machine.
  • YARN: Yet another resource negotiator. Manages the resources within the cluster, and schedules tasks to nodes in the cluster. In this case, the tasks are MapReduce jobs.

Hadoop is the grand daddy of it all - the first in the big data processing realm to allow computations of data at massive scale to run.

Hive

Hadoop has one problem - it's not easy to write MapReduce applications in Java to perform ETL or gather analytics. That's why Apache Hive came into the picture. Apache Hive is a data warehousing framework that works on top of Hadoop and provides a SQL interface for you to write SQL queries that translate into MapReduce, Tez, or Apache Spark jobs.

Three main features:

  • Write SQL queries for ETL and analytical data
  • Access files from HDFS or other data storage systems like HBase
  • A mechanism to impose structure on a variety of table formats

Hive introduced one of the earliest concepts of an open table format. An open table format is an abstraction layer for data lakes and data warehouses which provides database like features, such as ACID transactions for CRUD (create, read, update, delete) and schema evolution. It allows large datasets to be processed efficiently.

Data Catalog

A data catalog in its simplest form manages all of your data for you.

  • You have data stored in multiple systems but only want to to locate data using one centralized service
  • Data needs to be governed, i.e. give proper access controls for sensitive data
  • Data compliance, for when laws like GDPR or DMA require you to submit or delete sensitive data if requested by another user
  • You want additional metadata and analytics for your data
  • Improve data quality with enforcements such as schema versioning, ACID transactions, etc.

Implementations

  • Hadoop Catalog
    • Typically used in the Hadoop framework for MapReduce jobs and interacting directly with HDFS
    • Doesn't have ACID transaction support
  • Hive Metastore
    • Apache Hive added the enhancement for allowing Hadoop MapReduce jobs to run via SQL queries.
    • The Hive Metastore is the data catalog for Hive, which helps structure data into databases, tables, partitions. For example, data can be written to a Hive-style path such as s3://test/id=1/name=Fred which will write partitioned data based on the fields id, name.
  • AWS Glue
    • Catalog over the AWS cloud
    • Supports various table formats, including Hive and Iceberg.

Open Table Formats

Hudi

Iceberg

Delta Lake

Data Handling Strategies

Copy-on-Write

This is a term that was popularized by Apache Hudi community. Basically, copy-on-write means that old data files will be copied into new files. Then, updates and deletes will be applied on top of the new files.

  • Great for optimizing read/query performance
  • If a single row in a single file has changed, the whole file is replaced. Thus, you'll have slower write performance.

Merge-on-Read

This pattern, in contrast to copy-on-write, will not rewrite files. Instead, it creates new Delete files with records that are marked for deletion. Update files work similarly - a new file with records marked for updates.

  • Fast updates and deletes
  • Iceberg and Delta Lake does not support it
  • Slow reads, as it will fetch separate files for updates and deletes to merge.

Equality vs. Positional Deletes

Within Iceberg, you have the choice of defining Delete files that are equality based or positional based.

Compute

Data reads and writes at a large scale requires plentiful hardware and a stable infrastructure.

Spark

Apache Spark is an in-memory compute engine that is most popular for it's usage in ETL (Extract, Transform, Load) jobs.

In the old days, the Hadoop framework was used to execute ETL jobs. With Hadoop jobs, tasks are executed in a distributed fashion under the MapReduce paradigm. During each BSP step in a MapReduce job, intermediate results were written to the HDFS (Hadoop File System) storage layer, which ultimately reside in the storage disk of nodes. YARN would be used to manage the nodes in a Hadoop cluster and schedule MapReduce tasks to the nodes appropriately.

Nowadays, Spark processing is used as a replacement for MapReduce. Spark allows the intermediate results that were written to HDFS in MapReduce to be written to memory instead, allowing computed values to be cached and fetched much faster. Spark jobs are split into stages, and YARN is still used within Hadoop to schedule tasks to Spark executors (worker nodes).

Ray

Data Warehouse Modeling

Kimball Dimensional Modeling

Dimensional Modeling is a data warehousing model that starts ground-up from the business process and revolves around the concepts of facts (quantitative measures) and dimensions (context). To model, the four-step process below is used.

  1. Business Process

    • The business process could be something as simple as "I would like to query a customer's street address". This requirement dictates the significance of what data to store.
  2. Grain

    • The grain of the model is the exact description of what the dimensional model should be focusing on. This could for instance be “An individual line item on a customer slip from a retail store”.
  3. Dimensions Table

    • The Dimensions table provides context around data from the facts table. Dimensions tables are generally named based on nouns. For example, if we are modeling business data for a shopping website, one dimensions table could be "DTRANSACTIONS", a table for all purchase transactions made by customers. Another one can be "DSHIPPINGINFO", for all shipping addresses recorded by each customer. The prefix "D" is arbitrarily added to signify "dimensions".
  4. Facts Table

    • The Facts Table stores quantitative measures, usually in the form of numeric types. These values are usually produced by operational measurement events in the real world. These fields are used as the PRIMARY KEYS to the Dimensions Table to provide context data. For example, in "DSHIPPINGINFO" dimension table above, we will likely need some identifier that represents a specific customer, thus the table could have "CUSTOMERID" as a PK, and "CUSTOMERID" will be a field in the Facts Table.