@rkenmi - Search Results

NoSQL - the Radical Databases

NoSQL: NoSQL is a category of databases that aren't relational. For example, MySQL is a relational database, whereas MongoDB is a NoSQL database. Historically, relational databases were the tried-and-true, prevalent, and reliable data stores.

December 31, 2020

Big Data Cheat Sheet

Data Warehousing Software: Hadoop. Apache Hadoop is a framework for large-scale, distributed jobs that consists of these main components. MapReduce: jobs are split into a group of mapper tasks, whose outputs are then reduced (combined) into a single output. HDFS: a distributed file system used by Hadoop, which is shared across the Hadoop cluster.

September 3, 2024
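
As a rough illustration of the mapper and reduce steps described in the Hadoop excerpt above, here is a minimal single-process sketch in Python. It does not use Hadoop itself; the names mapper, shuffle, reducer, and word_count are hypothetical and chosen only for this example, and the word-count task is an assumed stand-in for a real job.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the map, shuffle, and reduce phases in order, in one process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())
```

In a real cluster each call to mapper would run as a separate task on a different slice of the input, which is what lets the work be distributed; this sketch only shows the data flow.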

Big Data Processing: Batching vs. Streaming

Intro: In data processing, we often have to work with large amounts of data. The way in which this data is gathered comes in a few variants: batching, where we aggregate a collection of data (e.g., hourly); streaming, where data needs to be processed in real time; and a unified variant, which does not distinguish between batching and streaming and lets you use the same API for both.

January 3, 2022

Comparison Charts of File Storage Formats

Big Data Encodings: These encodings are often used with HDFS or another distributed file system. Since the data can be as large as terabytes or petabytes, it is crucial to encode files in a space-efficient way that also allows them to be read and written efficiently.

January 24, 2022

A primer on MapReduce

To understand the popular backend technology called MapReduce, let's first take a look at Map and Reduce. Terminology: Map and Reduce are the names of two popular higher-order functions used in functional programming.

July 11, 2020
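
To make the higher-order functions mentioned in the MapReduce primer excerpt concrete, here is a small sketch using Python's built-in map and functools.reduce. The numbers list and the squaring and summing operations are assumptions picked purely for illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map: apply a function to every element independently.
squares = list(map(lambda n: n * n, numbers))

# reduce: fold the mapped values into one result, starting from 0.
total = reduce(lambda acc, n: acc + n, squares, 0)
```

Because map treats each element independently, the mapped work can be split across machines, and reduce then combines the partial results.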