@rkenmi - Search Results

Big Data Processing: Batching vs. Streaming

Intro: In data processing, we often have to work with large amounts of data. This data is typically gathered in one of a few ways: batching, where we aggregate a collection of data (e.g., by hour); streaming, for data that needs to be processed in real time; and a unified variant, which does not distinguish between batching and streaming at the API level, allowing you to use the same API for both.

January 3, 2022
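
As a rough illustration of the distinction drawn in the excerpt above, here is a minimal sketch in plain Python. The event shape `(timestamp, value)` and the function names `hourly_batches` and `running_total` are assumptions made for this example; they do not come from the article or from any particular framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (unix timestamp in seconds, value)
Event = Tuple[int, int]


def hourly_batches(events: Iterable[Event]) -> Dict[int, int]:
    """Batching: collect events into hourly buckets and sum each bucket."""
    totals: Dict[int, int] = defaultdict(int)
    for ts, value in events:
        totals[ts // 3600] += value  # bucket key is the hour index
    return dict(totals)


def running_total(events: Iterable[Event]) -> Iterator[int]:
    """Streaming: emit an updated result as soon as each event arrives."""
    total = 0
    for _, value in events:
        total += value
        yield total


events = [(0, 5), (1800, 7), (3600, 1), (7300, 2)]
batch_result = hourly_batches(events)
stream_result = list(running_total(iter(events)))
print(batch_result)
print(stream_result)
```

Both functions accept any iterable, so the same call works on a finished list (a batch) or on a generator that yields events as they arrive (a stream). That is loosely the idea behind a unified API, although real frameworks also handle concerns such as windowing and late data.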

A primer on MapReduce

To understand the popular backend technology called MapReduce, let's first take a look at Map and Reduce. Terminology: The terms Map and Reduce come from two widely used higher-order functions in functional programming.

July 11, 2020
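
For readers unfamiliar with the two higher-order functions mentioned in the excerpt above, here is a minimal sketch using Python's built-in `map` and `functools.reduce`. The helper names `square` and `add` and the sample data are assumptions chosen for illustration; this shows the functional-programming primitives only, not the distributed MapReduce framework.

```python
from functools import reduce


def square(x: int) -> int:
    """Function applied independently to every element (the 'map' step)."""
    return x * x


def add(acc: int, x: int) -> int:
    """Function that folds elements into one accumulated value (the 'reduce' step)."""
    return acc + x


numbers = [1, 2, 3, 4]

# map: apply `square` to each element, producing a new sequence
squared = list(map(square, numbers))

# reduce: combine the mapped values into a single result, starting from 0
total = reduce(add, squared, 0)

print(squared)
print(total)
```

Because `square` touches each element independently, the map step could in principle be split across machines, which is the property that the MapReduce model builds on.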

Seattle Conference on Scalability: YouTube Scalability

Notes: Apache isn't that great at serving static content for a large number of requests vs. NetScaler load balancing. Python is fast enough; there are many other bottlenecks, such as waiting for calls from the DB, cache, etc.

December 25, 2020

Apache Kafka and Event Streaming

Introduction: Apache Kafka is an open-source distributed event streaming platform. Traditional message brokers are based on the JMS or AMQP standards. These message brokers focus on a pub/sub model in which publishers write messages to a queue and subscribers consume from that queue.

March 31, 2021

Big Data Cheat Sheet

Data Warehousing Software: Hadoop. Apache Hadoop is a framework for large-scale, distributed jobs that consists of these main components: MapReduce, in which jobs are distributed into a group of mapper tasks and then reduced (combined) into a single output; and HDFS, a distributed file system used by Hadoop, which is shared across the Hadoop cluster.

September 3, 2024

Data stores in Software Architectures

Use Cases: There are many ways to store your data. In this article we'll walk through some examples of data storage in common system designs. Reminder: There is no single best storage choice; the right one can vary heavily depending on factors such as access patterns and scale.

July 19, 2021

Useful Links

This is a personal list of useful resources that I come across for improving web stacks, frameworks, development, UX, and more! Software Engineering: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!); Encryption vs.

September 4, 2017

Distributed scaling with Relational Databases

Background: A lot of articles talk about how to scale databases. Typically, they cover the purpose and general idea of sharding and replication, but these topics are often explained separately rather than in conjunction.

January 19, 2022

AWS and MLOps

Machine Learning Development Lifecycle: The machine learning development process often follows these steps: 1. Data Collection: In this step, we fetch data from various sources. Common examples include a data lake, a data catalog, or streaming data (from sources like Kafka or Kinesis).

May 18, 2024

CAP Patterns

The CAP theorem states that a distributed system can guarantee only two of its three properties at any given time. Intro to CAP: Consistency: every read reflects the latest write. Availability: every request receives a response, although the response data might be stale. Partition Tolerance: the system can handle network partitions or network failures. MTV's The Real World: If your service is in the cloud, the P (Partition Tolerance) always has to be accounted for.

December 16, 2020

Comparison Charts of File Storage Formats

Big Data Encodings: These encodings are often used with HDFS or some other distributed file system. Since the data can be as large as terabytes or petabytes, it is crucial to encode files in a space-efficient way that also allows them to be read and written efficiently.

January 24, 2022