@rkenmi - Search Results

A primer on MapReduce

... one single computer hold everything. Google's paper on MapReduce MapReduce is basically a distributed processing model for very large files. Hadoop and Amazon EMR are basically implementations of MapReduce.

July 11, 2020

Asynchrony vs. Multithreading

Asynchrony Asynchronous programming, also known as event-driven programming, is built on foundations of Futures/promises. The basic idea is that instead of having a thread wait for a blocked call to finish (i.

June 20, 2020

NumPy vs. Pandas, and other flavors (Dask, Modin, Ray)

NumPy NumPy is a Python library for numerical computing that offers multi-dimensional arrays and indices as data structures and additional high-level math utilities. ndarray The unique offering of NumPy is the ndarray data structure, which stands for n-dimensional array.

May 26, 2022

Big Data Processing: Batching vs. Streaming

Intro In data processing, we often have to work with large amounts of data. The way in which this data is gathered comes in a few variants: batching, where we aggregate a collection of data (e.g., by hourly time), streaming for data that needs to be processed in real-time, and a unified variant which simply does not distinguish the technical difference between batching and streaming, allowing you to programmatically use the same API for both.

January 3, 2022

Big Data Cheat Sheet

Data Warehousing Software Hadoop Apache Hadoop is a framework for large-scale, distributed jobs that consists of these main components: MapReduce: jobs are distributed into a group of mapper tasks and then reduced (combined) into a single output HDFS: A distributed file system used by Hadoop, which is shared across the Hadoop cluster.

September 3, 2024

DataFrames (a software engineer's perspective)

What is a DataFrame? A DataFrame is a special data structure used primarily by data scientists and machine learning algorithms. It contains row and column data in tabular fashion by storing metadata about each column and row.

May 2, 2022

Data stores in Software Architectures

Use Cases There are many ways to store your data. In this article we'll walk through some examples of data storage in common system designs. Reminder: There is no single best storage choice and they may vary heavily depending on things such as access patterns and scale.

July 19, 2021

Distributed scaling with Relational Databases

Background A lot of articles will talk about how to scale databases. Typically, they will talk about the purpose and the general idea of sharding and replication, but often times these topics are explained separately and not so much in conjunction.

January 19, 2022

NoSQL - the Radical Databases

NoSQL NoSQL is a category of databases that aren't relational. For example, MySQL would be a relational database, where as MongoDB would be a NoSQL database. Back then, relational databases were the tried-and-true, prevalent and reliable data stores.

December 31, 2020

Python Essentials

Scopes Python has closures, similar to Javascript, since functions are first class objects. But unlike Javascript, there are some subtle gotchas in regards to working with function scopes. nonlocal vs.

April 3, 2019

Sharding Techniques

Introduction Sharding can be summarized as a technique in which a database table can be split into multiple database servers to optimize read/write performance. Benefits include: Optimized query time Instead of having one huge database table, you have multiple smaller tables in more than one machine.

July 31, 2020

Webpack: Usage Examples

Webpack has been around since 2012 and it is a very popular tool nowadays. You'll see it mentioned in a lot of front-end stacks. I've personally used it to power this blog and a handful of my own React projects such as https://classic-ah.

June 13, 2020

Design Concepts

In this article, I want to go over some fundamental design concepts that are useful for coming up with system design. Requirements Functional Requirements Describes specific behaviors i.e. If a URL is generated, it is composed of a Base64 encoded alias Non-functional Requirements Describes architectural requirements i.

November 20, 2019

Atomic operations with Elasticsearch

Preface Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Key Terms: Document - Serialized JSON data.

May 6, 2020