Search Results


14 matches found for 'distributed processing'

A primer on MapReduce

... one single computer hold everything. Google's paper on MapReduce MapReduce is basically a distributed processing model for very large files. Hadoop and Amazon EMR are basically implementations of MapReduce.


Asynchrony vs. Multithreading

Asynchrony Asynchronous programming, also known as event-driven programming, is built on foundations of Futures/promises. The basic idea is that instead of having a thread wait for a blocked call to finish (i.


Traditional Message Queues vs. Log-based Message Brokers

Traditional Message Queues Traditional message queues are based off of the JMS / AMQP standard. These message brokers focus on a pub/sub model where publishers write messages to a queue and the queue is consumed by subscribers.


NumPy vs. Pandas, and other flavors (Dask, Modin, Ray)

NumPy NumPy is a Python library for numerical computing that offers multi-dimensional arrays and indices as data structures and additional high-level math utilities. ndarray The unique offering of NumPy is the ndarray data structure, which stands for n-dimensional array.


Big Data Processing: Batching vs. Streaming

Intro In data processing, we often have to work with large amounts of data. The way in which this data is gathered comes in a few variants: batching, where we aggregate a collection of data (e.g., by hourly time), streaming for data that needs to be processed in real-time, and a unified variant which simply does not distinguish the technical difference between batching and streaming, allowing you to programmatically use the same API for both.


DataFrames (a software engineer's perspective)

What is a DataFrame? A DataFrame is a special data structure used primarily by data scientists and machine learning algorithms. It contains row and column data in tabular fashion by storing metadata about each column and row.


Data stores in Software Architectures

Use Cases There are many ways to store your data. In this article we'll walk through some examples of data storage in common system designs. Reminder: There is no single best storage choice and they may vary heavily depending on things such as access patterns and scale.


Distributed scaling with Relational Databases

Background A lot of articles will talk about how to scale databases. Typically, they will talk about the purpose and the general idea of sharding and replication, but often times these topics are explained separately and not so much in conjunction.


NoSQL - the Radical Databases

NoSQL NoSQL is a category of databases that aren't relational. For example, MySQL would be a relational database, where as MongoDB would be a NoSQL database. Back then, relational databases were the tried-and-true, prevalent and reliable data stores.


Python Essentials

Scopes Python has closures, similar to Javascript, since functions are first class objects. But unlike Javascript, there are some subtle gotchas in regards to working with function scopes. nonlocal vs.


Sharding Techniques

Introduction Sharding can be summarized as a technique in which a database table can be split into multiple database servers to optimize read/write performance. Benefits include: Optimized query time Instead of having one huge database table, you have multiple smaller tables in more than one machine.


Webpack: Usage Examples

Webpack has been around since 2012 and it is a very popular tool nowadays. You'll see it mentioned in a lot of front-end stacks. I've personally used it to power this blog and a handful of my own React projects such as https://classic-ah.


Design Concepts

In this article, I want to go over some fundamental design concepts that are useful for coming up with system design. Requirements Functional Requirements Describes specific behaviors i.e. If a URL is generated, it is composed of a Base64 encoded alias Non-functional Requirements Describes architectural requirements i.


Atomic operations with Elasticsearch

Preface Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Key Terms: Document - Serialized JSON data.