... one single computer hold everything. Google's paper on MapReduce MapReduce is basically a distributed processing model for very large files. Hadoop and Amazon EMR are basically implementations of MapReduce.
0 CommentsAsynchrony Asynchronous programming, also known as event-driven programming, is built on foundations of Futures/promises. The basic idea is that instead of having a thread wait for a blocked call to finish (i.
0 CommentsTraditional Message Queues Traditional message queues are based off of the JMS / AMQP standard. These message brokers focus on a pub/sub model where publishers write messages to a queue and the queue is consumed by subscribers.
0 CommentsNumPy NumPy is a Python library for numerical computing that offers multi-dimensional arrays and indices as data structures and additional high-level math utilities. ndarray The unique offering of NumPy is the ndarray data structure, which stands for n-dimensional array.
0 CommentsIntro In data processing, we often have to work with large amounts of data. The way in which this data is gathered comes in a few variants: batching, where we aggregate a collection of data (e.g., by hourly time), streaming for data that needs to be processed in real-time, and a unified variant which simply does not distinguish the technical difference between batching and streaming, allowing you to programmatically use the same API for both.
0 CommentsData Warehousing Software Hadoop Apache Hadoop is a framework for large-scale, distributed jobs that consists of these main components: MapReduce: jobs are distributed into a group of mapper tasks and then reduced (combined) into a single output HDFS: A distributed file system used by Hadoop, which is shared across the Hadoop cluster.
0 CommentsWhat is a DataFrame? A DataFrame is a special data structure used primarily by data scientists and machine learning algorithms. It contains row and column data in tabular fashion by storing metadata about each column and row.
0 CommentsUse Cases There are many ways to store your data. In this article we'll walk through some examples of data storage in common system designs. Reminder: There is no single best storage choice and they may vary heavily depending on things such as access patterns and scale.
0 CommentsBackground A lot of articles will talk about how to scale databases. Typically, they will talk about the purpose and the general idea of sharding and replication, but often times these topics are explained separately and not so much in conjunction.
0 CommentsNoSQL NoSQL is a category of databases that aren't relational. For example, MySQL would be a relational database, where as MongoDB would be a NoSQL database. Back then, relational databases were the tried-and-true, prevalent and reliable data stores.
0 CommentsScopes Python has closures, similar to Javascript, since functions are first class objects. But unlike Javascript, there are some subtle gotchas in regards to working with function scopes. nonlocal vs.
0 CommentsIntroduction Sharding can be summarized as a technique in which a database table can be split into multiple database servers to optimize read/write performance. Benefits include: Optimized query time Instead of having one huge database table, you have multiple smaller tables in more than one machine.
0 CommentsWebpack has been around since 2012 and it is a very popular tool nowadays. You'll see it mentioned in a lot of front-end stacks. I've personally used it to power this blog and a handful of my own React projects such as https://classic-ah.
0 CommentsIn this article, I want to go over some fundamental design concepts that are useful for coming up with system design. Requirements Functional Requirements Describes specific behaviors i.e. If a URL is generated, it is composed of a Base64 encoded alias Non-functional Requirements Describes architectural requirements i.
0 Comments