@rkenmi - Search Results

A primer on MapReduce

... paper on MapReduce. MapReduce is a distributed processing model for very large files. Hadoop is an open-source implementation of MapReduce, and Amazon EMR is a managed service that runs it. MapReduce lets the client define the map and reduce functions.

July 11, 2020
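
The MapReduce primer excerpt above says the client supplies the map and reduce functions. As a rough illustration only, here is a single-process Python sketch of that idea using word counting. The names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical and are not taken from the article, Hadoop, or Amazon EMR; a real framework would distribute the map, shuffle, and reduce steps across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Mapper (hypothetical): emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reducer (hypothetical): combine all counts emitted for a single word."""
    return word, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Run map, shuffle (group values by key), and reduce in one process."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapper(line) for line in lines):
        grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


word_counts = run_mapreduce(
    ["big data big files", "data on disk"], map_words, reduce_counts
)
print(word_counts)
```

The driver only groups values by key between the two phases, so swapping in a different mapper and reducer changes the job without touching `run_mapreduce`.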

NoSQL - the Radical Databases

... com.search.*, etc. for better data locality. Examples: Bigtable, HBase, Cassandra. Note: Hadoop and HBase are not the same thing. Hadoop consists of the Hadoop Distributed File System (HDFS), MapReduce, and a management bridge.

December 31, 2020

Big Data Processing: Batching vs. Streaming

... batch processing are files stored on disk. This is the case for MapReduce implementations such as Hadoop. These files may come from daily cron jobs or be exported from copies of OLTP (Online Transaction Processing) databases, such as a SQL database for inventory or customer purchases.

January 3, 2022

Big Data Cheat Sheet

Data Warehousing Software: Hadoop. Apache Hadoop is a framework for large-scale, distributed jobs that consists of these main components: MapReduce, in which jobs are distributed into a group of mapper tasks and then reduced (combined) into a single output; and HDFS, a distributed file system used by Hadoop, which is shared across the Hadoop cluster.

September 3, 2024

Data stores in Software Architectures

... ideas that come to mind: If we don't need real-time processing, we can load this data into a Hadoop cluster and have a batch ETL (extract, transform, and load) job store the data in a data warehouse, such as Amazon Redshift.

July 19, 2021