Search Results


5 matches found for 'Apache Spark'

Big Data Processing: Batching vs. Streaming

... vs. alternatives Today, a popular framework for batch data processing is Hadoop with Apache Spark as the dataflow engine. Spark was developed as a more performant alternative to traditional Hadoop MapReduce: it keeps intermediate results in the memory (RAM) of the distributed machines rather than writing them to disk between stages.


A primer on MapReduce

... you to define mappers and reducers without writing code. A more popular alternative is Apache Spark, which can be up to 100x faster than Hadoop MapReduce for some workloads, while also letting you bypass writing explicit mappers and reducers.
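To make the contrast concrete, here is a minimal pure-Python sketch of word counting done two ways: once with an explicit map phase and reduce phase (the MapReduce style), and once with a higher-level one-liner that handles grouping and summing for you (the style Spark's APIs resemble). The sample lines are made up for illustration; this does not use the actual Hadoop or Spark APIs.

```python
from collections import Counter
from itertools import chain

lines = ["spark is fast", "hadoop is batch", "spark is popular"]

# MapReduce style: an explicit map phase emits (word, 1) pairs,
# then an explicit reduce phase sums the counts per key.
mapped = [(word, 1) for line in lines for word in line.split()]
reduced = {}
for word, count in mapped:
    reduced[word] = reduced.get(word, 0) + count

# Higher-level style: grouping and summing are handled for you,
# much as Spark's RDD/DataFrame operations hide the shuffle details.
counts = Counter(chain.from_iterable(line.split() for line in lines))

assert reduced == dict(counts)
```

Both produce the same counts; the difference is how much bookkeeping you write yourself.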


Big Data Cheat Sheet

... and provides a SQL interface for writing queries that translate into MapReduce, Tez, or Apache Spark jobs. Three main features: (1) write SQL queries for ETL and analytical workloads; (2) access files from HDFS or other data storage systems like HBase; (3) impose structure on a variety of table formats. Hive introduced one of the earliest concepts of an open table format.


Comparison Charts of File Storage Formats

Big Data Encodings These encodings are often used with HDFS or other distributed file systems. Since the data can reach terabytes or petabytes, it is crucial to encode files in a space-efficient way that can also be read and written efficiently.
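As a rough illustration of why encoding choice matters at scale, the sketch below compares the size of the same records serialized as JSON text versus a fixed-width binary layout (similar in spirit to the compact binary formats used on HDFS). The sensor-reading data is hypothetical, and the `struct` layout is just one simple choice of binary encoding.

```python
import json
import struct

# Hypothetical dataset: 1,000 (id, value) sensor readings.
rows = [(i, i * 0.5) for i in range(1000)]

# Text encoding (JSON): human-readable but verbose.
text_bytes = json.dumps(rows).encode("utf-8")

# Binary encoding: a fixed-width 4-byte int plus 8-byte float per row,
# so every record costs exactly 12 bytes regardless of its digits.
binary_bytes = b"".join(struct.pack("<id", i, v) for i, v in rows)

print(len(text_bytes), len(binary_bytes))
```

At terabyte or petabyte scale, even a modest per-record saving like this compounds into large storage and I/O differences, which is why columnar and binary formats dominate in big data systems.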


Data stores in Software Architectures

Use Cases There are many ways to store your data. In this article we'll walk through examples of data storage in common system designs. Reminder: there is no single best storage choice; the right one depends heavily on factors such as access patterns and scale.