Search Results


6 matches found for 'Spark'

Big Data Processing: Batching vs. Streaming

... vs. alternatives. Nowadays, a popular framework for batch data processing is Hadoop with Apache Spark as the dataflow engine. Spark was developed as a more performant alternative to traditional Hadoop + MapReduce: it leverages the memory (RAM) of distributed machines to compute workflows, rather than writing intermediate results to disk between stages.
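A minimal PySpark sketch (not from the excerpted article) of the in-memory behavior described above: `cache()` pins a dataset in executor RAM so later actions reuse it instead of re-reading from disk. The file path and column name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-demo").getOrCreate()

# "events.csv" and "user_id" are hypothetical, for illustration only.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# cache() keeps the DataFrame in memory; the second action below
# reuses the cached data instead of recomputing the read from disk.
df.cache()
print(df.count())                     # first action materializes the cache
df.groupBy("user_id").count().show() # second action is served from RAM

spark.stop()
```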


A primer on MapReduce

... to define mappers and reducers without writing code. A more popular alternative is Apache Spark, which can be up to ~100x faster than Hadoop MapReduce for in-memory workloads, while also letting you bypass writing explicit mappers and reducers.
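As a sketch of that last point, here is the classic MapReduce word-count example written in PySpark (this example is mine, not the article's): the shuffle and aggregation that would need hand-written mapper and reducer classes are expressed as ordinary DataFrame transformations. The input path is a placeholder.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("input.txt")  # hypothetical input file

counts = (
    lines
    # split each line into words and flatten into one row per word
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .groupBy("word")  # Spark plans the shuffle MapReduce would
    .count()          # need an explicit reducer for
)
counts.show()

spark.stop()
```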


Big Data Cheat Sheet

... a SQL interface for you to write SQL queries that translate into MapReduce, Tez, or Apache Spark jobs. Three main features:
- Write SQL queries for ETL and analytical workloads
- Access files from HDFS or other data storage systems like HBase
- A mechanism to impose structure on a variety of table formats

Hive introduced one of the earliest concepts of an open table format.
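To illustrate the SQL-on-big-data idea in the excerpt, here is a small sketch using Spark's own SQL interface, which works much like Hive's: the query text compiles down to distributed jobs. The dataset path, table name, and columns are all made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# "sales.parquet" is a hypothetical dataset; register it as a table.
spark.read.parquet("sales.parquet").createOrReplaceTempView("sales")

# The analyst writes plain SQL; the engine turns it into a DAG of
# distributed tasks, analogous to Hive targeting MapReduce/Tez/Spark.
spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""").show()

spark.stop()
```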


DataFrames (a software engineer's perspective)

... needs to be specified at initialization. Modin is still pretty young and early in development.

Spark DataFrames: Spark is primarily Java/Scala based, which might be difficult to work with when passing datasets over from Python.
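A short sketch (my own, under the excerpt's framing) of that Python-to-JVM hand-off: a pandas frame lives in the Python process, the Spark DataFrame on the JVM side, so data is serialized across the boundary in both directions.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

pdf = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

sdf = spark.createDataFrame(pdf)  # ships rows from Python to the JVM
sdf.show()

back = sdf.toPandas()             # collects results back into Python
print(back)

spark.stop()
```

Enabling Apache Arrow (the `spark.sql.execution.arrow.pyspark.enabled` config) can speed up these conversions considerably.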


Data stores in Software Architectures

... intended for frequent writes. If we do need real-time processing, we could feed this data into Spark Streaming and store the output in a database like Cassandra, or in a message queue like RabbitMQ for notifications.
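A minimal sketch of that pipeline using Spark's Structured Streaming API (the newer successor to the DStream-based Spark Streaming the excerpt names). The article's Cassandra/RabbitMQ sinks would need their respective connectors, so a console sink stands in here; the socket source, host, and port are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Placeholder source: text lines arriving on a local socket.
events = (spark.readStream.format("socket")
          .option("host", "localhost").option("port", 9999).load())

# Running count per word; in the article's setup this result would be
# written to Cassandra (storage) or RabbitMQ (notifications) instead.
counts = (
    events
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
)

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```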


Comparison Charts of File Storage Formats

... with schemas.

Format  | Layout   | Reads | Writes  | Notes
Parquet | Columnar | Good  | Limited | Works well with Spark. Can store highly nested data.
ORC     | Columnar | Great | Good    | Works well with Hive.
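A quick sketch (not from the chart itself) of why Parquet pairs well with Spark: nested structures round-trip natively through a Parquet write and read. The paths and fields are made up for illustration.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Hypothetical rows with a nested struct ("user") inside each record.
rows = [
    Row(user=Row(id=1, tags=["a", "b"]), amount=9.5),
    Row(user=Row(id=2, tags=["c"]), amount=3.0),
]
df = spark.createDataFrame(rows)

df.write.mode("overwrite").parquet("/tmp/orders.parquet")  # columnar on disk

# Nested fields remain addressable after the round trip.
spark.read.parquet("/tmp/orders.parquet").select("user.id", "amount").show()

spark.stop()
```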