Big Data Encodings

These encodings are often used with HDFS or another distributed file system. Because the data can reach terabytes or petabytes, it is crucial that files are encoded compactly and can be read or written efficiently. For this reason, these encodings are generally not human readable (see the next section for human-readable formats).

File Encoding	Storage Format	Compression	Intended for	Schema Evolution	Other Notes
Avro	Row	Weak	Writes	Great	Works well with Kafka. Similar to Google Protobuf and Apache Thrift; allows for versioning and provides backward/forward compatibility with schemas.
Parquet	Columnar	Good	Reads	Limited	Works well with Spark. Can store highly nested data.
ORC	Columnar	Great	Reads	Good	Works well with Hive. Supports ACID transactions.
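
The row versus columnar distinction in the table can be illustrated with a small standard-library sketch. It does not use Avro, Parquet, or ORC themselves; the records, field names, and helper functions are hypothetical, and JSON plus zlib stand in for the real binary encodings only to show how the two layouts group values differently.

```python
import json
import zlib

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"user_id": i, "country": "US" if i % 2 == 0 else "CA", "clicks": i % 5}
    for i in range(1000)
]


def to_row_layout(rows):
    """Row-oriented: the fields of each record are stored next to each other."""
    as_lists = [[r["user_id"], r["country"], r["clicks"]] for r in rows]
    return json.dumps(as_lists).encode("utf-8")


def to_columnar_layout(rows):
    """Column-oriented: all values of one field are stored next to each other."""
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    return json.dumps(columns).encode("utf-8")


def from_columnar_layout(blob):
    """Rebuild the original records from the column-oriented blob."""
    columns = json.loads(blob.decode("utf-8"))
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*(columns[k] for k in keys))]


def read_column(blob, name):
    """Return a single column.

    A real columnar file would read only this column's bytes; this sketch
    parses the whole blob for simplicity.
    """
    return json.loads(blob.decode("utf-8"))[name]


row_blob = to_row_layout(records)
col_blob = to_columnar_layout(records)

# Compare how well each layout compresses with a general-purpose codec.
print("row layout, compressed bytes:     ", len(zlib.compress(row_blob)))
print("columnar layout, compressed bytes:", len(zlib.compress(col_blob)))
```

Grouping similar values together is one reason columnar formats tend to compress well and suit read-heavy analytics that touch only a few columns, while row layouts make it cheap to append one complete record at a time.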

Human Readable Encodings

These encodings structure data in a way that is convenient for developers and end users, with less emphasis on data size (in memory or on disk). For this reason, these formats are typically human readable.

For example, CSV and TSV are very popular output formats for data analysts who use programs like Microsoft Excel. JSON is also very popular for passing data around with REST APIs, and it is convenient for web browsers because it works natively with JavaScript.
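
As a minimal sketch of these two formats, the following writes the same records to CSV and to JSON with Python's standard library and reads them back. The sample rows and the helper names `to_csv` and `from_csv` are hypothetical, chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical sample rows; note that zip_code is deliberately a string.
rows = [
    {"name": "Ada", "age": 36, "zip_code": "02139"},
    {"name": "Grace", "age": 45, "zip_code": "10001"},
]


def to_csv(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of dicts (every value is a string)."""
    return list(csv.DictReader(io.StringIO(text)))


csv_text = to_csv(rows)
json_text = json.dumps(rows)

csv_rows = from_csv(csv_text)
json_rows = json.loads(json_text)

# CSV carries no type information, so the reader hands back strings.
print(type(csv_rows[0]["age"]).__name__)
# JSON distinguishes numbers from strings, so the integer survives.
print(type(json_rows[0]["age"]).__name__)
```

After a CSV round trip the consumer has to decide which columns are numeric, whereas the JSON round trip returns the original values unchanged.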

File Encoding	Characteristics	Caveats	Other Notes
CSV, TSV, PSV	Comma Separated Values; Tab Separated Values; Pipe Separated Values	Cannot distinguish between string types and number types; ambiguity with parsing delimited values and newlines across implementations; ambiguity with column headers that might or might not be present; some formats are unescaped, causing additional complications when rows have embedded values such as double quotes	Lightweight, simple, widely used
JSON	Can represent deeply nested data; key-value dictionary is easy to read; works well for representing data on the front-end (JavaScript)	Can lose precision on large numbers (e.g. integers larger than 2^53 in parsers that use double-precision floats, such as JavaScript); can be quite bloated in size; cannot embed raw binary strings, so a common workaround is to encode them with Base64	Simple, widely used with REST/GraphQL
Binary JSON (e.g. BSON)	Represents JSON as a binary sequence to save space	End result is not human-readable. Space savings compared to original JSON are limited.
XML	Schema-friendly; verbose; can represent nested data	Similar to JSON, large numbers (e.g. beyond 2^53) can lose precision when parsed into double-precision floats, and raw binary strings cannot be inserted directly; similar to CSV, it cannot distinguish between string types and number types without a schema; size is bloated, and XML markup can be too verbose and hard for a human to read	Used widely for configuration files and SOAP architectures
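
Two of the JSON caveats in the table can be reproduced with a short standard-library sketch. Python's own `json` module keeps integers exact, so the example passes `parse_int=float` to imitate a parser that stores every number as a double (as JavaScript does); that substitution, the sample payload, and the variable names are assumptions made for illustration.

```python
import base64
import json

# --- Caveat 1: precision of large integers ---
big = 2**53 + 1
text = json.dumps({"id": big})

# Python parses JSON integers exactly.
exact = json.loads(text)["id"]

# Imitate a double-based parser: every integer becomes a float,
# so 2**53 + 1 is rounded to the nearest representable double.
lossy = json.loads(text, parse_int=float)["id"]
print(exact == big, int(lossy) == big)

# --- Caveat 2: raw binary data ---
payload = bytes([0, 255, 16, 128])  # hypothetical binary blob

try:
    json.dumps({"blob": payload})
    raw_bytes_accepted = True
except TypeError:
    # bytes are not a JSON type, so the encoder rejects them.
    raw_bytes_accepted = False

# Workaround: carry the bytes as Base64 text and decode on the other side.
encoded = json.dumps({"blob": base64.b64encode(payload).decode("ascii")})
decoded = base64.b64decode(json.loads(encoded)["blob"])
```

Base64 text is larger than the raw bytes it encodes (every 3 bytes become 4 characters), which is part of why binary-heavy payloads grow when carried in JSON.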

Article Tags: encodings, csv, tsv, parquet, avro, ORC

Comparison Charts of File Storage Formats


Updated on April 23, 2023
