@rkenmi - Comparison Charts of File Storage Formats

Updated on January 24, 2022

Big Data Encodings

These encodings are often used with HDFS or another distributed file system. Because the data can reach terabytes or petabytes, files must be encoded compactly and laid out so that they can be read or written efficiently. For this reason, these encodings are generally not human readable (human-readable encodings are covered in the next section).
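
To illustrate the trade-off between compactness and readability, here is a minimal sketch that encodes one record as JSON text and as fixed-width binary. The record and its field layout are hypothetical, and Python's `struct` module only stands in for a binary encoding; the big data formats discussed here use their own, more elaborate layouts.

```python
import json
import struct

# Hypothetical record: a 64-bit integer id and a 64-bit float reading.
record = {"user_id": 1234567890123, "reading": 98.6}

# Text encoding: readable and self-describing, but field names and
# digits are stored as characters in every record.
as_text = json.dumps(record).encode("utf-8")

# Binary encoding: "<qd" means little-endian int64 followed by float64.
# The layout lives outside the data, so the bytes alone mean little to
# a human reader.
as_binary = struct.pack("<qd", record["user_id"], record["reading"])

print(len(as_text), as_text)
print(len(as_binary), as_binary.hex())

# Decoding requires the same layout that was used for writing.
user_id, reading = struct.unpack("<qd", as_binary)
```

The binary form is always 16 bytes here, regardless of how many digits the values have, while the text form grows with the field names and the printed digits.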

| File Encoding | Storage Format | Compression | Intended for | Schema Evolution | Other Notes |
|---|---|---|---|---|---|
| Avro | Row | Weak | Writes | Great | Works well with Kafka. Similar to Google Protobuf and Apache Thrift; allows for versioning and provides backward/forward compatibility with schemas. |
| Parquet | Columnar | Good | Reads | Limited | Works well with Spark. Can store highly nested data. |
| ORC | Columnar | Great | Reads | Good | Works well with Hive. Supports ACID transactions. |
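
The difference between row and columnar storage can be sketched in plain Python. This is only an illustration under assumed data: the records are hypothetical, and run-length encoding stands in for the more elaborate encodings that real columnar formats apply.

```python
from itertools import groupby

# Hypothetical dataset: three fields per record.
rows = [
    {"country": "US", "device": "ios", "clicks": 3},
    {"country": "US", "device": "android", "clicks": 5},
    {"country": "US", "device": "ios", "clicks": 2},
    {"country": "JP", "device": "ios", "clicks": 7},
]

# Row layout: the fields of each record are stored together, so
# appending a new record is one contiguous write.
row_layout = [tuple(r.values()) for r in rows]

# Columnar layout: the values of each field are stored together.
column_layout = {name: [r[name] for r in rows] for name in rows[0]}

# An aggregate over one field only needs that field's values.
total_clicks = sum(column_layout["clicks"])


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Values in one column share a type and often repeat, which is what
# makes them compress well.
encoded_country = run_length_encode(column_layout["country"])

print(total_clicks)
print(encoded_country)
```

Summing `clicks` never touches `country` or `device`, which is the intuition behind columnar formats being read-oriented, while the row layout keeps each write in one place.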

Human Readable Encodings

These encodings are often used to structure data in a way that is more convenient for developers and end users, with less emphasis on data size (in memory or on disk). For this reason, these formats are typically human readable.

For example, CSV and TSV are very popular output formats for data analysts who use programs like Microsoft Excel. JSON is also very popular for passing data around with REST APIs, and it is convenient for web browsers since it works natively with JavaScript.
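
As a rough sketch of both use cases, the snippet below writes the same record as CSV and as JSON using only Python's standard library and reads it back. The record itself is made up for illustration.

```python
import csv
import io
import json

# Hypothetical record with a string, a number, and a boolean.
records = [{"name": "widget", "price": 9.5, "in_stock": True}]

# CSV round trip: write a header row and one data row, then read back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price", "in_stock"])
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

# JSON round trip: serialize to text, then parse it again.
from_json = json.loads(json.dumps(records))

print(from_csv)   # every value comes back as a string
print(from_json)  # numbers and booleans keep their types
```

The CSV reader returns `"9.5"` and `"True"` as plain strings, so the consumer has to decide how to interpret them, whereas the JSON round trip preserves the number and the boolean.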

| File Encoding | Characteristics | Caveats | Other Notes |
|---|---|---|---|
| CSV, TSV, PSV | Comma Separated Values; Tab Separated Values; Pipe Separated Values | Cannot distinguish between string types and number types; parsing of delimiters and newlines inside values is ambiguous across implementations; column headers might or might not be present | Lightweight, simple, widely used |
| JSON | Can represent deeply nested data; key-value dictionary is easy to read; works well for representing data on the front end (JavaScript) | Has issues representing large numbers precisely (e.g. integers larger than 2^53 in parsers that store numbers as 64-bit floats); can be quite bloated in size; cannot hold raw binary strings (a hacky workaround is to encode them with Base64) | Simple, widely used with REST/GraphQL |
| Binary JSON (e.g. BSON) | Represents JSON as a binary sequence to save space | End result is not human readable; space savings compared to the original JSON are limited | |
| XML | Schema-friendly; verbose; can represent nested data | Similar to JSON, has issues representing large numbers precisely (e.g. larger than 2^53) and holding raw binary strings; similar to CSV, cannot distinguish between string types and number types; bloated in size, and XML markup can be too verbose for a human to read easily | Used widely for configuration files and SOAP architectures |
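
Two of the JSON caveats above, number precision and raw binary data, can be sketched as follows. The conversion to `float` only simulates a parser that stores every number as a 64-bit float (as JavaScript does); Python's own `json` module keeps integers exact. Sending large integers as strings is one common workaround, shown here as an assumption rather than something the formats require.

```python
import base64
import json

# Not every integer above 2**53 fits in a 64-bit float.
big = 2**53 + 1
as_double = float(big)  # simulates a double-based JSON parser
lost_precision = int(as_double) != big

# Python's json module keeps integers exact, so whether precision is
# lost depends on the consumer, not on the JSON text itself.
round_tripped = json.loads(json.dumps(big))

# Possible workaround: transmit large integers as strings.
payload = json.dumps({"id": str(big)})

# Raw bytes are not valid JSON values; Base64 turns them into ASCII
# text at the cost of extra size.
raw = bytes([0, 255, 16, 128])
encoded = base64.b64encode(raw).decode("ascii")
doc = json.dumps({"blob": encoded})
decoded = base64.b64decode(json.loads(doc)["blob"])

print(lost_precision, round_tripped == big)
print(len(raw), len(encoded), decoded == raw)
```

Base64 maps every 3 bytes to 4 characters, so the 4 raw bytes here become 8 characters, which adds to the size overhead JSON already has.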

Article Tags: unlisted, encodings, csv, tsv, parquet, avro, ORC