Hadoop Data Formats: Avro or Parquet?

Deciding how to store data in HDFS is a complex task. While there is no single best data format for every system, Avro and Parquet are both popular formats for storing big data in HDFS. In this article, we'll discuss the most common data formats used with Hadoop and compare use cases for Avro and Parquet.

Hadoop Data Formats

Generally speaking, HDFS supports any data format. From video to text, you can store virtually anything in HDFS. Below is a list of the most popular data formats used with Hadoop along with a brief explanation:

Plain text

This includes formats like CSV and tab-delimited files. Such files are easily readable because they aren't serialized or compressed. Plain text files are stored as-is and are popular for archived data sets that don't require much processing.

Sequence Files

Sequence files were originally designed for Hadoop MapReduce. They solve Hadoop's "small file problem" by packing many small files into a single larger file that is splittable and supports compression. The result is a format optimized for Hadoop MapReduce.

Avro

Avro is a data serialization framework that uses JSON to define data types and schemas. It serializes data in a compact binary format. Using Avro, you can store complex objects natively in HDFS. Unlike sequence files, Avro also allows for schema evolution since schema information is stored with the data itself.
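
As a rough illustration, here's a minimal sketch of writing and reading Avro from Python, assuming the third-party fastavro package (the schema, field names, and users.avro file are hypothetical examples):

    # Minimal sketch using the third-party fastavro package.
    # The schema and records below are hypothetical examples.
    from fastavro import writer, reader, parse_schema

    # Avro schemas are defined in JSON; here, as the equivalent Python dict.
    schema = parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    })

    records = [{"name": "Ada", "age": 36}, {"name": "Lin", "age": 29}]

    # Write records in Avro's compact binary format; the schema travels
    # with the data, which is what enables schema evolution.
    with open("users.avro", "wb") as out:
        writer(out, schema, records)

    # Read them back; the embedded writer schema is used to decode.
    with open("users.avro", "rb") as f:
        for record in reader(f):
            print(record)

Note that the reader needs no external schema: it decodes using the schema embedded in the file itself.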

Parquet

Parquet is a columnar format, meaning the values of each column are stored adjacent to one another on disk. This drastically improves performance for queries that only need a few specific columns from each record. Parquet also provides superior compression and is splittable as well.
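
As a sketch of that behavior, here's a minimal example using the pyarrow library (the file and column names are hypothetical) that writes a small Parquet file and then reads back a single column:

    # Minimal sketch using the pyarrow library; the file and
    # column names are hypothetical.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "user_id": [1, 2, 3],
        "country": ["US", "DE", "JP"],
        "spend":   [12.5, 7.0, 31.2],
    })

    # Parquet compresses each column chunk; snappy is the default codec.
    pq.write_table(table, "events.parquet", compression="snappy")

    # The columnar layout lets the reader fetch only the columns it
    # needs, skipping the rest of the file entirely.
    spend_only = pq.read_table("events.parquet", columns=["spend"])
    print(spend_only)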

Avro vs Parquet

Both Avro and Parquet are more advanced than storing data as plain text. They both serialize and compress data, saving valuable storage space and improving performance.

Row-based vs Column-based Storage

One of the key differences between Avro and Parquet is how they physically lay out data. Avro is a row-based format, while Parquet is a column-based format. This means Avro stores all the fields of a record together, while Parquet groups values by column.
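
To make the difference concrete, here's a toy sketch in plain Python (with hypothetical data) of the same records laid out row-wise versus column-wise:

    # Toy illustration of the two physical layouts (hypothetical data).

    # Row-based (Avro-style): each record's fields are stored together,
    # so reading one whole record is a single contiguous read.
    rows = [
        {"name": "Ada", "age": 36, "city": "NYC"},
        {"name": "Lin", "age": 29, "city": "SFO"},
    ]

    # Column-based (Parquet-style): all values of a column are stored
    # together, so reading one column never touches the others.
    columns = {
        "name": ["Ada", "Lin"],
        "age":  [36, 29],
        "city": ["NYC", "SFO"],
    }

    print(rows[0])         # whole record: cheap in a row layout
    print(columns["age"])  # whole column: cheap in a column layout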

The main advantage of a row-based format is returning whole rows of data. When you need every attribute or field associated with a record, Avro lets you retrieve the entire record in a single read. Use Avro when your queries return complete records rather than a handful of fields.

The Avro format is also language independent. Avro data is described by schemas that can be shared across applications written in different languages.

The main advantage of a column-based format is faster access to individual columns. When you perform aggregations over large data sets that touch only a few columns, Parquet is preferred. While more computationally intensive on the write side, Parquet is much faster for big data queries.
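
As a sketch of that access pattern, an aggregation with pyarrow can read just the one column it needs (this reuses the hypothetical events.parquet file from the earlier example):

    # Sketch of a column-pruned aggregation with pyarrow;
    # events.parquet and the spend column are hypothetical.
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Only the "spend" column is read from disk; the bytes of the
    # other columns are never touched.
    table = pq.read_table("events.parquet", columns=["spend"])
    total = pc.sum(table.column("spend"))
    print(total.as_py())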

Conclusion

Because it partitions data both horizontally (into row groups) and vertically (into column chunks), Parquet is most efficient for reads where only a subset of columns is required. Avro is better for storing data in row format and for queries where entire records must be returned.

Both formats compress and split data, which plays well with Hadoop MapReduce and Spark. While Parquet is more computationally intensive with large-scale compression and writing, it is superior for big data queries. Avro, on the other hand, is more flexible, with a self-contained schema for serialization, and it is better suited to reads and writes that must return entire rows.

Your thoughts?