What is Avro?

Avro is a data serialization system that allows big data to be exchanged between programs written in any language. In this article, we discuss what Avro is and provide an example of an Avro schema. We'll also compare Avro to other data formats like JSON and Parquet.

What is Avro?

Avro is a language-neutral data serialization system. It provides both data serialization and data exchange.

Data Serialization

Data serialization is the process of transforming data into a compact binary format so that it can more easily be transferred over a network or stored.
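As a quick, hedged illustration using the official avro package for Python (schemas themselves are covered in the sections below), the snippet here serializes a single record to raw bytes with Avro's binary encoder:

import io

import avro.schema
from avro.io import BinaryEncoder, DatumWriter

# A one-field schema written inline as JSON; schemas are explained below.
schema = avro.schema.parse("""
{
  "type": "record",
  "name": "Ping",
  "fields": [{"name": "message", "type": "string"}]
}
""")

buffer = io.BytesIO()
DatumWriter(schema).write({"message": "hello"}, BinaryEncoder(buffer))

# Six bytes total: a length prefix plus the UTF-8 text, no field names.
print(buffer.getvalue())  # b'\nhello'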

Data Exchange

Data serialization is also fundamental to how the nodes in a Hadoop cluster communicate with one another. Hadoop uses remote procedure calls (RPCs) internally, serializing each message before sending it across the cluster. This form of data exchange works well because the serialized messages are compact, fast to process, and extensible, and the protocols used to exchange information can easily evolve over time.

How Avro Works

Users write Avro schemas in JSON. A schema describes the structure of the data and is stored alongside the Avro data itself, so individual records don't need to carry their own type information, which keeps the encoded data small. After a schema is defined, a program reads it either through classes generated from the schema or by parsing the schema directly at runtime.
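In Java, for example, build tooling typically generates classes from the schema, while dynamically typed languages usually take the direct-parsing path. A minimal sketch of the latter in Python (the file name user.avsc is illustrative):

import avro.schema

# Parse a schema file directly at runtime.
with open("user.avsc", "rb") as schema_file:
    schema = avro.schema.parse(schema_file.read())

print(schema.fullname)                          # e.g. ProjectName.User
print([field.name for field in schema.fields])  # the declared field names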

Once the schemas have been read, users can serialize and deserialize Avro data via the Avro API, which is available in the following languages (a Python round trip is sketched after this list):

  • C
  • C++
  • C#
  • Go
  • Haskell
  • Java
  • Perl
  • PHP
  • Python
  • Ruby
  • Scala
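As a hedged example of that workflow (file and field names are illustrative), the Python sketch below writes records to an Avro data file and reads them back. The DataFile classes produce Avro's standard object container format, which stores the schema in the file header:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# An illustrative one-field schema.
schema = avro.schema.parse(
    '{"type": "record", "name": "Greeting",'
    ' "fields": [{"name": "text", "type": "string"}]}'
)

# Serialize: append records that conform to the schema.
writer = DataFileWriter(open("greetings.avro", "wb"), DatumWriter(), schema)
writer.append({"text": "hello"})
writer.append({"text": "world"})
writer.close()

# Deserialize: the reader recovers the schema from the file itself.
reader = DataFileReader(open("greetings.avro", "rb"), DatumReader())
for record in reader:
    print(record)  # {'text': 'hello'} then {'text': 'world'}
reader.close()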

Avro Example

The following is an example of an Avro schema.

{
   "type" : "record",
   "namespace" : "ProjectName",
   "name" : "User",
   "fields" : [
      { "name" : "username" , "type" : "string" },
      { "name" : "age" , "type" : ["int", "null"] }
   ]
}

This is a basic Avro schema. You'll notice we use JSON to define the document's type (here, a record), a namespace, and the schema name. The type, name, and fields attributes are required for a record; the namespace is optional, but it helps keep schema names unique across projects.

Notice how we set the fields attribute to an array of objects, each with a name and a type. The age field uses a union to allow more than one type, in this case "int" or "null", which makes the field nullable.
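To make the union concrete, here is a small Python sketch (the file name is illustrative) that writes one record for each branch of the union against the schema above:

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# The schema from this article, parsed inline.
schema = avro.schema.parse("""
{
   "type" : "record",
   "namespace" : "ProjectName",
   "name" : "User",
   "fields" : [
      { "name" : "username" , "type" : "string" },
      { "name" : "age" , "type" : ["int", "null"] }
   ]
}
""")

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"username": "alice", "age": 34})  # matches the "int" branch
writer.append({"username": "bob", "age": None})  # matches the "null" branch
writer.close()

A record whose age matched neither branch, such as a string, would be rejected with a type error at write time.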

Avro schemas support the following data types (an example combining several of the complex types follows the list):

Primitive types:

  • null
  • boolean
  • int
  • long
  • float
  • double
  • bytes
  • string

Complex types:

  • record
  • enum
  • array
  • map
  • union
  • fixed
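The complex types can be nested inside one another. As an illustrative sketch (the Profile and Status names are invented for the example), the schema below places an enum, an array, and a map inside a record:

import avro.schema

# An invented schema combining several complex types.
schema = avro.schema.parse("""
{
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "status",
     "type": {"type": "enum", "name": "Status",
              "symbols": ["ACTIVE", "INACTIVE"]}},
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "scores", "type": {"type": "map", "values": "double"}}
  ]
}
""")

print(schema.fields[0].type)  # the nested enum definition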

Avro vs JSON

JSON is plain text, so it isn't compact the way Avro's binary encoding is; if you store your data as JSON and want it small, you have to compress it yourself. On the other hand, JSON is simpler to start with because you don't have to define types and schemas up front: each JSON document describes itself by repeating its field names in every record. Avro separates the schema from the records, but the schema still travels with the data; an Avro data file stores it once in the file header, so the records themselves carry no field names.
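A rough size comparison makes the trade-off concrete. The sketch below (an illustration, not a benchmark) encodes the same record as UTF-8 JSON and as a single Avro binary datum:

import io
import json

import avro.schema
from avro.io import BinaryEncoder, DatumWriter

schema = avro.schema.parse(
    '{"type": "record", "name": "User", "fields": ['
    '{"name": "username", "type": "string"},'
    '{"name": "age", "type": ["int", "null"]}]}'
)

record = {"username": "alice", "age": 34}

as_json = json.dumps(record).encode("utf-8")

buffer = io.BytesIO()
DatumWriter(schema).write(record, BinaryEncoder(buffer))
as_avro = buffer.getvalue()

# JSON repeats the field names in every record; Avro encodes only the values.
print(len(as_json), len(as_avro))  # 32 vs 8 bytes for this record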

Avro vs Parquet

Parquet is a column-based storage format for Hadoop, while Avro is a row-based format. For more on the comparison between Avro and Parquet, check out Avro or Parquet?.

Conclusion

Avro functions as both a data serialization and a data exchange system. Using Avro, users define schemas in JSON that are then read by applications via the Avro API. While more involved than plain JSON, Avro encodes data into a compact binary format that is more efficient for network communication and persistent data storage.
