MongoDB Schema Design: The 2 Questions to Ask Yourself

Note: The following is a more detailed discussion of our first post on MongoDB Schema Design. We encourage you to check out that article before continuing...

Since MongoDB is a document database that favors a denormalized structure, you have to ask yourself different questions than you would with a more traditional RDBMS. Factors like document growth and application-level joins are important in deciding the relationships between different collections and documents. In this article, we discuss the two most important questions you need to ask yourself when designing a schema for MongoDB.

Preface: Normalized vs Denormalized Data

Traditional RDBMSs are considered normalized databases, meaning data is organized into different relational tables. Relationships exist between these tables to both reduce data redundancy and improve data integrity. For example, a Products table could include references to a separate Parts table.

The advantage of this design is that atomic reads and writes can take place without duplicating data. You can read from or write to the Products table and also perform joins to return all the parts for a particular product.

By design, MongoDB is fundamentally different from these normalized databases. Through a document-oriented design, data is represented as JSON-like documents (stored as BSON), with nested arrays and subdocuments to represent relationships. For example, a Product document may include a nested array of Parts.

There are a few advantages to this denormalized design. Since documents are essentially just JSON objects, your schema can evolve more freely over time. Reads and writes can also be faster, since each data entity is stored in a single nested document. For example, you wouldn't have to perform an extra join on a Products query to retrieve Parts information. It's all included in the same document!
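To make the embedded design concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for a MongoDB document. The field names (`parts`, `sku`) and the sample values are assumptions made for illustration; they are not part of any real schema.

```python
# Hypothetical embedded design: Parts live inside the Product document.
# A plain dict stands in for a MongoDB document; all names are illustrative.
product = {
    "_id": "prod-1",
    "name": "Bicycle",
    "parts": [
        {"name": "Frame", "sku": "FR-100"},
        {"name": "Wheel", "sku": "WH-200"},
    ],
}


def part_names(product_doc):
    """Read part names straight from the embedded array (no second lookup)."""
    return [part["name"] for part in product_doc["parts"]]


print(part_names(product))
```

Once the Product document has been fetched, everything needed to list its Parts is already in hand.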

With that said, it is possible to achieve some normalization with MongoDB. There's nothing that stops you from splitting Products and Parts into separate collections and referencing one from the other. Whether or not this makes sense for your schema design is the premise of this article...

1) How will you query the data?

One of the most important questions to ask yourself when designing your schema is "How will I be querying the data?". This question is so important because it dictates how you structure your documents and the relationships between different entities. For example, let's say your data store will be composed of different Products having different Parts. If your application will primarily be querying against Products, then nesting the Parts within each Product document may make the most sense. You can perform faster reads on the Products collection and get Parts information without having to perform a join on a separate collection.

But what if you want to query against Parts separately? In this case, you may want to store Parts as a separate collection and only store references to these Parts in each Product. Although this means an application-level join is necessary to get Part-specific information for a given Product, you won't have to worry about querying every Product just to get information about different Parts.
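The referenced design and its application-level join might look roughly like the sketch below. Dicts keyed by `_id` stand in for the two collections, and the `part_ids` and `supplier` fields are hypothetical names chosen for this example.

```python
# Hypothetical referenced design: Parts and Products are separate collections.
# Dicts keyed by _id stand in for collections; all names are illustrative.
parts_collection = {
    "part-1": {"_id": "part-1", "name": "Frame", "supplier": "Acme"},
    "part-2": {"_id": "part-2", "name": "Wheel", "supplier": "Globex"},
}
products_collection = {
    "prod-1": {"_id": "prod-1", "name": "Bicycle", "part_ids": ["part-1", "part-2"]},
}


def parts_for_product(product_id):
    """Application-level join: fetch the Product, then each referenced Part."""
    product = products_collection[product_id]  # first lookup
    return [parts_collection[pid] for pid in product["part_ids"]]  # second lookup


def parts_by_supplier(supplier):
    """Query Parts directly, without scanning any Product documents."""
    return [p for p in parts_collection.values() if p["supplier"] == supplier]


print([p["name"] for p in parts_for_product("prod-1")])
print([p["name"] for p in parts_by_supplier("Acme")])
```

The join costs an extra lookup per Product read, but queries that only concern Parts never have to touch the Products collection.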

Another option would be to store some of the Parts information (along with the reference to a given Part) within a given Product document. For example, if you want to get the names of the different Parts for a given Product without performing an application-level join, you could store each Part name along with the Part reference in a nested array within the Product document. The downside is that renaming a Part requires multiple updates: one on the Part itself within the Parts collection and a separate update for every Product that references that Part.

This exemplifies the importance of the question "How will you query the data?". If you will frequently be performing reads on Products and only want the names of a Product's Parts, then storing the name along with the reference as part of the Product makes sense. This allows you to keep Products and Parts as separate collections without having to perform application-level joins. However, if you will be making frequent updates to the Parts collection, then this may not be a good design, as you will have to perform multiple updates each time you update a Part name.
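The update cost of this hybrid design can be sketched as follows. Again, dicts stand in for collections, and the `part_id` field and the `rename_part` helper are hypothetical names used only for this illustration.

```python
# Hypothetical hybrid design: each Product stores a Part reference plus a
# copy of the Part name. All names and values here are illustrative.
parts = {"part-1": {"_id": "part-1", "name": "Wheel"}}
products = {
    "prod-1": {"_id": "prod-1", "name": "Bicycle",
               "parts": [{"part_id": "part-1", "name": "Wheel"}]},
    "prod-2": {"_id": "prod-2", "name": "Unicycle",
               "parts": [{"part_id": "part-1", "name": "Wheel"}]},
    "prod-3": {"_id": "prod-3", "name": "Skateboard", "parts": []},
}


def rename_part(part_id, new_name):
    """Rename a Part everywhere; return how many documents were written."""
    parts[part_id]["name"] = new_name  # one write to the Parts collection
    writes = 1
    for product in products.values():
        touched = False
        for ref in product["parts"]:
            if ref["part_id"] == part_id:
                ref["name"] = new_name  # keep the duplicated name in sync
                touched = True
        if touched:
            writes += 1  # one extra write per referencing Product
    return writes


print(rename_part("part-1", "Alloy Wheel"))
```

Reading a Product's Part names stays a single lookup, while the number of writes per rename grows with the number of Products that reference the Part.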

2) How will your documents grow?

Document growth is another important thing to consider in MongoDB schema design. Let's say you have a User collection in which each User has a nested array of followers. Even if this array holds nothing more than an ObjectId reference to every other User following that particular User, things could get out of hand if a User has millions of followers: the references alone can add up to more than a single document is allowed to hold.

This is where the cardinality of data relationships comes into play. For a one-to-few relationship, nested references as described above make sense. For one-to-infinity relationships, this could cause issues since documents can only grow so much (the maximum BSON document size is 16 megabytes).
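As a rough back-of-the-envelope check, the sketch below estimates how large a followers array of ObjectId references becomes. It assumes the BSON array layout (a 4-byte length, then per element a type byte, the array index as a null-terminated string key, and a 12-byte ObjectId, followed by a trailing null byte) and ignores every other field in the User document, so treat the numbers as approximate. The function name is made up for this example.

```python
# Rough estimate of the BSON size of an array holding only ObjectIds.
# Assumes the BSON array layout described above; ignores all other fields.
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's per-document limit (16 MiB)
OBJECT_ID_BYTES = 12


def follower_array_bytes(follower_count):
    """Approximate encoded size of an array of `follower_count` ObjectIds."""
    size = 4 + 1  # int32 length prefix + trailing null terminator
    for index in range(follower_count):
        key_bytes = len(str(index)) + 1  # index as a string key + null byte
        size += 1 + key_bytes + OBJECT_ID_BYTES  # type byte + key + value
    return size


for count in (1_000, 100_000, 1_000_000):
    size = follower_array_bytes(count)
    print(count, size, "fits" if size <= MAX_BSON_BYTES else "exceeds limit")
```

Under these assumptions, a hundred thousand follower references fit comfortably in one document, while a million do not, which is why unbounded relationships are usually better kept in a separate collection.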

Conclusion

Document growth and the nature of your application's queries are the most important things to consider when designing your MongoDB schema. There are trade-offs to data normalization, and schema design is more or less a balancing act based on the needs of your app.

Your thoughts?