MongoDB Schema Design

A major advantage of MongoDB is its flexible, "schema-less" design. With MongoDB, you store records as documents (sets of key/value pairs) within collections. Documents in the same collection can have different fields with different data types, making for a more flexible and denormalized design. Additionally, documents can embed sub-documents and arrays to manage data relationships more efficiently.

While this flexibility allows your schema to adapt quickly to changing requirements, it poses certain challenges. Specifically, managing relationships like one-to-many and many-to-many can be more difficult with a denormalized structure. In this article, we discuss the different approaches to handling relationships in MongoDB. We'll weigh the advantages and disadvantages of embedding versus referencing and look at some real-world examples and best practices for data modeling with MongoDB.

Preface: Understanding Cardinality

Data modeling for MongoDB requires a shift in thinking from a more traditional RDBMS. While databases like MySQL and PostgreSQL emphasize the relationships between different tables, document databases (like MongoDB) represent related data as attributes or embedded data structures within documents.

Take a basic example of modeling a relationship between users and posts. A relational database may represent this like so:

User Table
id,  name, email,         createdAt,  updatedAt
112, Sam,  sam@gmail.com, 12-01-2017, 12-01-2017

Post Table
id,   title,             location,      user_id, createdAt,  updatedAt
234,  "favorite pizza",  "Los Angeles", 112,     04-06-2017, 05-02-2017
432,  "need a break?",   "Sacramento",  112,     04-06-2017, 05-02-2017

Notice how the relationship between users and posts is defined via the user_id foreign key on the Post table. With MongoDB, we could represent this same relationship like so:

User Document
{
  id: 112,
  name: "Sam",
  email: "sam@gmail.com",
  createdAt: "12-01-2017",
  updatedAt: "12-01-2017",
  posts:[
    {title:"favorite pizza", location:"Los Angeles"},
    {title:"need a break?", location:"Sacramento"}
  ]
}

Instead of the posts being in a separate table, they are embedded within a User document. This nested modeling can provide much faster lookups as joins aren't required. We can retrieve a user and all of the posts in a single request.
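To make the single-request point concrete, here is a minimal sketch. It assumes a plain in-memory array standing in for the users collection (no database connection), and the helper name findUserById is made up for illustration.

```javascript
// In-memory stand-in for a `users` collection (an assumption for
// illustration; no database connection is involved).
const users = [
  {
    id: 112,
    name: "Sam",
    email: "sam@gmail.com",
    posts: [
      { title: "favorite pizza", location: "Los Angeles" },
      { title: "need a break?", location: "Sacramento" },
    ],
  },
];

// One lookup returns the user together with every embedded post.
// Against a real database this would be a single query, for example
// db.users.findOne({ id: 112 }) in the mongo shell.
function findUserById(id) {
  return users.find((user) => user.id === id) ?? null;
}

const sam = findUserById(112);
const samPostTitles = sam.posts.map((post) => post.title);
console.log(samPostTitles);
```

Because the posts travel with the user, no second lookup is needed to list Sam's post titles.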

What if a user has millions of posts? This could become problematic, since a single document has a maximum size of 16 MB. Thankfully, MongoDB allows you to reference documents in other collections to create a one-to-many relationship:

User Document
{
  id: 112,
  name: "Sam",
  email: "sam@gmail.com",
  createdAt: "12-01-2017",
  updatedAt: "12-01-2017",
  posts:[234, 432]
}

Post Documents
{
  id: 234,
  title: "favorite pizza",
  location: "Los Angeles"
}

{
  id: 432,
  title: "need a break?",
  location: "Sacramento"
}

Notice how we are now referencing the IDs of documents in a separate Post collection, which accommodates a much larger number of entries.
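An array of references is still bounded by the 16 MB document limit mentioned above. The sketch below is a back-of-the-envelope estimate of how many references fit, not an exact figure. It assumes the ids are 12-byte ObjectIds (the examples in this article use small integers instead), that the rest of the document is negligible, and that the limit is 16 * 1024 * 1024 bytes. The function name estimateMaxReferences is made up for illustration.

```javascript
// Rough estimate of how many ObjectId references fit in one 16 MB document.
// Assumptions: ids are 12-byte ObjectIds and every other field in the
// document is small enough to ignore.
const MAX_DOCUMENT_BYTES = 16 * 1024 * 1024;
const OBJECT_ID_BYTES = 12;

function estimateMaxReferences(budgetBytes) {
  // A BSON array is stored like a document whose keys are "0", "1", "2", ...
  // Start with the array's own overhead: 4-byte length prefix + 1 terminator.
  let used = 5;
  let count = 0;
  while (true) {
    // Each element: 1 type byte + the index as a string + 1 null byte + value.
    const elementBytes = 1 + String(count).length + 1 + OBJECT_ID_BYTES;
    if (used + elementBytes > budgetBytes) break;
    used += elementBytes;
    count += 1;
  }
  return count;
}

const maxReferences = estimateMaxReferences(MAX_DOCUMENT_BYTES);
console.log(`Roughly ${maxReferences} references fit in a single document`);
```

Under these assumptions the ceiling is in the hundreds of thousands of references: generous for posts per user, but not enough for a relationship that can grow into the millions.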

These decisions (to embed or to reference documents) are fundamental to MongoDB schema design. Specifically, it is the cardinality of a relationship that dictates the design. Cardinality here is a fancy way of saying how many documents can sit on the "many" side of a relationship. When the cardinality of a one-to-many relationship (like user to posts) is low, embedding sub-documents is an efficient alternative to separate collections. Conversely, when the cardinality is high or effectively unbounded, separating the data into different collections can be a good idea. This also allows you to query the data independently of its parent documents.

Modeling relationships

There are limitless ways to design data models in MongoDB. A combination of embedding and referencing should be used based on the cardinality of the relationships and the nature of the application's queries. Below is a description of the different relationship types, each with an example and its advantages and disadvantages.

MongoDB One-to-Few

With a one-to-few relationship, embedding is most appropriate. For example, let's say you have a specific number of roles that a user can have:

User Document
{
  id: 112,
  name: "Sam",
  email: "sam@gmail.com",
  createdAt: "12-01-2017",
  updatedAt: "12-01-2017",
  roles:[
    {title:"admin", canEdit:true},
    {title:"customer", canEdit:false}
  ]
}

Embedding makes sense here because there is a set number of roles a User can have. Since we know the number of roles won't grow without bound, we can safely store them directly in the User object. The main advantage of embedding is that you can retrieve the role information (along with the rest of the User object) in a single query. Embedding also makes atomic updates possible, as everything is stored in the same document. The disadvantage is that you can't query roles independently of the User. For example, finding the roles shared by two users will be more complex than it needs to be with embedded documents.
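The shared-roles case can be sketched as follows. This assumes in-memory objects rather than a live database; the second user's roles and the helper name sharedRoleTitles are made up for illustration.

```javascript
// Two user documents with embedded roles. The second user's roles are
// hypothetical, added only so there is something to compare against.
const users = [
  {
    id: 112,
    name: "Sam",
    roles: [
      { title: "admin", canEdit: true },
      { title: "customer", canEdit: false },
    ],
  },
  {
    id: 234,
    name: "Fred",
    roles: [{ title: "customer", canEdit: false }],
  },
];

// With embedded roles there is no roles collection to query, so both
// user documents must be loaded and compared in application code.
function sharedRoleTitles(userA, userB) {
  const titlesOfB = new Set(userB.roles.map((role) => role.title));
  return userA.roles
    .map((role) => role.title)
    .filter((title) => titlesOfB.has(title));
}

const shared = sharedRoleTitles(users[0], users[1]);
console.log(shared);
```

The comparison itself is simple, but it only works after fetching both parent documents in full, which is the cost the paragraph above describes.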

MongoDB One-to-Many

Let's revisit the users-to-posts relationship. If a User has many posts and the Post object has many attributes, this can be too much to embed in a single document. Remember that the maximum size for a document is 16 MB. In this situation, referencing makes sense.

User Document
{
  id: 112,
  name: "Sam",
  email: "sam@gmail.com",
  createdAt: "12-01-2017",
  updatedAt: "12-01-2017",
  posts:[234, 432]
}

Post Documents
{
  id: 234,
  title: "favorite pizza",
  location: "Los Angeles"
}

{
  id: 432,
  title: "need a break?",
  location: "Sacramento"
}

There are a few advantages to referencing with a one-to-many relationship. You prevent documents from growing out of control, and you can more easily query collections independently. For example, finding all posts with "pizza" in the title will be much easier if posts are stored in a separate collection. The main disadvantage of referencing is that additional queries are required to get details from the referenced documents. For example, returning the title of each post for a user requires an additional application-level join to resolve the referenced ids in the User object.
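Both sides of this trade-off can be sketched in a few lines. The example assumes in-memory arrays standing in for the users and posts collections, with post data taken from the relational example earlier in the article; the function names are made up for illustration.

```javascript
// In-memory stand-ins for two collections (an assumption for illustration).
const users = [{ id: 112, name: "Sam", posts: [234, 432] }];
const posts = [
  { id: 234, title: "favorite pizza", location: "Los Angeles" },
  { id: 432, title: "need a break?", location: "Sacramento" },
];

// Application-level join: one lookup for the user, a second for the posts.
function findPostTitlesForUser(userId) {
  // Lookup 1: fetch the user document, which only holds post ids.
  const user = users.find((candidate) => candidate.id === userId);
  if (!user) return [];

  // Lookup 2: fetch the referenced posts and resolve each id in order.
  const postsById = new Map(posts.map((post) => [post.id, post]));
  return user.posts
    .map((postId) => postsById.get(postId))
    .filter((post) => post !== undefined)
    .map((post) => post.title);
}

// Independent query: posts can be searched without touching any user.
function findPostsByTitle(term) {
  return posts.filter((post) => post.title.includes(term));
}

const titles = findPostTitlesForUser(112);
const pizzaPosts = findPostsByTitle("pizza");
console.log(titles, pizzaPosts.length);
```

The title search never reads a user document, while listing a user's titles takes two lookups instead of one.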

MongoDB One-to-Infinity

While referencing often works well for one-to-many relationships, situations exist where even an array of references can't solve the growing document problem. Take the classic notion of users following other users. A single user could potentially have millions of followers. Even with just ID references, the array could push a document past its maximum size as the number of followers grows without bound. For this scenario, storing the reference on the child side in a separate collection (often called parent referencing) is a good solution:

User Document
{
  id: 112,
  name: "Sam",
  email: "sam@gmail.com",
  createdAt: "12-01-2017",
  updatedAt: "12-01-2017"
}

Following Documents
{
  id: 234,
  follower: 234,
  following: 112,
  createdAt: "12-01-2017"
}

Notice that the original User document has no attribute or sub-document listing "followers" or "following". Instead, a separate "following" collection references user ids.

This has several advantages. First, the growing document problem is solved, as a user's followers are derived from a separate collection. The Following collection can also be queried independently (similar to a one-to-many relationship).
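Deriving followers from the separate collection might look like the sketch below. It assumes an in-memory array standing in for the Following collection; every relationship other than 234 following 112 is hypothetical, as are the document ids and function names.

```javascript
// In-memory stand-in for the Following collection. Only the first
// relationship comes from the example above; the rest are hypothetical.
const following = [
  { id: 1, follower: 234, following: 112, createdAt: "12-01-2017" },
  { id: 2, follower: 567, following: 112, createdAt: "12-01-2017" },
  { id: 3, follower: 112, following: 234, createdAt: "12-01-2017" },
];

// Who follows this user? Answered entirely from the Following collection,
// so the User document never grows as followers are added.
function followerIdsOf(userId) {
  return following
    .filter((edge) => edge.following === userId)
    .map((edge) => edge.follower);
}

// Whom does this user follow? The same collection answers the reverse question.
function followingIdsOf(userId) {
  return following
    .filter((edge) => edge.follower === userId)
    .map((edge) => edge.following);
}

const samFollowers = followerIdsOf(112);
const samFollowing = followingIdsOf(112);
console.log(samFollowers, samFollowing);
```

Each new follower adds one small document to the collection instead of lengthening an array inside a User document.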

Two-Way Referencing

A slightly more advanced technique can be used to get the best of both worlds. Taking the followers example a step further, we could do something like this:

User Document
{
  id: 112,
  name: "Sam",
  email: "sam@gmail.com",
  createdAt: "12-01-2017",
  updatedAt: "12-01-2017",
  followers:[234]
}

Following Documents
{
  id: 234,
  follower: 234,
  following: 112,
  createdAt: "12-01-2017"
}

In addition to referencing the User in our Following document, we also keep an array of follower ids on the User document. This is known as two-way referencing, since the User and Following documents reference each other.

The main advantage of two-way referencing is that references can be found quickly without additional queries. For example, we can tell how many followers a User has without querying the Following collection. This assumes followers isn't a one-to-infinity relationship. Also remember that getting the details of a user's followers still requires an additional join query.

A major disadvantage of two-way referencing is that a change can no longer be made with a single atomic write. If a follower is added or removed, two updates are required instead of just one, and unless they are wrapped in a multi-document transaction the two sides can briefly disagree.
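The two-write cost can be sketched like this. The example assumes in-memory structures in place of real collections, and the follow and unfollow helpers are made up for illustration. In a real database each numbered write would be a separate operation that can fail independently of the other.

```javascript
// In-memory stand-ins for the users and Following collections
// (an assumption for illustration).
const users = new Map([
  [112, { id: 112, name: "Sam", followers: [] }],
  [234, { id: 234, name: "Fred", followers: [] }],
]);
const following = [];
let nextFollowingId = 1;

function follow(followerId, followedId) {
  const alreadyFollowing = following.some(
    (edge) => edge.follower === followerId && edge.following === followedId
  );
  if (alreadyFollowing) return;

  // Write 1: insert the relationship document.
  following.push({ id: nextFollowingId++, follower: followerId, following: followedId });
  // Write 2: mirror the relationship on the followed user's document.
  users.get(followedId).followers.push(followerId);
}

function unfollow(followerId, followedId) {
  const index = following.findIndex(
    (edge) => edge.follower === followerId && edge.following === followedId
  );
  if (index === -1) return;

  // Write 1: remove the relationship document.
  following.splice(index, 1);
  // Write 2: remove the mirrored id from the followed user's document.
  const followed = users.get(followedId);
  followed.followers = followed.followers.filter((id) => id !== followerId);
}

follow(234, 112);
console.log(users.get(112).followers.length, following.length);
```

If the second write in either function were skipped or failed, the follower count on the User document would drift from the contents of the Following collection.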

Denormalization

Denormalization is all about avoiding unnecessary application-level join queries. Building off our followers example, we could do something like this:

User Document
{
  id: 112,
  name: "Sam",
  email: "sam@gmail.com",
  createdAt: "12-01-2017",
  updatedAt: "12-01-2017",
  followers:[{id:234, name:"Fred"}]
}

Following Documents
{
  id: 234,
  follower: 234,
  following: 112,
  createdAt: "12-01-2017"
}

Notice how we've also included the name of each follower in the User object's followers array. Now we don't have to run an additional query to retrieve each follower's name. The disadvantage of having to perform multiple updates remains, however: if we change a given user's name, we have to update every User document that lists that user as a follower.
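The update fan-out caused by a rename can be sketched as follows. The example assumes an in-memory array in place of the users collection; the third user and the renameUser helper are hypothetical, added only to show more than one copy of the name.

```javascript
// In-memory stand-in for the users collection. User 567 is hypothetical,
// included so that Fred's name is copied into more than one document.
const users = [
  { id: 112, name: "Sam", followers: [{ id: 234, name: "Fred" }] },
  { id: 234, name: "Fred", followers: [] },
  { id: 567, name: "Ana", followers: [{ id: 234, name: "Fred" }] },
];

// Renames a user and every denormalized copy of the name.
// Returns how many documents had to be modified.
function renameUser(userId, newName) {
  let modifiedDocuments = 0;
  for (const user of users) {
    let changed = false;
    if (user.id === userId) {
      user.name = newName;
      changed = true;
    }
    for (const follower of user.followers) {
      if (follower.id === userId) {
        follower.name = newName;
        changed = true;
      }
    }
    if (changed) modifiedDocuments += 1;
  }
  return modifiedDocuments;
}

const modified = renameUser(234, "Frederick");
console.log(`${modified} documents modified for one rename`);
```

A single logical change touches the user's own document plus one document per user they follow, which is why denormalization suits data that is read often and changed rarely.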

MongoDB Schema Design for Real-World Examples

In the real world, how you model your data should depend largely on cardinality and the nature of the application's queries. For example, if an application reads data frequently and updates it rarely, denormalization and two-way referencing make sense. What's most appropriate for your particular use case depends on questions like:

"How frequently will I be updating the data?"

"How much will documents grow?"

A combination of the above-mentioned techniques should be applied based on such questions.

Conclusion

There is no absolute right or wrong way to design your data models. It's really a case-by-case decision that depends on the nature of your application. It's important to remember that cardinality is central to design decisions. By understanding the advantages and disadvantages of embedding versus referencing, you can maximize scalability with MongoDB.

Your thoughts?