Last modified: September 1, 2017
If you’re not familiar with the big data world, then the concept of change data capture (CDC) may seem odd. While the idea of analyzing DB changes certainly makes sense from a business standpoint, continuously recording changes for every insert, update, delete may seem a bit overboard to the unfamiliar. In this article, we'll explore what CDC is and why it makes sense when working with big data and data warehouses.
Change Data Capture is a collection of design patterns for determining and recording database change events. It identifies and captures data than has been modified in your DB.
Change Data Capture plays well with data warehouses which emphasize capturing and preserving data "state".
If CDC is just capturing database changes, why not just use table differencing or change-value selection? With table differencing, you can run diffs on entire tables to see differences. Additionally, change-value selection allows you to compare records based on specific columns like LAST_UPDATED.
There are a few issues with such alternatives. One is the computational cost of running diffs on larger tables. While smaller DBs won't cause issues, running MINUS queries to compare big tables can take a long time. Running analytical queries on master dbs or production environments can also slow down app performance. For these reasons, you may periodically export your DB to a staging environment for comparisons. The problem with this is transport costs grow exponentially with larger data sets.
Another issue with diffing is the inability to capture intermediary changes to your data. Let's say someone updates a field but then changes it again to it's original value. If you run a simple compare, you won't capture this change event with these alternative methods.
This is why Change Data Capture makes sense for larger data sets. It allows you to record and capture ALL change events without slowing down production environments or long running queries.
CDC captures individual data changes and avoids dealing with DBs as a whole. You don't need to compare whole tables or DB instances with CDC. This makes CDC both more accurate and efficient than other methods.
CDC can work either synchronously or asynchronously:
With synchronous CDC, triggers are used as part of every DB transaction to capture events immediately as they happen. For every insert, update, delete, a trigger runs to record the change activity.
Asynchronous CDC operates independently of transactions. It utilizes a redo log file for recording changes. This improves performance as CDC operations aren't directly tied to each transaction in your DB as they happen.
The real benefits of using CDC are only realized with larger data sets. You'll see CDC a lot with enterprise data warehouses that emphasize analytics and historical data comparisons.