The growing analytics requirements by business decision makers forces data engineers to come up with innovative, effective ways to make data from operational systems available to BI and AI initiatives.
For these decision makers, the freshness of the data they are using to make decisions compared to the reality of the business (or in technical terms, are the analytic dashboards based on near real-time data), is a key element.
Batch ingestion does not meet this requirement as it does not enable near real-time replication of data, or streaming ingestion. It is also resource intensive and requires off-hour operation windows, that in some cases are not long enough, to avoid operational systems overhead during operation hours.
Change data capture (CDC) enables a different strategy - continuously streaming, ingesting and replicating data from operational systems in an incremental fashion, looking only at changes made to the data - possibly after initial extraction of historical data once.
CDC approach provides the benefit of being able to seamlessly stream data from operational systems, with minimal overhead (if performed right), eliminating long batch windows, providing a real-time view of the data to analytic initiatives and most importantly, without forcing any application changes. In most cases, the application does not even know changes are being tracked.
That said, there are many approaches, potential pitfalls, decision points, optimizations and considerations to achieving successful CDC ingestion.
There are usually multiple ways to extract changes from a data set, depending on the data source type. A CDC replication process is only as quick as its weakest point. Therefore, it does not matter how scalable your replication or ingestion pipeline is if the CDC extraction method used on the source is slow.
In an Oracle database, we can implement CDC in many different ways, such as triggers, JDBC queries based on a date pseudo column or a redo log extraction. The first two are extremely inefficient, both from performance perspective and more importantly database overhead concerns.
Database triggers are extremely expensive, synchronous operations that will delay every single database transaction to execute the trigger. JDBC query based extraction involves expansive range scans on the data tables that can create significant overhead even if the table is indexed and partitioned correctly.
CDC using redo log extraction is the most efficient way to extract changes in an asynchronous fashion, without slowing down operations and without additional I/O on the data tables. But even then, there is more than one way to extract changes from the redo log.
We can use Log Miner for CDC, an API provided by Oracle, originally intended for forensics analysis and manual logical recovery scenarios. LogMiner, however, was not intended to be used for Change Data Capture, but can be for now even as Oracle slowly deprecates some of its capabilities. It’s important to note that Log Miner is, however, limited to one core in the database which roughly translates into anything up to 10,000 changes per second. Anything more and Log Miner starts accumulating lag, which will keep growing continuously until a restream occurs, after which the lag will start growing again.
The most effective method available (tenfold more complicated than LogMiner or other approaches), is a direct redo log parsing. A direct redo log parsing reads the redo log in a binary fashion and parses the bytes based on offset positions, essentially reverse engineering the binary output. While only a handful of solutions have this capability, this method offers speeds of up to 100,000 per second, with even lower overhead on the database.
Performance, latency and source overhead tradeoffs are the key factors in picking the right CDC strategy.
In relevant databases that offer transaction approach (more than one change to the data is committed or rolled back all together), another layer of complexity is added to the replication process - Reading uncommitted changes.
There are a few reasons to read uncommitted changes:
The challenge, of course, is the need to manage the uncommitted changes. For example, in a rollback scenario, you might remove these changes from the persistence layer so it does not get replicated, but also does not waste disk space. The second important aspect is that persisting the uncommitted data creates another I/O process that can damage performance, throughput and latency, therefore it’s very important to choose the right persistence layer and the right level of asynchronous handling.
A replication often requires to first get the current “version” of the data and then apply all CDC changes to it. We call the first step initial capture. There are a few optimization concerns of initial capture and the synchronization between initial capture and CDC:
Initial capture, the process of extracting once the current state of the data set, should be robust and extract the data set using bulk, parallel processing methods. There is usually more than one way of extracting data from a source, either through JDBC, or APIs. Picking and optimizing the best approach is key to ensuring a multi-object replication maintains acceptable service levels. Each source has a unique, optimized way to perform bulk extraction. Restreaming is very common, due to many potential reasons, such as nightly loads, out of sync objects, failures and more, therefore initial capture is not a one-off process.
A replication process requires a combination of the existing state of the data set as well as all the capturing all the changes from that point in time. The point in time is typically referred to as the synchronization offset. This point is usually tracked by either a timestamp, but more accurately, by looking at an internal change sequence number. The synchronization offset is required to achieve a successful, exactly once data replication, regardless of the source type, for any source type that stores data at rest. Trying to manually sync between the initial capture and CDC changes is a tedious task that is prone to failures, due to the high frequency of changes in the data set. It’s important to ensure there are automatic, reliable ways to ensure a proper synchronization.
Decouple Initial Capture and CDC process dependency to Improve Performance:
Initial capture, even if optimized, can still take a long time to complete, due to the size of the data set. It’s important to eliminate dependency between the completion of the two processes, by allowing both processes to run in parallel. This allows the overall process of achieving a real-time replication to reach a “synced” state much quicker and in most sources the initial capture data and the CDC changes are not extracted from the same logical storage anyway, so leveraging that to improve overall performance and “time to synced” is recommended. This becomes especially critical when handling many objects in a queueing fashion, as the “time to synced” can delay further for every consecutive object in the queue.
These are just a few of the optimization challenges and considerations that are key to streaming ingestion using CDC. Getting the basics can feel relatively simple, but like anything in life, the devil is in the details...and these are just a few of them. A poorly executed DIY or even commercially acquired CDC operation can not only risk the analytic project, but more importantly may risk the operational system leaving the organization in jeopardy. Using these considerations to drive innovation in data ingestion from operational systems will result in positive gains for BI and AI objectives.
Interested in Learning More? Download our ebook in partnership with Enterprise Management Associates (EMA)
"6 Principles of Modern, Multi-Modal Change Data Capture | How to Build Fast, Agile, Reliable Data Pipelines"