Strategic Considerations for Successful Streaming Ingestion
The push towards real-time data and streaming-first architectures has never been more pervasive than in our current big data and analytics landscape. The volume and velocity of data are ever increasing, straining legacy architectures as they attempt to process it effectively. Ingesting data from operational sources in near real time, transformed and optimized, does not come without complexity.
In this comprehensive guide, we will walk you through best practices for deploying modern Change Data Capture (CDC).
Compared to traditional batch processes, Change Data Capture is a low-overhead, low-latency method of extracting data: it limits intrusion into the source and continuously ingests and replicates data by tracking changes as they occur. When designed and implemented effectively, CDC is the most efficient way to meet today's scalability, efficiency, real-time and low-overhead requirements.
It is imperative that your business has the right architecture in place: one that handles high data throughput, makes it simple to replicate subsets of data and ever-changing schemas, and captures the data and its changes exactly once before replicating or ETL-ing the CDC data to your data warehouse or data lake for analytics.
Let’s begin by considering the “why” behind your move to a streaming-first architecture. Are you exploring new use cases to drive business growth with streaming data ingestion? Or are you watching existing use cases hit performance challenges with legacy tools and seeking new solutions?
You might consider implementing a combined batch and streaming approach, maintaining the batch processes that are working well and transitioning other workloads to streaming where business gains or system efficiencies are clear. In either case, modern change data capture will be a vital component of your new architecture.
CDC is the gateway to unlocking two main ingestion use cases: CDC Replication and CDC Streaming Ingestion (streaming ETL).
The main difference between the two strategies is where the data manipulation is performed - in the ingestion pipe or in the destination analytics engine (data warehouse or data lake). Most organizations use a combination of these ingestion strategies, based on their needs, use cases and infrastructure.
Traditionally, however, organizations were forced to use different ingestion products for each use case, because many legacy CDC Replication tools have no real-time transformation, data manipulation, aggregation or correlation capabilities within the pipe. Conversely, CDC Streaming Ingestion tools were not built to natively support CDC Replication use cases and are missing key capabilities for that use case.
With the addition of the Batch ETL use case, which every organization also uses in some capacity, organizations often find themselves with three different tools to achieve a full implementation of their data ingestion strategy.
Using three different tools to support all three use cases is not only extremely cost inefficient, it also leaves the organization managing and maintaining three tools, three sets of processes, three skill sets and no single point of ownership for its data ingestion approach.
It is vital to use the right tool for the right job, but it’s also key to consolidate ingestion processes where possible. Centralize system monitoring. Lower costs on various platform licenses. Streamline your architecture for more transparency into data processing, scalability and ease of use.
Growing data ingestion requirements from analytics initiatives create additional workloads on an organization's operational systems. Batch jobs are scheduled for certain windows of the day to avoid source overhead during business hours, but batch loads often push that boundary, negatively affecting day-to-day operations. While organizations will continue to use batch ETL to support predictable workloads such as post-mortem analysis and weekly reporting, batch-only architectures cannot keep up with the velocity and volume of data that modern enterprises require. However, streaming data from the operational systems can also have a significant impact on the ability to guarantee SLAs, if not implemented properly. Change Data Capture is the single most effective way to extract data from an operational system without making application changes.
CDC techniques avoid reading entire data sets and instead focus only on the changes to the data, extracting incremental data. This eliminates the inherently expensive, traditional batch data extraction on source systems. Several CDC techniques are available in many popular data sources, each offering different levels of performance, intrusiveness and, perhaps most importantly, source overhead. For example, in a database we can implement CDC in several ways, such as triggers, JDBC queries based on a date pseudo column, or the transaction log. The first two are extremely inefficient, both from a performance perspective and, more importantly, in terms of database overhead.
Database triggers are expensive, synchronous operations that delay every single database transaction in order to execute the trigger. JDBC query-based extraction involves expensive range scans on the data tables that can create significant overhead even when the table is indexed and partitioned correctly. CDC based on transaction log extraction is the most efficient way to extract changes asynchronously, without slowing down operations and without additional I/O on the data tables. The impact on both performance and source overhead differs significantly depending on the chosen CDC strategy, in ways that can "make or break" the project due to the level of risk introduced to the operational system.
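To make the contrast concrete, here is a minimal sketch of log-based change extraction from PostgreSQL using psycopg2's logical replication support. It assumes a source database with wal_level=logical and the wal2json output plugin available; the connection string, slot name and handler are illustrative placeholders rather than a prescribed implementation.

```python
# Minimal log-based CDC sketch: consume changes from PostgreSQL's write-ahead
# log via a logical replication slot, instead of scanning the data tables.
# Assumes wal_level=logical and the wal2json output plugin are configured.
import psycopg2
import psycopg2.extras

DSN = "dbname=appdb user=cdc_reader"  # placeholder connection string

conn = psycopg2.connect(DSN, connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()

# Start streaming from the slot; create it on first run.
try:
    cur.start_replication(slot_name="cdc_demo_slot", decode=True)
except psycopg2.ProgrammingError:
    cur.create_replication_slot("cdc_demo_slot", output_plugin="wal2json")
    cur.start_replication(slot_name="cdc_demo_slot", decode=True)

def handle_change(msg):
    # msg.payload is a JSON document describing committed row changes.
    print(msg.payload)
    # Acknowledge progress so the database can recycle WAL segments.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle_change)  # blocks, streaming changes as they hit the log
```

Because the changes are read from the log asynchronously, the data tables themselves see no extra queries or trigger executions, which is exactly the overhead advantage described above.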
Retrieving the current image of the data set before applying CDC changes is vital to achieving a fully functional replication process. This step is often repeated relatively frequently as part of data set replication. Therefore, optimizing the initial capture phase, and synchronizing it with the CDC changes, is a key goal.
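As a rough sketch of that synchronization, the snippet below records the source's change-log position, loads the initial snapshot, and only then replays CDC events from that position, relying on key-based upserts so overlapping events do no harm. The source and target objects and their methods (current_log_position, snapshot_rows, stream_changes_from, upsert, apply_change) are hypothetical placeholders, not a specific product's API.

```python
# Hypothetical sketch: synchronize an initial snapshot with subsequent CDC changes.
def initial_capture_then_cdc(source, target, table):
    # 1. Record the change-log position *before* copying data, so no change is missed.
    start_pos = source.current_log_position()

    # 2. Bulk-load the current image of the table into the target.
    for row in source.snapshot_rows(table):
        target.upsert(table, row)

    # 3. Replay changes captured at or after the recorded position.
    #    Upserts/deletes keyed by primary key make replayed events idempotent,
    #    so changes already reflected in the snapshot are applied harmlessly.
    for change in source.stream_changes_from(start_pos, table):
        target.apply_change(table, change)
```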
Here we will highlight a few critical aspects of initial data capture that ensure successful CDC replication:
A CDC ingestion process is only as fast as its weakest link. It does not matter how scalable your ingestion pipeline is if the CDC extraction method used on the source is slow; the source will almost always become the bottleneck of any CDC ingestion process. It's also important to remember that not all CDC extraction methods and APIs are created equal. You must consider the performance, latency and source overhead tradeoffs before choosing a CDC approach.
For example, in an Oracle database we can implement CDC in many different ways, such as triggers or JDBC queries based on a date pseudo column. We can even use LogMiner, which is considered "true" CDC because it reads changes from the redo log. That said, even LogMiner, the preferred method of the three, is limited to a single core in the database, which roughly translates into up to 10,000 changes per second. Beyond that, LogMiner starts accumulating lag, and that lag keeps growing until a restream occurs, after which it starts growing once more.
There is, however, a fourth method available in this instance (tenfold more complicated than any of the other three approaches): direct redo log parsing. Direct redo log parsing reads the redo log's binary data and natively parses the bytes based on offset positions, essentially reverse engineering the binary format. While only a handful of solutions have this capability, this method offers speeds of up to 100,000 records per second (depending on record size and storage speed), with even lower overhead on the database.
Bottlenecks can also occur temporarily throughout the replication pipeline, due to momentary network latency, source or target service delays and more. Therefore, the pipeline should also have backpressure mechanisms to avoid hitting memory limits in the pipe and to protect an already pressured network or disk. Applying backpressure at multiple points in the data flow simultaneously also alleviates strain on the system and helps ensure data accuracy.
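As a simple illustration of backpressure inside one pipeline stage, the sketch below uses a bounded in-memory queue: when the downstream writer slows, the queue fills and the upstream reader blocks instead of exhausting memory. The stage names, sizes and timings are assumptions for illustration only.

```python
# Backpressure sketch: a bounded queue makes a slow consumer throttle the producer.
import queue
import threading
import time

buffer = queue.Queue(maxsize=1000)  # bounded: this is the backpressure point
SENTINEL = object()

def reader():
    for i in range(2_000):
        change = {"offset": i, "payload": f"row-{i}"}
        buffer.put(change)          # blocks when full -> source reads slow down
    buffer.put(SENTINEL)

def writer():
    while True:
        change = buffer.get()
        if change is SENTINEL:
            break
        time.sleep(0.001)           # simulate a momentarily slow target or network
        # ... deliver change to the target here ...

t1 = threading.Thread(target=reader)
t2 = threading.Thread(target=writer)
t1.start(); t2.start()
t1.join(); t2.join()
```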
In databases that support transactions (where more than one change to the data is committed or rolled back together), another layer of complexity is added to the replication process: reading uncommitted changes.
There are a few reasons to read uncommitted changes:
The challenge, then, is managing the uncommitted changes. For example, in a rollback scenario you want to remove those changes from the persistence layer so they are not replicated and do not waste disk space. The second important aspect is that persisting uncommitted data creates another I/O process that can negatively affect performance, throughput and latency. It is vital to choose the right persistence layer and the right level of asynchronous processing.
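A minimal sketch of that bookkeeping might look like the following: uncommitted changes are buffered per transaction ID, emitted downstream on commit and discarded on rollback. In production this buffer would typically be an asynchronous, spillable persistence layer rather than an in-memory dictionary, and the event shape shown here is assumed for illustration.

```python
# Sketch: buffer uncommitted changes per transaction, emit on COMMIT, drop on ROLLBACK.
from collections import defaultdict

class TransactionBuffer:
    def __init__(self, emit):
        self.pending = defaultdict(list)  # txn_id -> buffered changes
        self.emit = emit                  # callback that ships changes downstream

    def on_event(self, event):
        txn_id, kind = event["txn_id"], event["kind"]
        if kind == "change":
            self.pending[txn_id].append(event)
        elif kind == "commit":
            for change in self.pending.pop(txn_id, []):
                self.emit(change)           # only committed data leaves the pipeline
        elif kind == "rollback":
            self.pending.pop(txn_id, None)  # reclaim space, nothing is replicated

# Example usage with assumed event shapes:
buf = TransactionBuffer(emit=print)
buf.on_event({"txn_id": 1, "kind": "change", "row": {"id": 7, "total": 42}})
buf.on_event({"txn_id": 1, "kind": "commit"})
buf.on_event({"txn_id": 2, "kind": "change", "row": {"id": 8, "total": 9}})
buf.on_event({"txn_id": 2, "kind": "rollback"})
```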
Achieving a true exactly-once guarantee in a cluster computing pipeline is a big challenge. Exactly-once semantics means ensuring that data arrives from the source into the target once - no more and no less. While this is a key requirement for the data replication use case, it is also a key requirement for many data ingestion and ETL processes, and it is much harder to achieve in streaming and data replication scenarios than in traditional batch processing. In a multi-modal pipeline, with a source and a target on either end, there are many potential risks to the end-to-end exactly-once guarantee. Even if each component guarantees exactly-once delivery between itself and the next, only a recovery component that governs the process end to end can guarantee exactly-once semantics overall.
Take Kafka, for example: the Kafka consumer guarantees exactly-once from the source read into Kafka, but it only covers issues that can occur while data is being delivered to the consumer. It does not cover source failures, target failures (wherever Kafka sends the data after it receives it), or any synchronization between what was sent from the source and what arrived at the target. When something like Spark is added for processing in between, there are now four or five potential points of failure in the chain. Tracking checkpoints of data integrity throughout the pipeline is the key to ensuring an exactly-once guarantee.
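One common way to enforce the guarantee at the target is to write each batch of changes and its source checkpoint in the same transaction, so a restart resumes exactly where the last successful write left off and redelivered events are skipped. The sketch below uses SQLite purely for illustration; the table names and offset scheme are assumptions, not any particular product's mechanism.

```python
# Sketch: commit data and its source checkpoint atomically so replays are harmless.
import sqlite3

conn = sqlite3.connect("target.db")
conn.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS checkpoints (pipeline TEXT PRIMARY KEY, last_offset INTEGER)")

def last_offset(pipeline="demo"):
    row = conn.execute("SELECT last_offset FROM checkpoints WHERE pipeline = ?", (pipeline,)).fetchone()
    return row[0] if row else -1

def apply_batch(batch, pipeline="demo"):
    """batch: list of (offset, row_id, name) changes read from the source."""
    if not batch:
        return
    resume_from = last_offset(pipeline)
    with conn:  # one transaction: data and checkpoint land together, or neither does
        for offset, row_id, name in batch:
            if offset <= resume_from:
                continue  # already applied before a crash/restart -> skip, no duplicates
            conn.execute("INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)", (row_id, name))
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (pipeline, last_offset) VALUES (?, ?)",
            (pipeline, batch[-1][0]),
        )

apply_batch([(0, 1, "Ada"), (1, 2, "Grace")])
apply_batch([(1, 2, "Grace"), (2, 3, "Alan")])  # overlapping redelivery is de-duplicated
print(conn.execute("SELECT * FROM customers").fetchall())
```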
Managing CDC replication at scale can be extremely challenging. Replication use cases often require handling dozens, hundreds or even thousands of objects. Consider replicating a database schema, for example, which can contain thousands of tables. Each table needs to be validated for replication conflicts, unsupported data types, access permissions and much more. Without the ability to handle multiple objects through a bulk management instrument, defining and maintaining the replication process day to day can become an unwieldy, often insurmountable task.
In many cases, do-it-yourself developments or new-age commercial solutions were not built with replication use cases in mind. As a result, bulk management instruments for replication are simply not available. What might seem minor at the development or POC stage becomes a true challenge down the road when moving into "real-life" production with actual use cases. The complexity only increases with DIY and new-age commercial solutions when initial data capture and initial capture-CDC synchronization have to be done manually. Thinking back to the thousands of objects to be replicated, consider the challenge of manually syncing and loading initial capture processes for each table schema.
In a word...punishing.
Replication groups solve this problem, driving both operational and business efficiency with one-click bulk replication. Replication groups can facilitate large data migrations, cross-platform data warehousing (replicating to a data lake or data warehouse) and the management of many thousands of objects in a simple way. Performance and usability are the two most critical aspects of every successful replication.
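As a rough sketch of what bulk, pattern-based selection can look like, the snippet below expands include/exclude wildcard rules into a concrete set of per-table replication tasks; the rule format and table names are invented for illustration.

```python
# Sketch: resolve a replication group's wildcard rules into concrete replication tasks.
from fnmatch import fnmatch

def resolve_group(all_tables, include, exclude=()):
    selected = [
        t for t in all_tables
        if any(fnmatch(t, pat) for pat in include)
        and not any(fnmatch(t, pat) for pat in exclude)
    ]
    # One task definition per table; a real tool would also carry key columns,
    # filters and target mappings, validated in bulk.
    return [{"table": t, "mode": "initial_load_then_cdc"} for t in selected]

tables = ["sales.orders", "sales.order_items", "sales.tmp_loader", "hr.employees"]
group = resolve_group(tables, include=["sales.*"], exclude=["*.tmp_*"])
print(group)
```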
Data drift (also referred to as schema evolution) can create serious technical challenges as seemingly never-ending mutations cause structural and semantic drift. As fields are added, modified or deleted at the source, or the meaning of the data evolves, data drift can break the data pipeline and create downtime for analytics.
Data drift is not a new challenge. It has been present in batch ingestion processes for decades. When transitioning to streaming ingestion, however, data drift becomes tenfold more challenging for two main reasons:
In streaming pipelines, data drift should mostly be handled automatically, by a predefined set of rules that determine how each type of change is treated. These rules can be simple and generic, or fine-grained by type of data, type of change, type of data flow and more. Without a predefined set of rules, human intervention will be required and SLA disruption will occur.
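A minimal sketch of such rule-driven handling, with an assumed rule set and record format, might look like this:

```python
# Sketch: apply predefined rules when an incoming record drifts from the known schema.
known_schema = {"id": int, "amount": float}

rules = {
    "new_field": "add_column",    # evolve the target automatically
    "missing_field": "fill_null", # keep the pipeline flowing
    "type_change": "alert",       # flag for review instead of silently corrupting data
}

def handle_drift(record, schema, rules, alert):
    for field, value in record.items():
        if field not in schema:
            if rules["new_field"] == "add_column":
                schema[field] = type(value)   # extend the schema in place
        elif not isinstance(value, schema[field]):
            if rules["type_change"] == "alert":
                alert(f"type drift on '{field}': expected {schema[field].__name__}")
    for field in schema:
        if field not in record and rules["missing_field"] == "fill_null":
            record[field] = None
    return record

print(handle_drift({"id": 1, "amount": "9.99", "coupon": "SAVE5"}, known_schema, rules, alert=print))
```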
Even after setting the rules, you need to expect the unexpected: maintain both the before-drift and after-drift streams, maintain watermarks (to order the events properly and reconcile the data in the target), and have an easy way to apply the right manipulation to the data so it adapts to the new structure and the old-structured data can be re-streamed to the target.
The last layer of data drift handling needs to include proper monitoring of and alerting on data drift incidents, for auditing and administration purposes.
Enabling multi-modal (ETL/ELT) CDC streaming ingestion capabilities requires getting more than just one component right. Looking at open-source frameworks, each one enables part of that promise; as standalone frameworks, however, they are not a complete solution.
The key is to have all of these technologies work together coherently, in a fully orchestrated environment. Achieving this coherence across open-source frameworks, however, is extremely challenging when attempting a DIY solution. Best case scenario: it takes a few years to design, develop and build. The system ends up working, but requires endless maintenance, administration and development. Over 90% of these DIY projects end up failing, and over 60% of data engineers' valuable time is spent on patches and fixes to complex, custom-coded systems.
On top of all these risks, there is the ultimate risk of getting locked into a framework that ends up deprecated a few years down the road, much as Hadoop MapReduce has been widely replaced today by popular frameworks such as Spark. The right orchestration solution needs to take all of these risks into account: leverage the best open-source frameworks, simplify the management, coding and administration burden, and recognize that open-source projects often become deprecated when the new cool kid comes to town. Moreover, it is important to consider the inevitability that data volume and velocity will increase in the years to come. Today's workloads are just a fraction of tomorrow's workloads, so you need to design an architecture that can scale with the growing demands.
Remember, CDC should be thought of as a means to an end - a supplement to your data architecture, acting in service of your downstream data rather than as a standalone effort.
Step One: Identify your primary streaming use case.
Step Two: Determine whether other use cases overlap and could be accommodated by one tool rather than many.
Step Three: Aim for data architecture simplification to optimize operations, management, monitoring and system sustainability.
Regardless of the use case, some final considerations:
Modern Change Data Capture is a vital component of a streaming-first data architecture. These guidelines and best practices will help you achieve an optimized streaming pipeline architecture that ingests data seamlessly and efficiently from any data source, including legacy sources, enabling analytics teams to deliver the most enriched, up-to-date business insights to the organization's decision makers. As streaming-first ingestion becomes the preferred standard for data architectures supporting business operations and visibility, you must make sure you have the right strategy in place to succeed.