Unless you’ve been hiding in a cave in the past year, you’ve probably heard of CEP (Complex Event Processing) or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology allows users to query and manipulate discrete pieces of information, like events and messages, in real-time—without being encumbered by a traditional database management system.
The analogy here is that if you can’t bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record? Furthermore, this lets me analyze the data right after its time of creation! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?
I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.
When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better approach to traditional ETL. CEP provides complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own process. This prevents the unnecessary or reckless copying of source application content.
There are many industries need a real-time view of customer activities. For instance in the gaming industry when a customer inserts her card into a slot machine, the casino wants to provide a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, to transform and standardize it, to load it into a table, only to make it available to a query to identify the best offer. In the world of CEP we’d simply query the initial message and make the best offer.
Many ETL tools already use query language constructs and operators to manipulate data. They typically require the data to be loaded into a database. The major vendors have evolved to an “ELT” architecture: to leverage the underlying database engine to address performance. Why not simply tackle the performance problem directly and bypass the database altogether?
The promise of CEP a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher performance and easier-to-use alternative. The interesting part will be seeing how long before companies emerge from their caves and adopt it.
photo by Orin Zebest via Flickr (Creative Commons license)
In the motion picture industry, studios separate responsibilities for creating content from responsibilities for distributing content. The people who make the movies option the scripts, hire the talent, and film the scenes. The distributors of the films, on the other hand, figure out how to package and deploy the films. They need to know which theaters require 30 millimeter versus 70 millimeter formats, or even IMAX. They also deal with DVD packaging, including different international DVD formats. The industry understands the importance of having a supply chain that differentiates between the roles of content creation, content packaging, and distribution.
In IT we’re very quick to point to our operational systems as creators and owners of data. But maybe the solution is that IT establishes a functional team that’s responsible for data packaging and distribution, just like the movie industry.
Traditionally data formats and standards have fallen into the realm of the architecture team. Unfortunately this is typically a paper-only activity without teeth. A data distribution team wouldn’t focus on paperwork. They would be focused on data logistics, receiving content from the various source systems and packaging the data for consumption by other systems. This isn’t about implementing a specific platform to store or move data. It’s about active management of corporate data content.
One of the biggest development challenges is the hunting expedition that developers go on to find and acquire the data they need. Most aren’t aware of all their choices, let alone the optimal systems of record.
Currently every application, data mart, data warehouse, reporting system that needs data from another system follows a specific set of procedures to obtain that data. Each system requests different data formats, different delivery schedules, and different content. Everything is custom, there are few if any standards, and there are no economies of scale.
This will also unburden the various application teams from building and maintaining the never ending volume of custom extract requests. The only way to stop the madness is to compartmentalize content creation from data packaging and distribution. This means establishing a data supply chain that separates data creators from data distribution from consumers. Who knew IT infrastructure was just like the movies?