Complex Event Processing: Challenging Real-Time ETL
Unless you’ve been hiding in a cave in the past year, you’ve probably heard of CEP (Complex Event Processing) or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology allows users to query and manipulate discrete pieces of information, like events and messages, in real-time—without being encumbered by a traditional database management system.
The analogy here is that if you can’t bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record? Furthermore, this lets me analyze the data right after its time of creation! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?
I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.
When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better approach to traditional ETL. CEP provides complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own process. This prevents the unnecessary or reckless copying of source application content.
There are many industries need a real-time view of customer activities. For instance in the gaming industry when a customer inserts her card into a slot machine, the casino wants to provide a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, to transform and standardize it, to load it into a table, only to make it available to a query to identify the best offer. In the world of CEP we’d simply query the initial message and make the best offer.
Many ETL tools already use query language constructs and operators to manipulate data. They typically require the data to be loaded into a database. The major vendors have evolved to an “ELT” architecture: to leverage the underlying database engine to address performance. Why not simply tackle the performance problem directly and bypass the database altogether?
The promise of CEP a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher performance and easier-to-use alternative. The interesting part will be seeing how long before companies emerge from their caves and adopt it.
photo by Orin Zebest via Flickr (Creative Commons license)
Improving BI Development Efficiency: Standard Data Extracts
A few years ago, a mission to Mars failed because someone forgot to convert U.S. measurement units to metric measurement units. Miles weren’t converted to kilometers.
I thought of this fiasco when reading a blog post recently that insisted that the only reasonable approach for moving data into a data warehouse was to position the data warehouse as the “hub” in a hub-and-spoke architecture. The assumption here is that data is formatted differently on diverse source systems, so the only practical approach is to copy all this data onto the data warehouse, where other systems can retrieve it
I’ve written about this topic in the past, but I wanted to expand a bit. I think it’s time to challenge this paradigm for the sake of BI expediency.
The problem is that the application systems aren’t responsible for sharing their data. Consequently little or no effort is paid to pulling data out of an operational system and making it available to others. This then forces every data consumer to understand the unique data in every system. This is neither efficient nor scale-able.
Moreover, the hub-and-spoke architecture itself is also neither efficient nor scalable. The way manufacturing companies address their distribution challenges is by insisting on standardized components. Thirty-plus years ago, every automobile seemed to have a set of parts that were unique to that automobile. Auto manufacturers soon realized that if they established specifications in which parts could be applied across models, they could reproduce parts, giving them scalability not only across different cars, but across different suppliers.
It’s interesting to me that application systems owners don’t aren’t measured on these two responsibilities:
- Business operation processing—ensuing that business processes are automated and supported effectively
- Supplying data to other systems
No one would argue that the integrated nature of most companies requires data to be shared across multiple systems. That data generated should be standardized: application systems should extract data and package it in a consistent and uniform fashion so that it can be used across many other systems—including the data warehouse—without the consumer struggling to understand the idiosyncrasies of the system it came from.
Application systems should be obligated to establish standard processes whereby their data is availed on a regular basis (weekly, daily, etc.). Since most extracts are column-record oriented, the individual values should be standardized—they should be formatted and named in the same way.
Can you modify every operational system to have a clean, standard extract file on Day 1? Of course not. But as new systems are built, extracts should be built with standard data. For every operational system, a company can save hundreds or even thousands of hours every week in development and processing time. Think of what your BI team could do with the resulting time—and budget money!
photo by jason b42882
Blurring the Line Between SOA and BI
photo by Siomuzzz
I recently read with interest an article in the Microsoft Architect Journal on so-called Service-Oriented Business Intelligence or, as the article’s authors call it, “SoBI.” The article was well-intentioned but confusing. What it confirmed to me is that plenty of experienced IT professionals are struggling to reconcile Service Oriented Architecture (SOA) concepts with business intelligence.
SOA is certainly a valuable tool in the architecture and development toolbox; however, I think it’s only fair to keep SOA in perspective. It’s an evolutionary technology in IT that has numerous benefits to developer productivity and application connectivity. I’m not sure that injecting SOA into a data warehouse environment or framework will do anything more than freshen a few low-level building blocks that have been neglected in some data warehouse environments. I’m certainly not challenging the value of SOA; I’m just trying to put in perspective to those folks that are focused on data warehouse and business intelligence activities.
The idea around SOA is to create services (or functions, procedures, etc.) that can be used by other systems. The idea is simple: build once, use many times. This ensures that important (and possibly complicated) application processes can be used by numerous disparate applications. It’s like an application processing supply chain: let the most efficient resource build a service and provide to everyone else for use. SOA provides a framework for allowing multiple applications access to common, well-defined services. These services can contain code and/or data.
The question for most data warehouse environment’s isn’t whether SOA can improve (or benefit) the data warehouse; it’s understanding how SOA can benefit a data warehouse.
We’ve got lots of clients leveraging SOA to support their data warehouse. They’ve learned they can leverage SOA techniques and coding to deliver standardized data cleansing and data validation to a range of business applications. They have also upgraded the operational system data extraction code to leverage SOA which allowed other application systems (or data marts) to reuse their code.
However, their use of the SOA hasn’t been focused on enhancing the data warehouse environment as much as has been focused on packaging their development efforts for others to use. Most data warehouse developers invest heavily in navigating IT’s labyrinth of operational systems and application data in order to identify, cleanse, and load data into their warehouses. What they’ve learned is that for every new ETL script, there are probably 20 other systems that have to custom developed their own data retrieval code and never documented it. The value that many data warehouse developers find with SOA isn’t that they are improving their data warehouse; they’re just addressing the limitations of the application systems.