Unless you’ve been hiding in a cave for the past year, you’ve probably heard of CEP (Complex Event Processing), also known as data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology lets users query and manipulate discrete pieces of information, like events and messages, in real time, without being encumbered by a traditional database management system.
The analogy here is that if you can’t bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record? Furthermore, this lets me analyze the data right after its time of creation! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?
I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.
When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better approach than traditional ETL. CEP applies complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own process. This prevents the unnecessary or reckless copying of source application content.
Many industries need a real-time view of customer activities. In the gaming industry, for instance, when a customer inserts her card into a slot machine, the casino wants to present a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, transform and standardize it, and load it into a table, only to make it available to a query that identifies the best offer. In the world of CEP, we’d simply query the initial message and make the best offer.
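To make the casino scenario concrete, here is a minimal sketch of the CEP idea: act on a single event as it arrives, with no database load in between. The event fields and offer rules are hypothetical, invented for illustration.

```python
def best_offer(event: dict) -> str:
    """Pick an offer from the single event alone -- no table load, no warehouse query."""
    if event.get("loyalty_tier") == "platinum":
        return "free-suite-upgrade"
    if event.get("visits_this_month", 0) >= 4:
        return "dining-credit"
    return "welcome-spin"

# The event is handled at creation time, straight off the message stream.
card_insert = {"player_id": 1123, "loyalty_tier": "platinum", "machine": "slot-17"}
print(best_offer(card_insert))  # a platinum player gets the suite upgrade
```

A production CEP engine would express these rules in a continuous query language rather than application code, but the principle is the same: the logic runs against the message, not against a table.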
Many ETL tools already use query language constructs and operators to manipulate data, but they typically require the data to be loaded into a database first. The major vendors have evolved to an “ELT” architecture, leveraging the underlying database engine to address performance. Why not simply tackle the performance problem directly and bypass the database altogether?
The promise of CEP is a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher-performance and easier-to-use alternative. The interesting part will be seeing how long it takes companies to emerge from their caves and adopt it.
photo by Orin Zebest via Flickr (Creative Commons license)
A few years ago, a mission to Mars failed because someone forgot to convert U.S. measurement units to metric measurement units. Miles weren’t converted to kilometers.
I thought of this fiasco when reading a blog post recently that insisted the only reasonable approach for moving data into a data warehouse was to position the data warehouse as the “hub” in a hub-and-spoke architecture. The assumption here is that data is formatted differently on diverse source systems, so the only practical approach is to copy all this data onto the data warehouse, where other systems can retrieve it.
I’ve written about this topic in the past, but I wanted to expand a bit. I think it’s time to challenge this paradigm for the sake of BI expediency.
The problem is that application systems aren’t responsible for sharing their data. Consequently, little or no effort is paid to pulling data out of an operational system and making it available to others. This forces every data consumer to understand the unique data in every system, which is neither efficient nor scalable.
Moreover, the hub-and-spoke architecture itself is also neither efficient nor scalable. The way manufacturing companies address their distribution challenges is by insisting on standardized components. Thirty-plus years ago, every automobile seemed to have a set of parts unique to that automobile. Auto manufacturers soon realized that if they established specifications under which parts could be applied across models, they could reuse parts, giving them scalability not only across different cars, but across different suppliers.
It’s interesting to me that application system owners aren’t measured on these two responsibilities:
- Business operation processing—ensuring that business processes are automated and supported effectively
- Supplying data to other systems
No one would dispute that the integrated nature of most companies requires data to be shared across multiple systems. The data generated should be standardized: application systems should extract data and package it in a consistent and uniform fashion so that it can be used across many other systems—including the data warehouse—without the consumer struggling to understand the idiosyncrasies of the system it came from.
Application systems should be obligated to establish standard processes whereby their data is made available on a regular basis (weekly, daily, etc.). Since most extracts are column-record oriented, the individual values should be standardized: formatted and named the same way in every extract.
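The standardized-extract idea above can be sketched in a few lines: each source system maps its idiosyncratic record layout into one shared column layout before publishing. The column names, the mapping, and the formatting rules here are all illustrative assumptions, not a prescription.

```python
from datetime import date

# The agreed-upon extract layout every consumer sees.
STANDARD_COLUMNS = ["customer_id", "event_date", "amount_usd"]

def standardize(raw: dict, mapping: dict) -> dict:
    """Rename and reformat one source record into the shared extract layout."""
    row = {std: raw[src] for std, src in mapping.items()}
    # Uniform formatting: ISO dates, numeric two-decimal amounts.
    row["event_date"] = date.fromisoformat(str(row["event_date"])).isoformat()
    row["amount_usd"] = round(float(row["amount_usd"]), 2)
    return {col: row[col] for col in STANDARD_COLUMNS}

# Two systems, two native layouts, one extract format for every consumer.
billing = standardize({"cust": 42, "dt": "2009-06-01", "amt": "19.5"},
                      {"customer_id": "cust", "event_date": "dt", "amount_usd": "amt"})
print(billing)
```

The point isn’t the code; it’s that the mapping lives with the source system, so no downstream consumer ever has to learn the source’s idiosyncrasies.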
Can you modify every operational system to have a clean, standard extract file on Day 1? Of course not. But as new systems are built, extracts should be built with standard data. For every operational system, a company can save hundreds or even thousands of hours every week in development and processing time. Think of what your BI team could do with the resulting time—and budget money!
photo by jason b42882
The recent acquisition of Sun by Oracle has raised a lot of speculative discussion about the latter vendor’s strategic pursuits. The move may or may not result in a power triumvirate of HP-IBM-Oracle. But Oracle expanding its portfolio to include hardware could be a game-changer.
Oracle has a dubious record with hardware plays. Its nCube investment (circa 1988) and its network computer idea (circa 1996) both presented an interesting vision but didn’t deliver tactically. nCube’s video-on-demand push (circa 1994) ended with the product’s decommissioning (circa 2001).
While many are focused on the state of Sun’s numerous DBMS partnerships, I’m more interested in the fate of Storage Technology (StorageTek), which Sun acquired in 2005. Do a little research and you’ll see that EMC stores the lion’s share of DBMS data across enterprise data centers. If Oracle keeps the StorageTek products, it might shave some revenue from EMC and gain an even larger wallet share with IT organizations. Oracle’s intentions are equally unclear around the Exadata product, which has relied on an HP partnership that’s now certainly strained. With the acquisition of Sun, Oracle is better able to go head-to-head with the likes of HP’s Neoview and Teradata.
Clearly the company has the option of producing a database appliance on its own. Personally, I’m waiting to see the level of fear, uncertainty, and doubt Oracle stirs up in the data warehouse appliance market. Oracle hasn’t differentiated its DBMS in years. The differentiation has always been about the company’s size, the number of Fortune 500 customers, its broad array of application offerings, and the fact that its products work on every conceivable hardware platform. This focus on non-database strengths has fanned the flames of the market’s perception that databases are a mere commodity.
I can only imagine what’s going on in Oracle’s slideware development organization right now. Here are some of the messaging scenarios that are likely to be on the table:
Scenario 1: “Through our acquisition of Sun, we can now deliver a more fully-functional database appliance.”
In reality, the whole point of an appliance is to reduce complexity and configuration effort. Prepackaging Oracle on a hardware platform already occurs with companies like Sun, HP, and Dell. This isn’t simpler or better.
Scenario 2: “Oracle can now be your de-facto desktop and development tool provider.”
This one could actually be true. Oracle can leverage Sun’s vast software capabilities in two significant ways. With Sun’s desktop office suite, StarOffice, Oracle could provide a compelling alternative to the Microsoft Office monopoly. Any executive would find it difficult to ignore an Oracle office option, particularly where they’ve made significant investments in Oracle as the corporate database standard. Plus, Oracle can monetize open source software by dramatically improving support revenue from these customers. Microsoft does not deliver customer service and support the way Oracle does—and enterprise clients expect more sophisticated and consistent support than the channel usually delivers.
Scenario 3: “Our Java-based toolset covers the spectrum of development needs without forcing your reliance on a specific vendor. Whether it’s middleware, server development, or reporting, we have the tools to support a multi-tier network enabled environment. You can now come to a single company for a single set of tools regardless of your platform type, desktop, server, or operating system.”
For IT organizations that still rely on custom development, this may dramatically reduce the number of suppliers they need. Over the past few years the number of middleware and application tool vendors has diminished—with Oracle being the buyer of many of them. Most IT organizations prefer fewer vendors. Whether open source or proprietary, the combined Oracle-Sun toolset offers Oracle a significant revenue stream in the support arena.
I’m fascinated that little or no attention has been paid to Sun’s software assets. Combined with Oracle’s DBMS, middleware, and application toolsets, they offer an unexpected alternative to the ongoing IBM and Microsoft battles for enterprise development. Moreover, with Sun’s Java leadership and the popularity of Java in consumer electronics, Oracle can now enter the world of consumer software, a la Apple. The opportunity for Oracle to support media companies that sell directly to the end consumer is wide open.
If it’s not careful, Oracle’s future may be in milking the legacy product cow instead of exploiting its newfound software assets. The real question is, is Oracle a company of innovators or bean counters?
photo by Siomuzzz
I recently read with interest an article in the Microsoft Architect Journal on so-called Service-Oriented Business Intelligence or, as the article’s authors call it, “SoBI.” The article was well-intentioned but confusing. What it confirmed to me is that plenty of experienced IT professionals are struggling to reconcile Service Oriented Architecture (SOA) concepts with business intelligence.
SOA is certainly a valuable tool in the architecture and development toolbox; however, I think it’s only fair to keep SOA in perspective. It’s an evolutionary technology in IT that has numerous benefits for developer productivity and application connectivity. I’m not sure that injecting SOA into a data warehouse environment or framework will do anything more than freshen a few low-level building blocks that have been neglected in some data warehouse environments. I’m certainly not challenging the value of SOA; I’m just trying to put it in perspective for those folks who are focused on data warehouse and business intelligence activities.
The idea behind SOA is to create services (or functions, procedures, etc.) that can be used by other systems. The idea is simple: build once, use many times. This ensures that important (and possibly complicated) application processes can be used by numerous disparate applications. It’s like an application processing supply chain: let the most efficient resource build a service and provide it to everyone else for use. SOA provides a framework for allowing multiple applications access to common, well-defined services. These services can contain code and/or data.
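The “build once, use many times” principle can be illustrated with a toy shared service. The service itself, its rules, and the consuming applications are all hypothetical; a real SOA deployment would expose this behind a web service interface rather than a local function.

```python
def validate_address(record: dict) -> dict:
    """Shared service: cleanse and validate a postal ZIP code once,
    so every consuming system applies identical rules."""
    cleaned = dict(record)
    # Normalize: strip whitespace, left-pad short ZIPs to five digits.
    cleaned["zip"] = str(record.get("zip", "")).strip().zfill(5)
    cleaned["valid"] = len(cleaned["zip"]) == 5 and cleaned["zip"].isdigit()
    return cleaned

# Disparate applications -- order entry, the warehouse ETL, a data mart --
# all call the same service instead of re-implementing the rule.
for app_record in [{"zip": " 2138"}, {"zip": "60601"}]:
    print(validate_address(app_record))
```

The payoff is exactly the supply-chain analogy above: the most capable team builds the cleansing rule once, and every downstream system consumes it instead of re-coding it.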
The question for most data warehouse environments isn’t whether SOA can improve (or benefit) the data warehouse; it’s understanding how.
We’ve got lots of clients leveraging SOA to support their data warehouses. They’ve learned they can use SOA techniques and coding to deliver standardized data cleansing and data validation to a range of business applications. They’ve also upgraded their operational system data extraction code to leverage SOA, allowing other application systems (or data marts) to reuse that code.
However, their use of SOA hasn’t been focused on enhancing the data warehouse environment as much as it has been on packaging their development efforts for others to use. Most data warehouse developers invest heavily in navigating IT’s labyrinth of operational systems and application data in order to identify, cleanse, and load data into their warehouses. What they’ve learned is that for every new ETL script, there are probably 20 other systems that have custom-developed their own data retrieval code and never documented it. The value many data warehouse developers find in SOA isn’t that it improves their data warehouse; it’s that it addresses the limitations of the application systems.