I was recently asked about my opinion for the potential of Hadoop replacing a company’s data warehouse (DW). While there’s lots to be excited about when it comes to Hadoop, I’m not currently in the camp of folks that believe it’s practical to use Hadoop to replace a company’s DW. Most corporate DW systems are based on commercial relational database products and can store and manage multiple terabytes of data and support hundreds (if not thousands) of concurrent users. It’s fairly common for these systems to handle complex, mixed workloads –queries processing billions of rows across numerous tables along with simple primary key retrieval requests all while continually loading data. The challenge today is that Hadoop simply isn’t ready for this level of complexity.
All that being said, I do believe there’s a huge opportunity to use Hadoop to replace a significant amount of processing that is currently being handled by most DWs. Oh, and data warehouse user won’t be affected at all.
Let’s review a few fundamental details about the DW. There’s two basic data processing activities that occur on a DW: query processing and transformation processing. Query processing is servicing the SQL that’s submitted from all of the tools and applications on the users’ desktops, tablets, and phones. Transformation processing is the workload involved with converting data from their source application formats to the format required by the data warehouse. While the most visible activity to business users is query processing, it is typically the smaller of the two. Extracting and transforming the dozens (or hundreds) of source data files for the DW is a huge processing activity. In fact, most DWs are not sized for query processing; they are sized for the daily transformation processing effort.
It’s important to realize that one of the most critical service level agreements (SLAs) of a DW is data delivery. Business users want their data first thing each morning. That means the DW has to be sized to deliver data reliably each and every business morning. Since most platforms are anticipated to have a 3+ year life expectancy, IT has to size the DW system based on the worst case data volume scenario for that entire period (end of quarter, end of year, holidays, etc.) This means the DW is sized to address a maximum load that may only occur a few times during that entire period.
This is where the opportunity for Hadoop seems pretty obvious. Hadoop is a parallel, scalable framework that handles distributed batch processing and large data volumes. It’s really a set of tools and technologies for developers, not end users. This is probably why so many ETL (extract, transformation, and load) product vendors have ported their products to execute within a Hadoop environment. It only makes sense to migrate processing from a specialized platform to commodity hardware. Why bog down and over invest in your DW platform if you can handle the heavy lifting of transformation processing on a less expensive platform?
Introducing a new system to your DW environment will inevitably create new work for your DW architects and developers. However, the benefits are likely to be significant. While some might view such an endeavor as a creative way to justify purchasing new hardware and installing Hadoop, the real reason is to extend the life of the data warehouse (and save your company a bunch of money by deferring a DW upgrade)
IT organizations have spent enormous sums of money over the past 10-15 years attacking productivity. They’ve acquiring data integration tools, implemented improved development methodologies, and even reengineered requirements gathering methods to ensure business priority alignment. And the result of all of this investment? Today’s data integration developers are easily 10x to 20x more productive than the COBOL programmers of the past. This shouldn’t be a surprise to anyone – writing, compiling, linking, and testing 3rd generation code is much slower than today’s GUI-based, drag-and-drop development tools. The tools work; developers are faster, quicker, and better.
So, why does it still seem to take an eternity and cost a fortune to acquire and integrate new data into an existing report? The bottleneck has moved upstream: finding and extracting source data is complicated and time consuming. We’ve invested in our Integration Competency Centers to create an assembly line to streamline the process of transforming and converting data that is loaded into databases or applications. Unfortunately, we’ve not devoted any effort in simplifying access or understanding the actual raw source data that feeds the assembly line.
Henry Ford didn’t invent the assembly line, he revolutionized it. One of the changes that he introduced to the assembly line was simplifying and standardizing parts and the actual assembly process. Prior to Ford’s assembly line, car assembly was a custom effort that required highly trained craftsmen to shape, tool, and fit parts by hand (in a very time consuming process). The parts weren’t always uniform, so the craftsmen had to spend a significant amount of time fitting the parts together.
In most IT environments, source system access and data content varies across the different application systems dramatically. This forces developers to become data craftsmen in order to deal with the data idiosyncrasies associated with the numerous source systems common to most companies. Every system stores data in a custom and unique manner; it takes a lot of time to search and analyze source system data in order to identify the necessary content. (A popular ERP package stores its details in more than 10,000 tables) So, each new request often requires developers to create “from scratch” code to access and manipulate new data from a source system. If you dig a bit, you’ll probably find that many of your application systems generate dozens or hundreds (yes, hundreds) of custom extracts to deliver data to support the various production business needs within your company.
While most folks might think that custom extracts are a reasonably decent solution, they’re not. In fact, they’re a problem that will only get worse with time. (Remember, every extract requires development time and ongoing support.) You’ll be better off consolidating all of those extracts into a single set that includes all of the data. This will reduce processing time, reduce storage, reduce maintenance, and ultimately save a lot of money. You’ll have to spend some time designing and building these new extracts and getting folks to migrate to using them, but the benefits will be significant. (One of my clients was able to defer a platform upgrade due to the CPU and storage reduction caused by the consolidation and removal of all of the custom extracts).
Standardizing source data to reduce the data craftsmen problem isn’t rocket science, but it’s more than simply creating a data dump or generating a backup file. You need to deliver data in a manner that can be quickly and easily consumed by other systems. This means that the content needs to be reformatted from the unique (sometimes indecipherable) format of the host application into a format that everyone else can use. This can be easily addressed by delivering data into database tables or flat files (I know one client that delivers data in tab delimited spreadsheet format). The data should reflect the values generated by the source system in a format that everyone can understand – the content shouldn’t be modified for cleansed (this is source data, not content ready for business consumption). Delivery should occur in a frequent and regular basis along with a plan for archiving a decent amount of history.
This isn’t a new concept; this was a common approach in the days when custom coded IBM mainframe applications were all the rage. Back then, data sharing was a priority and every application generated standard extracts to reduce I/O and storage costs. There was also an extreme sensitivity to developer time. Requesting a custom extract was frowned upon and rarely approved. Finding and accessing the data was as simple as referencing the extract files that were made available from every application system.
When it comes to improving the delivery speed of new data to business users, maybe we can learn something from Henry Ford and the world of mainframe development.
Unless you’ve been hiding in a cave in the past year, you’ve probably heard of CEP (Complex Event Processing) or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology allows users to query and manipulate discrete pieces of information, like events and messages, in real-time—without being encumbered by a traditional database management system.
The analogy here is that if you can’t bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record? Furthermore, this lets me analyze the data right after its time of creation! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?
I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.
When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better approach to traditional ETL. CEP provides complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own process. This prevents the unnecessary or reckless copying of source application content.
There are many industries need a real-time view of customer activities. For instance in the gaming industry when a customer inserts her card into a slot machine, the casino wants to provide a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, to transform and standardize it, to load it into a table, only to make it available to a query to identify the best offer. In the world of CEP we’d simply query the initial message and make the best offer.
Many ETL tools already use query language constructs and operators to manipulate data. They typically require the data to be loaded into a database. The major vendors have evolved to an “ELT” architecture: to leverage the underlying database engine to address performance. Why not simply tackle the performance problem directly and bypass the database altogether?
The promise of CEP a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher performance and easier-to-use alternative. The interesting part will be seeing how long before companies emerge from their caves and adopt it.
photo by Orin Zebest via Flickr (Creative Commons license)
In the motion picture industry, studios separate responsibilities for creating content from responsibilities for distributing content. The people who make the movies option the scripts, hire the talent, and film the scenes. The distributors of the films, on the other hand, figure out how to package and deploy the films. They need to know which theaters require 30 millimeter versus 70 millimeter formats, or even IMAX. They also deal with DVD packaging, including different international DVD formats. The industry understands the importance of having a supply chain that differentiates between the roles of content creation, content packaging, and distribution.
In IT we’re very quick to point to our operational systems as creators and owners of data. But maybe the solution is that IT establishes a functional team that’s responsible for data packaging and distribution, just like the movie industry.
Traditionally data formats and standards have fallen into the realm of the architecture team. Unfortunately this is typically a paper-only activity without teeth. A data distribution team wouldn’t focus on paperwork. They would be focused on data logistics, receiving content from the various source systems and packaging the data for consumption by other systems. This isn’t about implementing a specific platform to store or move data. It’s about active management of corporate data content.
One of the biggest development challenges is the hunting expedition that developers go on to find and acquire the data they need. Most aren’t aware of all their choices, let alone the optimal systems of record.
Currently every application, data mart, data warehouse, reporting system that needs data from another system follows a specific set of procedures to obtain that data. Each system requests different data formats, different delivery schedules, and different content. Everything is custom, there are few if any standards, and there are no economies of scale.
This will also unburden the various application teams from building and maintaining the never ending volume of custom extract requests. The only way to stop the madness is to compartmentalize content creation from data packaging and distribution. This means establishing a data supply chain that separates data creators from data distribution from consumers. Who knew IT infrastructure was just like the movies?
In my last blog post, I described the reality of so-called analytical data integration, which is really just a fancy name for ETL. Now let's talk about so-called operational data integration. I'm assuming that when the vendors talk about this, it's the same thing as "data integration for operational systems." Most business applications use point-to-point solutions to retrieve and integrate data for their own specific processing needs. This is ETL in reverse: it's a "pull" process as opposed to a "push" process.
Unfortunately this involves a lot of duplicate processing for people to access individual records from source systems. And like their analytical brethren, the moment a source system changes, there is exponential work necessary to support the new modification. Multiply this by thousands of data elements and dozens of source systems, you’ll find a farm of silos and hundreds (if not thousands) of data integration jobs. It's not an uncommon problem.
In most BI environments we begin with a large batch data movement process. We build our ETL so it can occur overnight. But our data volumes are such that overnight isn’t enough. So the next evolution is building "trickle load" ETL. The issue here is that data integration is less about how the data is used as it is when the data is needed and the level of data quality. Most operational systems don’t clean the data, they just move it. And most ETL jobs for data warehouses will standardize the formatting but they won’t change the values. (And if they do fix the values, they don’t communicate those changes back to the source systems.)
If I have specialized data needs I should be building specialized integration logic. If I have commodity or standard needs for data that everyone uses, the data should be highly cleansed.
So it's not about analytical versus operational data integration. It's not even about how the data is used. It's really about one-way versus bi-directional data provisioning. As usual, the word integration is used too loosely. In either case, the presumption that the target is a relational database is naïve. And whether it's for analytical or operational integration is beside the point.
I’ve been hearing a bit lately on the difference between “analytical data integration” and “operational data integration.” I don’t agree with the distinction any more than I agree with analytical versus operational MDM. In this blog post, I’ll characterize analytical data integration. Warning: It won’t be pretty. In my next one, I’ll take on operational data integration (ditto).
The analytics folks build their own specialized ETL jobs to pull data from operational systems and business applications and often ignore data cleansing, transforming the data on their own particular needs. Most of the time, this is a custom activity. Each time there’s a new report or data mart, new ETL development occurs.
It’s important to realize that data integration is not just about moving data between databases: it’s about moving and merging multiple data sources independent of their format or function. We’re talking more than just relational databases here: we’re talking applications, flat files, objects, APIs, data services (SOA), hierarchical structures, and dozens of others.
Everyone acknowledges that this work consumes about 40 percent of the overall cost of the analytical program. Stovepipe data maintenance activities are rampant, and wasteful. In reality, a lot of ETL work involves a depressing amount of duplicate effort. It’s rare that a business application doesn’t already have at least one piece of ETL written against it. The urge to operationally integrate data can be seen as a remedy for this. But is it really?