In my last blog post, I described the reality of so-called analytical data integration, which is really just a fancy name for ETL. Now let's talk about so-called operational data integration. I'm assuming that when the vendors talk about this, it's the same thing as "data integration for operational systems." Most business applications use point-to-point solutions to retrieve and integrate data for their own specific processing needs. This is ETL in reverse: it's a "pull" process as opposed to a "push" process.
Unfortunately, this involves a lot of duplicate processing just so people can access individual records from source systems. And as with their analytical brethren, the moment a source system changes, the work needed to support the modification multiplies across every job built against that system. Multiply this by thousands of data elements and dozens of source systems, and you’ll find a farm of silos and hundreds (if not thousands) of data integration jobs. It's not an uncommon problem.
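To put rough numbers on that sprawl, here is a back-of-the-envelope sketch in Python. The function names and the counts (24 sources, 40 applications) are illustrative assumptions, not figures from any real environment. It assumes the worst case, where every application maintains its own job against every source it reads.

```python
def point_to_point_jobs(num_sources: int, num_consumers: int) -> int:
    """Upper bound on integration jobs when every consuming application
    builds its own job against every source system (worst case)."""
    return num_sources * num_consumers


def jobs_touched_by_source_change(num_consumers: int) -> int:
    """Jobs needing rework when ONE source system changes, assuming each
    consuming application has its own job against that source."""
    return num_consumers


# Illustrative numbers only: "dozens" of sources and applications.
total_jobs = point_to_point_jobs(num_sources=24, num_consumers=40)
rework = jobs_touched_by_source_change(num_consumers=40)

print(total_jobs)  # 960
print(rework)      # 40
```

Even with modest counts, a single source change can ripple into dozens of separate jobs, which is where the "hundreds (if not thousands)" comes from.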
In most BI environments, we begin with a large batch data movement process. We build our ETL so it can run overnight. But our data volumes are such that overnight isn’t enough, so the next evolution is building "trickle load" ETL. The issue here is that data integration is less about how the data is used than about when the data is needed and the level of data quality required. Most operational systems don’t clean the data; they just move it. And most ETL jobs for data warehouses will standardize the formatting, but they won’t change the values. (And if they do fix the values, they don’t communicate those changes back to the source systems.)
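The difference between standardizing a format, fixing a value, and communicating the fix back to the source is easy to blur, so here is a minimal Python sketch of the three. Everything in it is hypothetical: the `CustomerRecord` shape, the `STATE_CORRECTIONS` lookup, and the `write_back` flag are assumptions made for illustration, and plain dictionaries stand in for the source system and the warehouse.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CustomerRecord:
    customer_id: str
    name: str
    state: str


# Hypothetical reference data used to correct bad values.
STATE_CORRECTIONS = {"CALIF": "CA", "N.Y.": "NY"}


def standardize_format(rec: CustomerRecord) -> CustomerRecord:
    """Formatting only: collapse whitespace and normalize case.
    The underlying value is left exactly as the source supplied it."""
    return replace(
        rec,
        name=" ".join(rec.name.split()).title(),
        state=rec.state.strip().upper(),
    )


def correct_values(rec: CustomerRecord) -> CustomerRecord:
    """Value correction: replace a known-bad value with the right one."""
    return replace(rec, state=STATE_CORRECTIONS.get(rec.state, rec.state))


def provision(source: dict, warehouse: dict, customer_id: str,
              write_back: bool = False) -> CustomerRecord:
    """Move one record from source to warehouse, cleaning it on the way.
    With write_back=True the corrected record is also sent back to the
    source (bi-directional); otherwise the source keeps its bad data."""
    cleaned = correct_values(standardize_format(source[customer_id]))
    warehouse[customer_id] = cleaned
    if write_back:
        source[customer_id] = cleaned
    return cleaned


source = {"42": CustomerRecord("42", "  jane   DOE ", " calif ")}
warehouse = {}

provision(source, warehouse, "42")                   # one-way
print(warehouse["42"].state, repr(source["42"].state))

provision(source, warehouse, "42", write_back=True)  # bi-directional
print(warehouse["42"].state, repr(source["42"].state))
```

In the one-way call the warehouse ends up with the corrected state while the source system still holds the original, uncorrected value. Only the second call closes the loop.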
If I have specialized data needs, I should be building specialized integration logic. If I have commodity or standard needs for data that everyone uses, the data should be highly cleansed.
So it's not about analytical versus operational data integration. It's not even about how the data is used. It's really about one-way versus bi-directional data provisioning. As usual, the word integration is used too loosely. In either case, the presumption that the target is a relational database is naïve. And whether it's for analytical or operational integration is beside the point.
I’ve been hearing a bit lately about the difference between “analytical data integration” and “operational data integration.” I don’t agree with the distinction any more than I agree with analytical versus operational MDM. In this blog post, I’ll characterize analytical data integration. Warning: It won’t be pretty. In my next one, I’ll take on operational data integration (ditto).
The analytics folks build their own specialized ETL jobs to pull data from operational systems and business applications. They often ignore data cleansing and transform the data to suit their own particular needs. Most of the time, this is a custom activity. Each time there’s a new report or data mart, new ETL development occurs.
It’s important to realize that data integration is not just about moving data between databases: it’s about moving and merging multiple data sources independent of their format or function. We’re talking more than just relational databases here: we’re talking applications, flat files, objects, APIs, data services (SOA), hierarchical structures, and dozens of others.
Everyone acknowledges that this work consumes about 40 percent of the overall cost of the analytical program. Stovepipe data maintenance activities are rampant and wasteful. In reality, a lot of ETL work involves a depressing amount of duplicate effort. It’s rare that a business application doesn’t already have at least one piece of ETL written against it. The urge to operationally integrate data can be seen as a remedy for this. But is it really?