I’ve been making the point for the past several years that master data management (MDM) development projects are different and come with their own unique challenges. Because of the “newness” of MDM and its unique value proposition, MDM development can challenge traditional IT development assumptions.
MDM is very much a transaction processing system; it receives application requests, processes them, and returns a result. The complexities of transaction management, near real-time processing, and the details associated with security, logging, and application interfaces are a handful. Most OLTP applications assume that the provided data is usable; if the data is unacceptable, the application simply returns an error. Most OLTP developers are accustomed to addressing these types of functional requirements. Dealing with imperfect data has traditionally been unacceptable because it slowed down processing; ignoring it or returning an error was considered a best practice.
What’s different about MDM development is the focus on data content (and value-based) processing. The whole purpose of MDM is to deal with all data, including the unacceptable stuff. Where OLTP code assumes the data is good enough, MDM code assumes the data is complex and “unacceptable” and focuses on figuring out the values. The development methods associated with deciphering, interpreting, or decoding unacceptable data to make it usable are very different. They require a deep understanding of a different type of business rule – those associated with data content. Because most business processes have data inputs and data outputs, there can be dozens of data content rules associated with each business process. Traditionally, OLTP developers didn’t focus on these content rules; they focused on automating business processes.
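To make the contrast concrete, here’s a minimal Python sketch of the two mindsets applied to a single phone-number attribute. The function names and cleansing rules are illustrative assumptions, not the logic of any particular MDM product:

```python
# Hypothetical illustration: an OLTP-style check rejects imperfect input,
# while an MDM-style content rule tries to interpret the value first.
import re
from typing import Optional

def oltp_validate_phone(raw: str) -> str:
    """Typical OLTP behavior: accept only a clean value, otherwise error out."""
    if not re.fullmatch(r"\d{10}", raw):
        raise ValueError(f"invalid phone number: {raw!r}")
    return raw

def mdm_standardize_phone(raw: str) -> Optional[str]:
    """MDM-style content rule: strip noise, interpret common formats, and only
    flag the value for a data steward if nothing usable remains."""
    digits = re.sub(r"\D", "", raw)               # drop punctuation, spaces, letters
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                       # tolerate a leading country code
    return digits if len(digits) == 10 else None  # None -> route to stewardship

print(mdm_standardize_phone("(312) 555-0143"))    # -> 3125550143
print(mdm_standardize_phone("+1 312.555.0143"))   # -> 3125550143
```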
MDM developers need to be comfortable addressing the various data content processing issues (identification, matching, survivorship, etc.) along with the well-understood issues of OLTP development (transaction management, high performance, etc.). We’ve learned that the best MDM development environments invest heavily in data analysis and data management during the initial design and development stages. They invest in profiling and analyzing each system of creation. They also differentiate hub development from source on-boarding and hub administration. The team that focuses on application interfaces, CRUD processing, and transaction and bulk processing requires different skills from the developers focused on match processing rules, application on-boarding, and hub administration. The developers focused on hub construction are different from the team members focused on the data changes and value questions coming from data stewards and application developers. This isn’t about differentiating development from maintenance; it’s about differentiating the skills associated with the various development activities.
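As a small example of the survivorship side of that skill set, here’s a minimal sketch that builds a “golden record” from source records already matched to one party. The source ranking, attribute names, and tie-breaking rule are illustrative assumptions:

```python
# Minimal survivorship sketch: pick a surviving value per attribute using
# source trust ranking, then recency. Rankings and attributes are hypothetical.
from datetime import date

SOURCE_RANK = {"CRM": 1, "BILLING": 2, "LEGACY": 3}   # lower = more trusted

matched_records = [
    {"source": "LEGACY",  "updated": date(2008, 1, 5),  "email": "jo@old.example", "phone": "3125550143"},
    {"source": "CRM",     "updated": date(2009, 6, 2),  "email": "jo@new.example", "phone": None},
    {"source": "BILLING", "updated": date(2009, 3, 30), "email": None,             "phone": "3125550199"},
]

def survive(attribute: str) -> str:
    candidates = [r for r in matched_records if r.get(attribute)]
    # prefer the most trusted source; break ties with the most recent update
    best = min(candidates, key=lambda r: (SOURCE_RANK[r["source"]], -r["updated"].toordinal()))
    return best[attribute]

golden_record = {attr: survive(attr) for attr in ("email", "phone")}
print(golden_record)   # {'email': 'jo@new.example', 'phone': '3125550199'}
```

A production hub carries far richer precedence rules, but the shape of the decision is the same.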
If the MDM team does its job right, it can dramatically reduce the data errors that cause application processing and reporting problems. It can also identify and quantify data problems so that other development teams can recognize them. This is why MDM development is critical to creating the single version of truth.
In the motion picture industry, studios separate responsibility for creating content from responsibility for distributing it. The people who make the movies option the scripts, hire the talent, and film the scenes. The distributors, on the other hand, figure out how to package and deploy the films. They need to know which theaters require 35 millimeter versus 70 millimeter formats, or even IMAX. They also deal with DVD packaging, including different international DVD formats. The industry understands the importance of having a supply chain that differentiates among the roles of content creation, content packaging, and distribution.
In IT we’re very quick to point to our operational systems as creators and owners of data. But maybe the solution is that IT establishes a functional team that’s responsible for data packaging and distribution, just like the movie industry.
Traditionally data formats and standards have fallen into the realm of the architecture team. Unfortunately this is typically a paper-only activity without teeth. A data distribution team wouldn’t focus on paperwork. They would be focused on data logistics, receiving content from the various source systems and packaging the data for consumption by other systems. This isn’t about implementing a specific platform to store or move data. It’s about active management of corporate data content.
One of the biggest development challenges is the hunting expedition that developers go on to find and acquire the data they need. Most aren’t aware of all their choices, let alone the optimal systems of record.
Currently, every application, data mart, data warehouse, or reporting system that needs data from another system follows its own set of procedures to obtain that data. Each system requests different data formats, different delivery schedules, and different content. Everything is custom, there are few if any standards, and there are no economies of scale.
A data distribution team would also unburden the various application teams from building and maintaining a never-ending volume of custom extract requests. The only way to stop the madness is to compartmentalize content creation from data packaging and distribution. This means establishing a data supply chain that separates data creators from data distributors from data consumers. Who knew IT infrastructure was just like the movies?
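To show what “packaging and distribution” might look like in code, here’s a minimal sketch of a distribution layer that maps whatever a source publishes into one canonical layout and then delivers it in the format each consumer needs. The canonical fields and delivery formats are illustrative assumptions:

```python
# Minimal sketch of a data distribution layer: sources publish once, the layer
# packages the content into a canonical layout, and consumers subscribe in the
# delivery format they need instead of writing custom extracts.
import csv, io, json

CANONICAL_FIELDS = ["customer_id", "name", "postal_code"]   # hypothetical layout

def package(source_rows: list[dict]) -> list[dict]:
    """Map whatever a source sends into the canonical customer layout."""
    return [{f: row.get(f, "") for f in CANONICAL_FIELDS} for row in source_rows]

def distribute(canonical_rows: list[dict], fmt: str) -> str:
    """Hand the same content to consumers in the delivery format they need."""
    if fmt == "json":
        return json.dumps(canonical_rows)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=CANONICAL_FIELDS)
        writer.writeheader()
        writer.writerows(canonical_rows)
        return buf.getvalue()
    raise ValueError(f"unsupported delivery format: {fmt}")

rows = package([{"customer_id": "42", "name": "Acme Corp", "postal_code": "60606", "extra": "x"}])
print(distribute(rows, "json"))
print(distribute(rows, "csv"))
```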
I frequently describe MDM as subject area data integration. The whole point of mastering and managing data is to simplify data sharing, since confusion only occurs when you have two or more instances of data and they don’t match. It’s important to realize that mastering data isn’t really necessary if you only have a single system that contains one copy of the data. After all, how much confusion or misunderstanding can occur when there’s only one copy? The challenge in making data more usable and easier to understand exists because most companies have multiple application systems, each with its own copy of data (and its own “version of truth”). MDM’s promise is to deliver a single view of subject area data. In our book, Customer Data Integration: Reaching a Single Version of the Truth (John Wiley & Sons, 2006), Jill Dyché and I defined MDM as:
“The set of disciplines and methods to ensure the currency, meaning, and quality of a company’s reference data that is shared across various systems and organizations.”
As companies have grown, so too has the number of systems that require access to each other’s data. This is why data integration has become one of the largest custom development activities undertaken within an IT organization. It’s rare that all systems (and their developers) integrate data the same way. While there may be rigor within an individual application or system, it’s highly unlikely that all systems manipulate an individual subject area in a consistent fashion. This lack of integrity and consistency becomes visible when information on two different systems conflicts. MDM isn’t a silver bullet for this problem; it is a method for addressing data problems one subject area at a time.
The reason for establishing a boundary around a subject area is that the complexity, rules, and usage of data within most organizations tend to differ by subject area. Examples of subject areas include customer, product, and supplier. There can be literally dozens, if not hundreds, of subject areas within any given company.
Figure 1: Different Data Subject Areas
Do you need to master every subject area? Probably not. MDM projects focus on the subject areas that suffer the most from inaccuracies, mistakes, and misunderstandings: for instance, customers with inaccurate identification numbers, products missing descriptive information, or an employee with an inaccurate start date. The idea behind master data management is to establish rules, guidelines, and rigor for subject area data.
The rules associated with identifying a customer are typically well defined within a company. The rules associated with adding a new product to the sales catalog are also well defined. The thing to keep in mind is that the rules associated with product have nothing to do with customers. Additionally, most companies have rules that limit what customer data can be modified, and rules that restrict how product information can be manipulated. The idea behind MDM is to manage these rules and methods so that all application systems manipulate reference data in a consistent way.
Implementing MDM isn’t just about building and deploying a server that contains the “master list” of reference data; that’s the easy part. MDM’s real challenge is integrating the functionality into the multitude of application systems that exist within a company. The idea is that when a new customer is added, all systems are aware of the change and have equal access to that data.
For instance, one of the most universal challenges in business today is managing a customer’s marketing preferences. When a customer asks to opt out of all marketing communications, it’s important that all systems are aware of this choice. Problems typically occur when a particular data element can be modified from multiple locations (e.g., a web page, an 800 number, or even the US Postal Service). MDM provides the means of ensuring that the master data is managed correctly and that all systems become aware of the change (and the new data) in a manner that supports the business’s needs.
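Here’s a minimal sketch of that propagation, assuming a simple publish/subscribe pattern around the hub; the subscriber systems and channel names are hypothetical:

```python
# Minimal sketch: a mastered change (a marketing opt-out) is applied once and
# every registered system is notified, regardless of which channel captured it.
from typing import Callable

subscribers: list[Callable[[str, dict], None]] = []

def subscribe(system_callback: Callable[[str, dict], None]) -> None:
    subscribers.append(system_callback)

def update_master(customer_id: str, changes: dict, channel: str) -> None:
    """Apply the change to the master record, then notify every consumer."""
    # ... persist the change to the hub here ...
    for notify in subscribers:
        notify(customer_id, {**changes, "source_channel": channel})

subscribe(lambda cid, chg: print(f"campaign system: {cid} -> {chg}"))
subscribe(lambda cid, chg: print(f"call center:     {cid} -> {chg}"))

# The same opt-out arrives whether it came from the web, the 800 number, or mail.
update_master("42", {"marketing_opt_out": True}, channel="web")
```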
I’ve noticed lately that data warehouse vendors are dusting off the arguments and pitches of days gone by. Don’t buy specialized hardware for your database needs! You’ll never be able to re-use the gear! One rep recently told a client, “With your data warehouse on our hardware, you can re-purpose the hardware at any time!”
The truth is, while data warehouse failures were rampant a few years ago, those failures are now the exception and not the rule. Data warehouses, once installed, tend to last a while. The good ones actually add more data over time and become more entrenched among user organizations. The great ones become strategic, and business people claim not to be able to do their jobs without them. A data warehouse platform is rarely for a single use, but for a multitude of needs. Data warehouses rarely just go away.
However, don’t confuse an entrenched data warehouse with an entrenched data integration solution. I teach a class at The Data Warehousing Institute (TDWI) conferences called “Architectural Options for Data Integration.” The class covers technologies like Enterprise Application Integration (EAI); Enterprise Information Integration (EII); Extract, Transform, and Load (ETL, and its sibling, ELT); and Master Data Management (MDM). I present use cases for these different solutions as well as lists of the key vendors that offer them.
Attendees I talk to admit coming to the class with the intent of justifying the data warehouse as a multi-purpose integration system. They leave the class understanding the often-stark differences of these various solutions. And I hope they return to work with a different view of their future-state integration architectures, whether they re-purpose their hardware or not.
Note: Evan will be teaching Beyond the Data Warehouse: Architectural Options for Data Integration at the TDWI World Conference in San Diego on Thursday, August 6.
A lot of our new clients have asked us to build MDM business cases to support their merger and acquisition strategies. Specifically, they’re looking to support the following four activities:
- Recent corporate mergers
Collectively, these activities can roll up into a category called corporate restructuring. Contrary to popular belief, restructuring isn’t just a financial challenge. It includes realignment of marketing activities (for instance, reconciling promotions and re-aligning diverse product sets), sales (reorganizing territories and compensation plans), and operational issues (company locations, product inventories).
Most companies approach restructuring as a one-time-only activity in which an army of analysts tries to reconcile financial structures, from organizational hierarchies to budgets to the accounts themselves. The fact is, these activities aren’t just part of high-profile M&A events; they occur every year as companies go through their annual budget processes. And during a corporate restructuring, the reconciliation usually takes longer than the acquisition itself.
Three principal MDM features lend themselves to this restructuring work: matching, grouping, and linking. MDM excels at matching “like” items from disparate sources, tracking and managing hierarchies and groupings, and linking disparate data sources to enable ongoing data integration. The point is that the act of merging organizations also means consolidating details across the companies. Most people consider this a one-time-only activity; in fact, it must be an ongoing process.
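Here’s a minimal sketch of the match-and-link step during a merger, assuming records from both companies can be matched on a shared identifier; a single DUNS-style key stands in for what would really be a weighted, multi-attribute rule set:

```python
# Minimal match-and-link sketch: records from both companies are matched on a
# normalized key and linked to one master ID via a cross-reference, without
# changing either source system.
import itertools

company_a = [{"id": "A-17", "name": "ACME Corp.",       "duns": "123456789"}]
company_b = [{"id": "B-09", "name": "Acme Corporation", "duns": "123456789"}]

def match_key(rec: dict) -> str:
    # a real rule set would weigh several attributes; DUNS alone is a stand-in
    return rec["duns"]

master_ids = itertools.count(1)
by_key: dict[str, str] = {}              # match key -> master ID
xref: dict[str, list[str]] = {}          # master ID -> linked source record IDs

for rec in company_a + company_b:
    key = match_key(rec)
    if key not in by_key:
        by_key[key] = f"M-{next(master_ids)}"
        xref[by_key[key]] = []
    xref[by_key[key]].append(rec["id"])

print(xref)   # {'M-1': ['A-17', 'B-09']} -- both sources keep operating as-is
```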
When one company buys another, it’s typical to allow the acquired company to continue to operate using the same systems and methods it always has. The acquiring company simply needs to know how to integrate the information into their existing business. Consider Berkshire Hathaway. They acquire companies frequently, but don’t change how they run their business. They simply know how to reconcile and roll up the details.
Ideally, corporate restructuring means establishing a process that allows organizations to continue their operations using their existing systems. IT systems reconciliation simply cannot get in the way of running business operations. All too often, the answer is, “Replace their systems with ours.” That answer means the acquired organization must reengineer its business, and that simply takes too long.
MDM gives a company the capability to link data content from disparate systems within and across companies. I’m not talking about linking Linux with Windows; I’m talking about matching and linking business content across dozens or even hundreds of systems. This way invoices continue going out, salespeople continue getting commissions, and customers can still get product support, all in a seamless way.
Next time you’re discussing corporate restructuring and someone says the word “re-platform,” ask the question, “If we can link and move the data to continue to support core business processes, then we wouldn’t have to disrupt our operational systems, right?” Matching and linking the data across core systems can save a lot in terms of software and labor costs. But improving the data where it lies? Priceless.
In my last blog post, I described the reality of so-called analytical data integration, which is really just a fancy name for ETL. Now let's talk about so-called operational data integration. I'm assuming that when the vendors talk about this, it's the same thing as "data integration for operational systems." Most business applications use point-to-point solutions to retrieve and integrate data for their own specific processing needs. This is ETL in reverse: it's a "pull" process as opposed to a "push" process.
Unfortunately, this involves a lot of duplicate processing just so applications can access individual records from source systems. And like their analytical brethren, the moment a source system changes, exponential work is necessary to support the modification. Multiply this by thousands of data elements and dozens of source systems, and you’ll find a farm of silos and hundreds (if not thousands) of data integration jobs. It’s not an uncommon problem.
In most BI environments we begin with a large batch data movement process. We build our ETL so it can occur overnight. But our data volumes are such that overnight isn’t enough, so the next evolution is building “trickle load” ETL. The issue here is that data integration is less about how the data is used than about when the data is needed and what level of data quality is required. Most operational systems don’t clean the data; they just move it. And most ETL jobs for data warehouses will standardize the formatting, but they won’t change the values. (And if they do fix the values, they don’t communicate those changes back to the source systems.)
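As an illustration of that batch-to-trickle shift, here’s a minimal sketch of an incremental pull driven by a high-water mark, where the transform reformats a value without actually correcting it. The column names and sample rows are illustrative assumptions:

```python
# Minimal "trickle load" sketch: pull only rows changed since the last
# high-water mark, and standardize formats without altering the values.
from datetime import datetime

high_water_mark = datetime(2009, 8, 1)   # persisted between runs in practice

def extract_changed(source_rows: list[dict], since: datetime) -> list[dict]:
    return [r for r in source_rows if r["last_updated"] > since]

def standardize(row: dict) -> dict:
    # reformatting only: trim and upper-case; the value itself is not corrected
    return {**row, "country": row["country"].strip().upper()}

source = [
    {"id": 1, "country": " us ", "last_updated": datetime(2009, 7, 30)},
    {"id": 2, "country": "u.s.", "last_updated": datetime(2009, 8, 2)},
]

delta = [standardize(r) for r in extract_changed(source, high_water_mark)]
print(delta)   # only id 2 moves; "u.s." is reformatted to "U.S.", not fixed to "US"

# advance the mark for the next run (assumes a non-empty delta)
high_water_mark = max(r["last_updated"] for r in delta)
```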
If I have specialized data needs, I should be building specialized integration logic. If I have commodity or standard needs for data that everyone uses, the data should be highly cleansed.
So it's not about analytical versus operational data integration. It's not even about how the data is used. It's really about one-way versus bi-directional data provisioning. As usual, the word integration is used too loosely. In either case, the presumption that the target is a relational database is naïve. And whether it's for analytical or operational integration is beside the point.