Data Strategy Component: Assemble
This blog is the 4th in a series focused on reviewing the individual components of a Data Strategy. This edition discusses the Assemble component and the numerous details involved in sourcing, cleansing, standardizing, preparing, integrating, and moving data to make it ready to use.
The definition of Assemble is:
“Cleansing, standardizing, combining, and moving data residing in multiple locations and producing a unified view”
In the Data Strategy context, Assemble includes all of the activities required to transform data from its host application context into one that is “ready to use” and understandable by other systems, applications, and users.
Most data used within our companies is generated by the applications that run the company (point-of-sale, inventory management, HR systems, accounting). While these applications generate lots of data, their focus is on executing specific business functions; they don’t exist to provide data to other systems. Consequently, the data they generate is “raw” in form; it reflects the specific aspects of the application (or system of origin). This often means that the data hasn’t been standardized, cleansed, or even checked for accuracy. Assemble is all of the work necessary to convert data from a “raw” state to one that is ready for business usage.
I’ve identified 5 facets to consider when developing your Data Strategy that are commonly employed to make data “ready to use”. As a reminder (from the initial Data Strategy Component blog), each facet should be considered individually. And because your Data Strategy will address future aspirational goals as well as current needs, you’ll likely want to consider different options for each. Each facet can target a small organization’s issues or expand to address a large company’s diverse needs.
Identification and Matching
Data integration is one of the most prevalent data activities occurring within a company; it’s a basic activity employed by developers and users alike. In order to integrate data from multiple sources, it’s necessary to determine the identification values (or keys) from each source (e.g. the employee id in an employee list, the part number in a parts list). The idea of matching is aligning data from different sources with the same identification values. While numeric values are easy to identify and match (using the “=” operator), character-based values can be more complex (due to spelling irregularities, synonyms, and mistakes).
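To make the contrast concrete, here's a minimal sketch of exact numeric matching versus tolerant character-based matching. The field values, the 0.85 similarity threshold, and the use of Python's standard-library difflib are illustrative assumptions, not a prescription for any particular matching tool.

```python
from difflib import SequenceMatcher

# Exact matching works for numeric keys: employee_id 10041 either equals 10041 or it doesn't.
def match_numeric(key_a: int, key_b: int) -> bool:
    return key_a == key_b

# Character-based values (names, descriptions) need a tolerance for
# spelling irregularities, synonyms, and simple mistakes.
def match_text(value_a: str, value_b: str, threshold: float = 0.85) -> bool:
    a, b = value_a.strip().upper(), value_b.strip().upper()  # light standardization first
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(match_numeric(10041, 10041))                            # True
print(match_text("Jon Smith", "John Smith"))                  # True: close enough to match
print(match_text("ACME Mfg. Inc", "Acme Manufacturing Inc"))  # False without synonym/abbreviation rules
```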
Even though it’s highly tactical, identification and matching is important to consider within a Data Strategy to ensure that data integration is processed consistently. One of the main reasons that data variances continue to exist within companies (despite their investments in platforms, tools, and repositories) is that the need for standardized identification and matching has not been addressed.
Survivorship
Survivorship is a pretty basic concept: the selection of the values to retain (or survive) from the different sources being merged. Survivorship rules are often unique to each data integration process and typically determined by the developer. In the context of a data strategy, it’s important to identify the “systems of reference” because identifying these systems gives developers and users clarity about which data elements to retain when integrating data from multiple systems.
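As an illustration, here's a small sketch of what documented survivorship rules might look like when the systems of reference are ranked per attribute. The system names, attributes, and rankings are hypothetical; real survivorship rules are usually richer (recency, completeness, trust scores).

```python
# Ranked "systems of reference" per attribute: the first system with a usable
# value wins. System names and attributes are hypothetical.
SYSTEM_OF_REFERENCE = {
    "email":   ["CRM", "BILLING", "WEB"],
    "address": ["BILLING", "CRM", "WEB"],
}

def survive(attribute, candidates):
    """Pick the value to retain based on the ranked systems of reference."""
    for system in SYSTEM_OF_REFERENCE.get(attribute, []):
        value = candidates.get(system)
        if value:  # skip systems with missing or blank values
            return value
    return None

merged_email = survive("email", {"WEB": "js@old-mail.example", "CRM": "john.smith@example.com"})
print(merged_email)  # "john.smith@example.com" because CRM outranks WEB for email
```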
Standardize / Cleanse
The premise of data standardization and cleansing is to identify inaccurate data and to correct and reformat it to match the requirements (or the defined standards) for a specific business element. This is likely the single most beneficial process for improving the business value (and usability) of data. The most common challenge with data standardization and cleansing is that the requirements can be difficult to define. The other challenge is that most users aren’t aware that their company’s data isn’t standardized and cleansed as a matter of practice. Even though most companies have multiple tools to clean up addresses, standardize descriptive details, and check the accuracy of values, the use of these tools is not common.
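A toy example of what a defined standard might look like for one business element. The assumed requirement here (ten digits formatted as NNN-NNN-NNNN for a North American phone number) is purely illustrative; commercial cleansing tools apply far more sophisticated rules, especially for names and addresses.

```python
import re

def standardize_phone(raw):
    """Standardize to the (assumed) requirement: ten digits as NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", raw)               # strip punctuation, spaces, letters
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                       # drop a leading country code
    if len(digits) != 10:
        return None                               # flag for remediation rather than guessing
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"

print(standardize_phone("(312) 555-0142"))   # 312-555-0142
print(standardize_phone("+1 312.555.0142"))  # 312-555-0142
print(standardize_phone("555-0142"))         # None: fails the defined standard
```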
Reference Data
Wikipedia defines reference data as data that is used to classify or categorize other data. In the context of a data strategy, reference data is important because it ensures the consistency of data usage and meaning across different systems and business areas. Successful reference data means that details are consistently identified, represented, and formatted the same way across all aspects of the company (if the color of a widget is “RED”, then the value is represented as “RED” everywhere – not “R” in the product information system, 0xFF0000 in the inventory system, and 0xED2939 in the product catalog). A reference data initiative is often aligned with a company’s data strategy initiative because of its impact on data sharing and reuse.
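Here's the widget-color example sketched as a simple reference data crosswalk; the system codes and mapping table are hypothetical, but the idea is the same regardless of tooling: every system's local value resolves to the one agreed-upon representation.

```python
# Hypothetical crosswalk: each system's local value maps to the single
# agreed-upon reference value.
COLOR_REFERENCE = {
    ("PRODUCT_INFO", "R"):        "RED",
    ("INVENTORY",    "0xFF0000"): "RED",
    ("CATALOG",      "0xED2939"): "RED",
}

def to_reference_value(system, local_value):
    return COLOR_REFERENCE.get((system, local_value), local_value)

print(to_reference_value("INVENTORY", "0xFF0000"))  # "RED"
print(to_reference_value("CATALOG",   "0xED2939"))  # "RED"
```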
Movement Tracking
The idea of movement tracking is to record the different systems that a data element touches as it travels (and is processed) after it is created. Movement tracking (or data lineage) becomes quite important when the validity and accuracy of a particular data value is questioned. And in the current era of heightened consumer data privacy and protection, tracking the lineage of consumer data within a company is becoming a requirement (and it’s the law in California and the European Union).
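Here's a minimal sketch of what a movement-tracking (lineage) record might capture for each hop a data element makes; the element names, systems, and process names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    element: str         # e.g. "customer.email"
    source_system: str
    target_system: str
    process: str         # the job or service that moved or transformed the data
    moved_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

lineage_log = []
lineage_log.append(LineageEvent("customer.email", "WEB_SIGNUP", "CRM", "nightly_load"))
lineage_log.append(LineageEvent("customer.email", "CRM", "MARKETING_MART", "campaign_extract"))

# When a value is questioned, the recorded hops reconstruct its path.
for event in lineage_log:
    print(f"{event.element}: {event.source_system} -> {event.target_system} via {event.process}")
```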
The dramatic increase in the quantity and diversity of data sources within most companies over the past few years has challenged even the most technologically advanced organizations. One of the most visible areas of user frustration is often access to new (or additional) data sources. Much of this frustration occurs because of the challenge of sourcing, integrating, cleansing, and standardizing new data content so it can be shared with users. As is the case with all of the other components, the details are easy to understand but complex to implement. A company’s data strategy has to evolve and change when data sharing becomes a production business requirement and users want data that is “ready to use”.
The Misunderstanding of Master Data Management
Not long ago, I was asked to review a client’s program initiative that was focused on constructing a new customer repository that would establish a single version of truth. The client was very excited about using Master Data Management (MDM) to deliver their new customer view. The problem statement was well thought out: their customer data is spread across 11 different systems; users and developers retrieve data from different sources; reports reflect conflicting details; and an enormous amount of manual effort is required to manage the data. The project’s benefits were also well thought out: increased data quality, improved reporting accuracy, and improved end user data access. And, (as you can probably imagine), the crowning objective of the project was going to be creating a Single View of the Customer. The program’s stakeholders had done a good job of communicating the details: they reviewed the existing business challenges, identified the goals and objectives, and even provided a summary of high-level requirements. They were going to house all of their customer data on an MDM hub. There was only one problem: they needed a customer data mart, not an MDM hub.
I hate discussing technical terms and details with either business or IT staff. It gets particularly uncomfortable when someone has been misinformed about a new technology (and this happens all the time when vendors roll out new products to their sales force). I won’t count the number of times I’ve seen projects implemented with the wrong technology because the organization wanted to get a copy of the latest and greatest technical toy. A few of my colleagues and I used to call this the “bright shiny project syndrome”. While it’s perfectly acceptable to acquire a new technology to solve a problem, it can be very expensive to purchase a technology and force-fit it onto a problem it doesn’t easily address.
Folks frequently confuse the function and purpose of Master Data Management with data warehousing. I suspect the core of the problem is that when folks hear about the idea of “reference data” or a “golden record”, they have a mental picture of a single platform containing all of the data. While I can’t argue with the benefit of having all the data in one place (data warehousing has been around for more than 20 years), that’s not what MDM is about. Data warehousing became popular because of its success in storing a company’s historical data to support cross-functional (multi-subject area) analysis. MDM is different; it’s focused on reconciling and tracking a single subject area’s reference data across the multitude of systems that create that data. Examples of subject areas include customer, product, and location.
If you look at the single biggest obstacle in data integration, it’s dealing with the complexity of merging data from different systems. It’s fairly common for different application systems to use different reference data (the CRM system, the sales system, and the billing system may each use different values to identify a single customer). The only way to link data from these different systems is to compare the reference data (names, addresses, phone numbers, etc.) from each system with the hope that there are enough identical values to support a match. The problem with this approach is that it simply doesn’t work when a single individual can have multiple name variations, multiple addresses, and multiple phone numbers. The only reasonable solution is the use of advanced algorithms that are specially designed to support the processing and matching of specific subject area details. That’s the secret sauce of MDM – and that’s what’s contained within a commercial MDM product.
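To illustrate why single-attribute comparison falls short, here's a deliberately simplified sketch that weighs evidence across several attributes before declaring a match. The weights, threshold, and difflib-based similarity are stand-ins; the subject-area-specific algorithms inside commercial MDM products are far more sophisticated.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.upper().strip(), b.upper().strip()).ratio()

def is_same_customer(rec_a, rec_b):
    # Weigh the evidence across attributes; no single field is trusted on its own.
    score = (0.5 * similarity(rec_a["name"],    rec_b["name"])
             + 0.3 * similarity(rec_a["address"], rec_b["address"])
             + 0.2 * similarity(rec_a["phone"],   rec_b["phone"]))
    return score >= 0.8  # hypothetical decision threshold

crm     = {"name": "Jonathan Smith", "address": "12 Elm St, Springfield",     "phone": "312-555-0142"}
billing = {"name": "Jon Smith",      "address": "12 Elm Street, Springfield", "phone": "3125550142"}
print(is_same_customer(crm, billing))  # True under these simplified rules
```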
The MDM hub not only contains master records (the details identifying each individual subject area entry), it also contains a cross-reference list linking each entry to every other application system. And it’s continually updated as values change within each individual system. The idea is that an MDM hub is a high-performance, transactional system focused on matching and reconciling subject area reference data. While we’ve illustrated how this capability simplifies data warehouse development, it also enables individual application systems to move and integrate data between transactional systems more efficiently.
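Conceptually, the hub's contents can be pictured as two linked structures: the master record and the cross-reference of local keys. The identifiers below are hypothetical, and a real hub would add history, status, and update timestamps.

```python
# Master records: one entry per real-world entity (identifiers are hypothetical).
master_records = {
    "M-001": {"name": "Jonathan Smith", "address": "12 Elm Street, Springfield"},
}

# Cross-reference: that entity's local key in every application system.
cross_reference = {
    "M-001": {"CRM": "C-4481", "SALES": "S-220913", "BILLING": "B-77-1202"},
}

def local_key(master_id, system):
    """Find the entity's identifier in a given application system."""
    return cross_reference.get(master_id, {}).get(system)

print(local_key("M-001", "BILLING"))  # "B-77-1202"
```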
The enormous breadth and depth of corporate data makes it impractical to store all of our data within a single system. It’s become common practice to prune and trim the contents of our data warehouses to limit the breadth and history of data. If you consider recent advances with big data, cloud computing, and SaaS, it becomes even more apparent that storing all of a company’s subject area data in a single place isn’t practical. That’s one of the reasons that most companies have numerous data marts and operational applications integrating and loading their own data to support their highly diverse and unique business needs. An MDM hub is focused on tracking specific subject area details across multiple systems to allow anyone to find, gather, and integrate the data they need from any system.
I recently crossed paths with the above mentioned client. Their project was wildly successful – they ended up deploying both an MDM hub and a customer data mart to address their needs. They mentioned that one of the “aha” moments that occurred during our conversation was when they realized that they needed to refocus everyone’s attention towards the business value and benefits of the project instead of the details and functions of MDM. While I was thrilled with their program’s success, I was even more excited to learn that someone was finally able to compete against the “bright shiny project syndrome” and win.
Photo “Dirt Pile 2” courtesy of CoolValley via Flickr (Creative Commons license).
Blind Vendor Allegiance Trumps Utility

At the recent Gartner MDM Summit in Las Vegas I was approached at least a half a dozen times by people wondering what MDM vendor to choose. I gave my usual response, which was, “What are you trying to accomplish?”
Normally a (short) conversation ensues about functions, feeds, and speeds, which then leads to my next question: “So, what are your priorities and decision criteria?” The responses were all the same, and I have to admit that they surprised me.
“We know we need MDM, but our company hasn’t really decided what MDM is. Since we’re already a [Microsoft / IBM / SAP / Oracle / SAS] shop, we just thought we’d buy their product…so what do you think of their product?”
I find this type of question interesting and puzzling. Why would anyone blindly purchase a product because of the vendor, rather than focusing on needs, priorities, and cost metrics? Unless a decision has absolutely no risk or cost, I’m not clear how identifying a vendor before identifying the requirements could possibly have a successful outcome.
If I look in my refrigerator, not all my products have the same brand label. My taste, interests, and price tolerance vary based upon the product. My catsup comes from one company, my salad dressing comes from another, and I have about seven different types of mustard (long story). Likewise, my TV, DVD player, surround sound system, DVR, and even my remote control are all different brands. Despite the advertisers’ claims, no single company has the best feature set across all products. For those of you who are loyal to a single brand, you can stop reading now. I’m sure you think I’m nuts.
The fact is that different vendors have different strengths, and this causes their products to differ. Buyers of these products should focus on their requirements and needs, not the product’s functions and features. Somehow this type of logic seems to escape otherwise smart business people. A good decision can deliver enormous benefits to a company; a bad decision can deliver enormous benefits to a company’s competitors.
What other reason would there be for someone saying, “We’re a [vendor name here] shop”? Examples abound of vendors abandoning products: IBM’s Intelligent Miner data mining tool, OS/2, the Apple Newton, and Microsoft Money are but a few of the many examples.
Working with a reputable vendor is smart. Gathering requirements, reviewing product features, and determining the best match creates the opportunity for developing a client/vendor partnership. So why would anyone throw all of that out and just decide to pick a vendor? I guess lots of folks thought that Bernie Madoff was their partner. Need I say more?
Photo by xJasonRogersx via Flickr (Creative Commons License)
MDM Can Challenge Traditional Development Paradigms

I’ve been making the point for the past several years that master data management (MDM) development projects are different and are accompanied by unique challenges. Because of the “newness” of MDM and its unique value proposition, MDM development can challenge traditional IT development assumptions.
MDM is very much a transactional processing system; it receives application requests, processes them, and returns a result. The complexities of transaction management, near real-time processing, and the details associated with security, logging, and application interfaces are a handful. Most OLTP applications assume that the provided data is usable; if the data is unacceptable, the application simply returns an error. Most OLTP developers are accustomed to addressing these types of functional requirements. Dealing with imperfect data has traditionally been unacceptable because it slowed down processing; ignoring it or returning an error was a best practice.
What’s different about MDM development is the focus on data content (and value-based) processing. The whole purpose of MDM is to deal with all data, including the unacceptable stuff; it can’t assume that the data is good enough. MDM code assumes the data is complex and “unacceptable” and focuses on figuring out the values. The development methods associated with deciphering, interpreting, or decoding unacceptable data to make it usable are very different. They require a deep understanding of a different type of business rule: rules associated with data content. Because most business processes have data inputs and data outputs, there can be dozens of data content rules associated with each business process. Traditionally, OLTP developers didn’t focus on data content rules; they were focused on automating business processes.
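As a small illustration (with hypothetical attributes and codes), a data content rule constrains the values themselves rather than the process that produced them:

```python
from datetime import date

VALID_STATUS_CODES = {"ACTIVE", "INACTIVE", "PENDING"}  # assumed reference list

def content_rule_violations(record):
    """Check the values themselves, independent of the process that created them."""
    problems = []
    if record.get("status") not in VALID_STATUS_CODES:
        problems.append(f"status '{record.get('status')}' is not a recognized code")
    if record.get("birth_date") and record.get("hire_date"):
        if record["birth_date"] >= record["hire_date"]:
            problems.append("birth_date must precede hire_date")
    return problems

employee = {"status": "Actv", "birth_date": date(1990, 4, 2), "hire_date": date(2015, 6, 1)}
print(content_rule_violations(employee))  # one violation: unrecognized status code
```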
MDM developers need to be comfortable addressing the various data content processing issues (identification, matching, survivorship, etc.) along with the well-understood issues of OLTP development (transaction management, high performance, etc.). We’ve learned that the best MDM development environments invest heavily in data analysis and data management during the initial design and development stages. They invest in profiling and analyzing each system of creation. They also differentiate hub development from source on-boarding and hub administration. The team that focuses on application interfaces, CRUD processing, and transaction and bulk processing requires different skills than the developers focused on match processing rules, application on-boarding, and hub administration. The developers focused on hub construction are different from the team members focused on the data changes and value questions coming from data stewards and application developers. This isn’t about differentiating development from maintenance; this is about differentiating the skills associated with the various development activities.
If the MDM team does its job right, it can dramatically reduce the data errors that cause application processing and reporting problems. It can identify and quantify data problems so that other development teams can recognize them, too. This is why MDM development is critical to creating the single version of truth.
Image via cafepress.com.
MDM Streamlines the Supply Chain
I’ve always been a little jealous of ERP development teams. They operate on the premise that you have to standardize business processes across the enterprise. Every process feeds another process until the work is done. There are no custom processes: if you suddenly modify a business process there are upstream and downstream dependencies. Things could break.
We don’t have that luxury when we build MDM solutions for our clients. This was on my mind this past week when I was teaching my “Change Management for MDM” class in Las Vegas. The fact is that business people constantly add and modify their data. What’s important is that a consistent method exists for capturing and remediating these changes. The whole premise of MDM is that reference data changes all the time. Values are added, changed, and removed.
Let’s take the poster-child-du-jour, Toyota. Toyota has already announced that it will stop manufacturing its FJ Cruiser model in a few years. In the interest of its dealers, repair facilities, and after-market parts retailers, Toyota will need to get out in front of this change. There are catalogs to be modified, inventories to sell off, and cars to move. Likewise MDM environments can deal with data changes in advance. The hub needs to be prepared to respond to and support data changes at the right time.
We work with a retailer that is constantly changing its merchandise with fluctuating purchase patterns and seasons. Adding spring merchandise to the inventory means new SKUs, new prices, and changes in product availability. Not every staff member in every store can anticipate all these changes. Neither can the developers of the myriad operational systems. But with MDM they don’t have to keep up with all the new merchandise. The half-dozen applications that deal with inventory details can leverage the MDM hub as a clearing house of detailed changes, allowing those changes to be deployed in a scheduled manner according to the business calendar.
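Here's a sketch of that clearing-house idea: the hub holds reconciled merchandise changes with effective dates, and each subscribing application pulls the same view according to the business calendar. The SKUs, dates, and system names are made up for illustration.

```python
from datetime import date

# Reconciled changes published once by the hub, with business-calendar effective dates.
pending_changes = [
    {"sku": "SPRING-DRESS-0412", "action": "ADD",    "effective": date(2024, 3, 1)},
    {"sku": "WINTER-COAT-0087",  "action": "RETIRE", "effective": date(2024, 3, 15)},
]

subscribers = ["POS", "INVENTORY", "ECOMMERCE", "CATALOG"]

def changes_effective_on(as_of):
    return [c for c in pending_changes if c["effective"] <= as_of]

# Each subscribing application pulls the same, already-reconciled view of changes.
for system in subscribers:
    for change in changes_effective_on(date(2024, 3, 1)):
        print(f"{system}: {change['action']} {change['sku']}")
```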
No more developers having to understand the details of hundreds of product categories and subcategories. No more one-off discussions between stores and suppliers. No more intensive manual work to change suppliers or substitute merchandise. No more updating POS systems with custom code. With MDM it’s all transparent to the applications—and to the people who use them.
Our most successful MDM engagements have confirmed what many of our clients already suspected but could never prove: that there are far more consumers of data than they knew. MDM formalizes the processes to ensure that data changes can scale to escalating volumes. It automates the communication of changes to the business areas and individuals who need to know about those changes, without needing to know each individual change.
With spring, shoppers may be thinking about new Easter outfits, gourmet items, or children’s clothes. But suppliers think about trucking capacity. Store managers can anticipate shelf and floor space requirements. Finance staff can prepare for potential product returns. Distribution center staff can allocate warehouse space. You can’t know everyone who needs the information. But the supply chain can become incredibly flexible and streamlined as a result of MDM.
And—okay, this makes me feel much better—it doesn’t even matter whether you have ERP or not!
Note: Evan will be presenting The Five Levels of MDM (and Data Governance!) Maturity next week at TDWI’s Master Data Quality and Governance Solutions Summit in Savannah, Georgia. The event is sold-out, so if you were lucky enough to get in, please stop by and say hello!
Photo by Rennett Stowe via Flickr (Creative Commons License)
MDM: Subject-Area Data Integration
I frequently describe MDM as subject area data integration. The whole point of mastering and managing data is to simplify data sharing, since confusion only occurs when you have two or more instances of data and they don’t match. It’s important to realize that mastering data isn’t really necessary if you only have a single system containing one copy of the data. After all, how much confusion or misunderstanding can occur when there’s only one copy? The challenge in making data more usable and easier to understand exists because most companies have multiple application systems, each with its own copy of data (and its own “version of truth”). MDM’s promise is to deliver a single view of subject area data. In our book, Customer Data Integration: Reaching a Single Version of the Truth (John Wiley & Sons, 2006), Jill Dyché and I defined MDM as:
“The set of disciplines and methods to ensure the currency, meaning, and quality of a company’s reference data that is shared across various systems and organizations.”
As companies have grown, so too have the number of systems that require access to each other’s data. This is why data integration has become one of the largest custom development activities undertaken within an IT organization. It’s rare that all systems (and their developers) integrate data the same way. While there may be rigor within an individual application or system, it’s highly unlikely that all systems manipulate an individual subject area in a consistent fashion. This lack of integrity and consistency becomes visible when information on two different systems conflicts. MDM isn’t a silver bullet for this problem; it is a method to address data problems one subject area at a time.
The reason for establishing a boundary around a subject area is that the complexity, rules, and usage of data within most organizations tend to differ by subject area. Examples of subject areas include customer, product, and supplier. There can be literally dozens, if not hundreds, of subject areas within any given company.
Figure 1: Different Data Subject Areas
Do you need to master every subject area? Probably not. MDM projects focus on the subject areas that suffer the most from inaccuracies, mistakes, and misunderstandings: for instance, customers with inaccurate identification numbers, products missing descriptive information, or an employee with an inaccurate start date. The idea behind master data management is to establish rules, guidelines, and rigor for subject area data.
The rules associated with identifying a customer are typically well defined within a company. The rules associated with adding a new product to the sales catalog are also well defined. The thing to keep in mind is that the rules associated with product have nothing to do with customers. Additionally, most companies have rules that limit what customer data can be modified. They also have rules that restrict how product information can be manipulated. The idea behind MDM is to manage these rules and methods in a manner where all application systems manipulate reference data in a consistent way.
Implementing MDM isn’t just about building and deploying a server that contains the “master list” of reference data; that’s the easy part. MDM’s real challenge is integrating the functionality into the multitude of application systems that exist within a company. The idea is that when a new customer is added, all systems are aware of the change and have equal access to that data.
For instance, one of the most universal challenges in business today is managing a customer’s marketing preferences. When a customer asks to opt out of all marketing communications, it’s important that all systems are aware of this choice. Problems typically occur when a particular data element can be modified from multiple different locations (e.g., a web page, an 800 number, or even the US Postal Service). MDM provides the mechanism for ensuring that the master data is managed correctly and that all systems become aware of the change (and the new data) in a manner that supports the business’s needs.
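A minimal sketch of that propagation pattern, assuming hypothetical subscribing systems: the mastered preference is changed once, and every registered system is notified of the same value.

```python
subscribers = []

def subscribe(callback):
    subscribers.append(callback)

def update_marketing_opt_out(customer_id, opted_out):
    # The mastered preference changes once; every registered system is notified
    # so the web page, call center, and mail channels all see the same value.
    for notify in subscribers:
        notify(customer_id, opted_out)

subscribe(lambda cid, flag: print(f"EMAIL_SYSTEM: customer {cid} opt-out={flag}"))
subscribe(lambda cid, flag: print(f"CALL_CENTER: customer {cid} opt-out={flag}"))
subscribe(lambda cid, flag: print(f"DIRECT_MAIL: customer {cid} opt-out={flag}"))

update_marketing_opt_out("M-001", True)
```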
MDM and M&A
A lot of our new clients have asked us to build MDM business cases to support their merger and acquisition strategies. Specifically, they’re looking to support the following four activities:
- Recent corporate mergers
- Acquisitions
- Reorganizations
- Spin-offs
Collectively, these activities can roll up into a category called corporate restructuring. Contrary to popular belief, restructuring isn’t just a financial challenge. It includes realignment of marketing activities (for instance, reconciling promotions and re-aligning diverse product sets), sales (reorganizing territories and compensation plans), and operational issues (company locations, product inventories).
Most companies approach restructuring as a one-time-only activity in which an army of analysts tries to reconcile financial structures, from organizational hierarchies to budgets to the accounts themselves. The fact is, these activities aren’t just part of high-profile M&A events; they occur every year as companies go through their annual budget processes. During a corporate restructuring, the process usually takes longer than the acquisition itself.
Three principal MDM features lend themselves to this restructuring work: matching, grouping, and linking. MDM excels at matching “like” items from disparate sources, tracking and managing hierarchies and groupings, and linking disparate data sources to enable ongoing data integration. The point is that the act of merging organizations also means consolidating details across the companies. Most people consider this a one-time-only activity. The fact is, it must be an ongoing process.
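As a simple illustration of grouping and linking (with hypothetical account codes), the acquired company's accounts are linked to the acquirer's rollup hierarchy so results consolidate without touching the source systems:

```python
# Acquired-company GL accounts linked to the acquirer's rollup hierarchy (hypothetical codes).
account_links = {
    "ACQ-4100": "CORP-REVENUE-PRODUCT",
    "ACQ-4200": "CORP-REVENUE-SERVICES",
    "ACQ-6010": "CORP-EXPENSE-SALES",
}

acquired_balances = {"ACQ-4100": 1_250_000, "ACQ-4200": 310_000, "ACQ-6010": -480_000}

# Roll the acquired balances up through the maintained links, leaving source systems untouched.
rollup = {}
for account, balance in acquired_balances.items():
    corporate_node = account_links[account]
    rollup[corporate_node] = rollup.get(corporate_node, 0) + balance

print(rollup)
```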
When one company buys another, it’s typical to allow the acquired company to continue to operate using the same systems and methods it always has. The acquiring company simply needs to know how to integrate the information into their existing business. Consider Berkshire Hathaway. They acquire companies frequently, but don’t change how they run their business. They simply know how to reconcile and roll up the details.
Ideally, corporate restructuring means establishing a process to allow organizations to continue their operations using their existing systems. IT systems reconciliation simply cannot get in the way of running business operations. All too often, the answer is, “Replace their systems with ours.” This statement means that the new organization should reengineer its business. This simply takes too long.
MDM provides a company the capability to link the data content from disparate systems within and across companies. I’m not talking about linking Linux with Windows, I’m talking about matching and linking business content across dozens or even hundreds of systems. This way invoices continue going out, sales people continue getting commissions, and customers can still get product support in a seamless way.
Next time you’re discussing corporate restructuring and someone says the word “re-platform,” ask the question: “If we can link and move the data to continue to support core business processes, then we wouldn’t have to disrupt our operational systems, right?” Matching and linking the data across core systems can save a lot in terms of software and labor costs. But improving the data where it lies? Priceless.