Because the idea of building a data strategy is a fairly new concept in the world of business and information technology (IT), there’s a fair amount of discussion about the pieces and parts that comprise a Data Strategy. Most IT organizations have invested heavily in developing plans to address platforms, tools, and even storage. Those IT plans are critical in managing systems and capturing and retaining content generated by a company’s production applications. Unfortunately, those details don’t typically address all of the data activities that occur after an application has created and processed data from the initial business process. The reasons that folks take on the task of developing a Data Strategy is because of the challenges in finding, identifying, sharing, and using data. In any company, there are numerous roles and activities involved in delivering data to support business processing and analysis. A successful Data Strategy must support the breadth of activities necessary to ensure that data is “ready to use”.
There are five core components in a data strategy that work together as building blocks to address the various details necessary to comprehensively support the management and usage of data.
Identify The ability to identify data and understand its meaning regardless of its structure, origin, or location.
This concept is pretty obvious, but it’s likely one of the biggest obstacles in data usage and sharing. All too often, companies have multiple and different terms for specific business details (customer: account, client, patron; income: earnings, margin, profit). In order to analyze, report, or use data, people need to understand what it’s called and how to identify it. Another aspect of Identify is establishing the representation of the data’s value (Are the company’s geographic locations represented by name, number, or an abbreviation?) A successful Data Strategy would identify the gaps and needs in this area and identify the necessary activities and artifacts required to standardize data identification and representation.
Provision Enabling data to be packaged and made available while respecting all rules and access guidelines.
Data is often shared or made available to others at the convenience of the source system’s developers. The data is often accessible via database queries or as a series of files. There’s rarely any uniformity across systems or subject areas, and usage requires programming level skills to analyze and inventory the contents of the various tables or files. Unfortunately, the typical business person requiring data is unlikely to possess sophisticated programming and data manipulation skills. They don’t want raw data (that reflects source system formats and inaccuracies), they want data that is uniformly formatted and documented that is ready to be added to their analysis activities.
The idea of Provision is to package and provide data that is “ready to use”. A successful Data Strategy would identify the various data sharing needs and identify the necessary methods, practices, and tooling required to standardize data packaging and sharing.
Store Persisting data in a structure and location that supports access and processing across the enterprise.
Most IT organizations have solid plans for addressing this area of a Data Strategy. It’s fairly common for most companies to have a well-defined set of methods to determine the platform where online data is stored and processed, how data is archived for disaster recovery, and all of the other details such as protection, retention, and monitoring.
As the technology world has evolved, there are other facets of this area that require attention. The considerations include managing data distributed across multiple locations (the cloud, premise systems, and even multiple desktops), privacy and protection, and managing the proliferation of copies. With the emergence of new consumer privacy laws, it’s risky to store multiple copies of data, and it’s become necessary to track all existing copies of content. A successful Data Strategy ensures that any created data is always available for future access without requiring everyone to create their own copy.
Process Standardizing, combining, and moving data residing in multiple locations and providing a unified view.
It’s no secret that data integration is one of the more costly activities occurring within an IT organization; nearly 40% of the cost of new development is consumed by data integration activities. And Process isn’t limited to integration, it also includes correcting, standardizing, and formatting the content to make it “ready to use”.
With the growth of analytics and desktop decisioning making, the need to continually analyze and include new data sets into the decision-making process has exploded. Processing (or preparing or wrangling) data is no longer confined to the domain of the IT organization, it has become an end user activity. A successful Data Strategy had to ensure that all users can be self-sufficient in their abilities to process data.
Govern Establishing and communicating information rules, policies, and mechanisms to ensure effective data usage.
While most organizations are quick to identify their data as a core business asset, few have put the necessary rigor in place to effectively manage data. Data Governance is about establishing rules, policies, and decision mechanisms to allow individuals to share and use data in a manner that respects the various (legal and usage) guidelines associated with that data. The inevitable challenge with Data Governance is adoption by the entire data supply chain – from application developers to report developers to end users. Data Governance isn’t a user-oriented concept, it’s a data-oriented concept. A successful Data Strategy identifies the rigor necessary to ensure a core business asset is managed and used correctly.
The 5 Components of a Data Strategy is a framework to ensure that all of a company’s data usage details are captured and organized and that nothing is unknowingly overlooked. A successful Data Strategy isn’t about identifying every potential activity across the 5 different components. It’s about making sure that all of the identified solutions to the problems in accessing, sharing, and using data are reviewed and addressed in a thorough manner.
I was recently asked about my opinion for the potential of Hadoop replacing a company’s data warehouse (DW). While there’s lots to be excited about when it comes to Hadoop, I’m not currently in the camp of folks that believe it’s practical to use Hadoop to replace a company’s DW. Most corporate DW systems are based on commercial relational database products and can store and manage multiple terabytes of data and support hundreds (if not thousands) of concurrent users. It’s fairly common for these systems to handle complex, mixed workloads –queries processing billions of rows across numerous tables along with simple primary key retrieval requests all while continually loading data. The challenge today is that Hadoop simply isn’t ready for this level of complexity.
All that being said, I do believe there’s a huge opportunity to use Hadoop to replace a significant amount of processing that is currently being handled by most DWs. Oh, and data warehouse user won’t be affected at all.
Let’s review a few fundamental details about the DW. There’s two basic data processing activities that occur on a DW: query processing and transformation processing. Query processing is servicing the SQL that’s submitted from all of the tools and applications on the users’ desktops, tablets, and phones. Transformation processing is the workload involved with converting data from their source application formats to the format required by the data warehouse. While the most visible activity to business users is query processing, it is typically the smaller of the two. Extracting and transforming the dozens (or hundreds) of source data files for the DW is a huge processing activity. In fact, most DWs are not sized for query processing; they are sized for the daily transformation processing effort.
It’s important to realize that one of the most critical service level agreements (SLAs) of a DW is data delivery. Business users want their data first thing each morning. That means the DW has to be sized to deliver data reliably each and every business morning. Since most platforms are anticipated to have a 3+ year life expectancy, IT has to size the DW system based on the worst case data volume scenario for that entire period (end of quarter, end of year, holidays, etc.) This means the DW is sized to address a maximum load that may only occur a few times during that entire period.
This is where the opportunity for Hadoop seems pretty obvious. Hadoop is a parallel, scalable framework that handles distributed batch processing and large data volumes. It’s really a set of tools and technologies for developers, not end users. This is probably why so many ETL (extract, transformation, and load) product vendors have ported their products to execute within a Hadoop environment. It only makes sense to migrate processing from a specialized platform to commodity hardware. Why bog down and over invest in your DW platform if you can handle the heavy lifting of transformation processing on a less expensive platform?
Introducing a new system to your DW environment will inevitably create new work for your DW architects and developers. However, the benefits are likely to be significant. While some might view such an endeavor as a creative way to justify purchasing new hardware and installing Hadoop, the real reason is to extend the life of the data warehouse (and save your company a bunch of money by deferring a DW upgrade)
I read an interesting tidbit about data the other day: the United States Postal Service processed more than 47 million changes of addresses in the last year. That’s nearly 1 in 6 people. In the world of data, that factoid is a simple example of the challenge of addressing stale data and data quality. The idea of stale data is that as data ages, its accuracy and associated business rules can change.
There’s lots of examples of how data in your data warehouse can age and degrade in accuracy and quality: people move, area codes change, postal/zip codes change, product descriptions change, and even product SKUs can change. Data isn’t clean and accurate forever; it requires constant review and maintenance. This shouldn’t be much of a surprise for folks that view data as a corporate asset; any asset requires ongoing maintenance in order to retain and ensure its value. The challenge with maintaining any asset is establishing a reasonable maintenance plan.
Unfortunately, while IT teams are exceptionally strong in planning and carrying out application maintenance, it’s quite rare that data maintenance gets any attention. In the data warehousing world, data maintenance is typically handled in a reactive, project-centric manner. Nearly every data warehouse (or reporting) team has to deal with data maintenance issues whenever a company changes major business processes or modifies customer or product groupings (e.g. new sales territories, new product categories, etc.) This happens so often, most data warehouse folks have even given it a name: Recasting History. Regardless of what you call it, it’s a common occurrence and there are steps that can be taken to simplify the ongoing effort of data maintenance.
- Establish a regularly scheduled data maintenance window. Just like the application maintenance world, identify a window of time when data maintenance can be applied without impacting application processing or end user access
- Collect and publish data quality details. Profile and track the content of the major subject area tables within your data warehouse environment. Any significant shift in domain values, relationship details, or data demographics can be discovered prior to a user calling to report an undetected data problem
- Keep the original data. Most data quality processing overwrites original content with new details. Instead, keep the cleansed data and place the original values at the end of your table records. While this may require a bit more storage, it will dramatically simplify maintenance when rule changes occur in the future
- Add source system identification and creation date/time details to every record. While this may seem tedious and unnecessary, these two fields can dramatically simplify maintenance and trouble shooting in the future
- Schedule a regular data change control meeting. This too is similar in concept to the change control meeting associated with IT operations teams. This is a forum for discussing data content issues and changes
Unfortunately, I often find that data maintenance is completely ignored. The problem is that fixing broken or inaccurate data isn’t sexy; developing a data maintenance plan isn’t always fun. Most data warehouse development teams are buried with building new reports, loading new data, or supporting the ongoing ETL jobs; they haven’t given any attention to the quality or accuracy of the actual content they’re moving and reporting. They simply don’t have the resources or time to address data maintenance as a proactive activity.
Business users clamor for new data and new reports; new funding is always tied to new business capabilities. Support costs are budgeted, but they’re focused on software and hardware maintenance activities. No one ever considers data maintenance; it’s simply ignored and forgotten.
Interesting that we view data as a corporate asset – a strategic corporate asset – and there’s universal agreement that hardware and software are simply tools to support enablement. And where are we investing in maintenance? The commodity tools, not the strategic corporate asset.
Photo courtesy of DesignzillasFlickr via Flickr (Creative Commons license).
In one of my previous blogs, I wrote about Data Virtualization technology — one of the more interesting pieces of middleware technology that can simplify data management. While most of the commercial products in this space share a common set of features and functions, I thought I’d devote this blog to discussing the more advanced features. There are quite a few competing products; the real challenge in differentiating the products is to understand their more advanced features.
The attraction of data virtualization is that it simplifies data access. Most IT shops have one of everything – and this includes several different brands of commercial DBMSs, a few open source databases, a slew of BI/reporting tools, and the inevitable list of emerging and specialized tools and technologies (Hadoop, Dremel, Casandra, etc.) Supporting all of the client-to-server-to-repository interfaces (and the associated configurations) is both complex and time consuming. This is why the advanced capabilities of Data Virtualization have become so valuable to the IT world.
The following details aren’t arranged in any particular order. I’ve identified the ones that I’ve found to be the most valuable (and interesting). Let me also acknowledge not every DV product supports all of these features.
Intelligent data caching. Repository-to-DV Server data movement is the biggest obstacle in query response time. Most DV products are able to support static caching to reduce repetitive data movement (data is copied and persisted in the DV Server). Unfortunately, this approach has limited success when there are ad hoc users accessing dozens of sources and thousands of tables. The more effective solution is for the DV Server to monitor all queries and dynamically cache data based on user access, query load, and table (and data) access frequency.
Query optimization (w/multi-platform execution). While all DV products claim some amount of query optimization, it’s important to know the details. There are lots of tricks and techniques; however, look for optimization that understands source data volumes, data distribution, data movement latency, and is able to process data on any source platform.
Support for multiple client Interfaces. Since most companies have multiple database products, it can be cumbersome to support and maintain multiple client access configurations. The DV server can act as a single access point for multiple vendor products (a single ODBC interface can replace drivers for each DBMS brand). Additionally, most DV Server drivers support multiple different access methods (ODBC, JDBC, XML, and web services).
Attribute level or value specific data security. This feature supports data security at a much lower granularity than is typically available with most DBMS products. Data can be protected (or restricted) at individual column values for entire table or selective rows.
Metadata tracking and management. Since Data Virtualization is a query-centric middleware environment, it only makes sense to position this server to retrieve, reconcile, and store metadata content from multiple, disparate data repositories.
Data lineage. This item works in tandem with the metadata capability and augments the information by retaining the source details for all data that is retrieved. This not only includes source id information for individual records but also the origin, creation date, and native attribute details.
Query tracking for usage audit. Because the DV Server can act as a centralized access point for user tool access, there are several DV products that support the capture and tracking of all submitted queries. This can be used to track, measure, and analyze end user (or repository) access.
Workflow linkage and processing. This is the ability to execute predefined logic against specific data that is retrieved. While this concept is similar to a macro or stored procedure, it’s much more sophisticated. It could include the ability to direct job control or specialized processing against an answer set prior to delivery (e.g. data hygiene, external access control, stewardship approval, etc.)
Packaged Application Templates. Most packaged applications (CRM, ERP, etc.) contain thousands of tables and columns that can be very difficult to understand and query. Several DV vendors have developed templates containing predefined DV server views that access the most commonly queried data elements.
Setup and Configuration Wizards. Configuring a DV server to access the multiple data sources can be a very time consuming exercise; the administrator needs to define and configure every source repository, the underlying tables (or files), along with the individual data fields. To simplify setup, a configuration wizard reviews the dictionary of an available data source and generates the necessary DV Server configuration details. It further analyzes the table and column names to simplify naming conventions, joins, and data value conversion and standardization details.
Don’t be misled into thinking that Data Virtualization is a highly mature product space where all of the products are nearly identical. They aren’t. Most product vendors spend more time discussing their unique features instead of offering metrics about their their core features. It’s important to remember that every Data Virtualization product requires a server that retrieves and processes data to fulfill query requests. This technology is not a commodity, which means that details like setup/configuration time, query performance, and advanced features can vary dramatically across products. Benchmark and test drive the technology before buying.
IT organizations have spent enormous sums of money over the past 10-15 years attacking productivity. They’ve acquiring data integration tools, implemented improved development methodologies, and even reengineered requirements gathering methods to ensure business priority alignment. And the result of all of this investment? Today’s data integration developers are easily 10x to 20x more productive than the COBOL programmers of the past. This shouldn’t be a surprise to anyone – writing, compiling, linking, and testing 3rd generation code is much slower than today’s GUI-based, drag-and-drop development tools. The tools work; developers are faster, quicker, and better.
So, why does it still seem to take an eternity and cost a fortune to acquire and integrate new data into an existing report? The bottleneck has moved upstream: finding and extracting source data is complicated and time consuming. We’ve invested in our Integration Competency Centers to create an assembly line to streamline the process of transforming and converting data that is loaded into databases or applications. Unfortunately, we’ve not devoted any effort in simplifying access or understanding the actual raw source data that feeds the assembly line.
Henry Ford didn’t invent the assembly line, he revolutionized it. One of the changes that he introduced to the assembly line was simplifying and standardizing parts and the actual assembly process. Prior to Ford’s assembly line, car assembly was a custom effort that required highly trained craftsmen to shape, tool, and fit parts by hand (in a very time consuming process). The parts weren’t always uniform, so the craftsmen had to spend a significant amount of time fitting the parts together.
In most IT environments, source system access and data content varies across the different application systems dramatically. This forces developers to become data craftsmen in order to deal with the data idiosyncrasies associated with the numerous source systems common to most companies. Every system stores data in a custom and unique manner; it takes a lot of time to search and analyze source system data in order to identify the necessary content. (A popular ERP package stores its details in more than 10,000 tables) So, each new request often requires developers to create “from scratch” code to access and manipulate new data from a source system. If you dig a bit, you’ll probably find that many of your application systems generate dozens or hundreds (yes, hundreds) of custom extracts to deliver data to support the various production business needs within your company.
While most folks might think that custom extracts are a reasonably decent solution, they’re not. In fact, they’re a problem that will only get worse with time. (Remember, every extract requires development time and ongoing support.) You’ll be better off consolidating all of those extracts into a single set that includes all of the data. This will reduce processing time, reduce storage, reduce maintenance, and ultimately save a lot of money. You’ll have to spend some time designing and building these new extracts and getting folks to migrate to using them, but the benefits will be significant. (One of my clients was able to defer a platform upgrade due to the CPU and storage reduction caused by the consolidation and removal of all of the custom extracts).
Standardizing source data to reduce the data craftsmen problem isn’t rocket science, but it’s more than simply creating a data dump or generating a backup file. You need to deliver data in a manner that can be quickly and easily consumed by other systems. This means that the content needs to be reformatted from the unique (sometimes indecipherable) format of the host application into a format that everyone else can use. This can be easily addressed by delivering data into database tables or flat files (I know one client that delivers data in tab delimited spreadsheet format). The data should reflect the values generated by the source system in a format that everyone can understand – the content shouldn’t be modified for cleansed (this is source data, not content ready for business consumption). Delivery should occur in a frequent and regular basis along with a plan for archiving a decent amount of history.
This isn’t a new concept; this was a common approach in the days when custom coded IBM mainframe applications were all the rage. Back then, data sharing was a priority and every application generated standard extracts to reduce I/O and storage costs. There was also an extreme sensitivity to developer time. Requesting a custom extract was frowned upon and rarely approved. Finding and accessing the data was as simple as referencing the extract files that were made available from every application system.
When it comes to improving the delivery speed of new data to business users, maybe we can learn something from Henry Ford and the world of mainframe development.
Unless you’ve been hiding in a cave in the past year, you’ve probably heard of CEP (Complex Event Processing) or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology allows users to query and manipulate discrete pieces of information, like events and messages, in real-time—without being encumbered by a traditional database management system.
The analogy here is that if you can’t bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record? Furthermore, this lets me analyze the data right after its time of creation! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?
I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.
When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better approach to traditional ETL. CEP provides complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own process. This prevents the unnecessary or reckless copying of source application content.
There are many industries need a real-time view of customer activities. For instance in the gaming industry when a customer inserts her card into a slot machine, the casino wants to provide a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, to transform and standardize it, to load it into a table, only to make it available to a query to identify the best offer. In the world of CEP we’d simply query the initial message and make the best offer.
Many ETL tools already use query language constructs and operators to manipulate data. They typically require the data to be loaded into a database. The major vendors have evolved to an “ELT” architecture: to leverage the underlying database engine to address performance. Why not simply tackle the performance problem directly and bypass the database altogether?
The promise of CEP a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher performance and easier-to-use alternative. The interesting part will be seeing how long before companies emerge from their caves and adopt it.
photo by Orin Zebest via Flickr (Creative Commons license)
At the recent Gartner MDM Summit in Las Vegas I was approached at least a half a dozen times by people wondering what MDM vendor to choose. I gave my usual response, which was, “What are you trying to accomplish?”
Normally a (short) conversation ensues of functions, feeds and speeds, which then leads to my next question, “So, what are your priorities and decision criteria? The responses were all the same, and I have to admit that they surprised me.
“We know we need MDM, but our company hasn’t really decided what MDM is. Since we’re already a [Microsoft / IBM / SAP / Oracle / SAS] shop, we just thought we’d buy their product…so what do you think of their product?”
I find this type of question interesting and puzzling. Why would anyone blindly purchase a product because of the vendor, rather than focusing on needs, priorities, and cost metrics? Unless a decision has absolutely no risk or cost, I’m not clear how identifying a vendor before identifying the requirements could possibly have a successful outcome.
If I look in my refrigerator, not all my products have the same brand label. My taste, interests, and price tolerance vary based upon the product. My catsup comes from one company, my salad dressing comes from another, and I have about seven different types of mustard (long story). Likewise, my TV, DVD player, surround sound system, DVR, and even my remote control are all different brands. Despite the advertisers’ claims, no single company has the best feature set across all products. For those of you who are loyal to a single brand, you can stop reading now. I’m sure you think I’m nuts.
The fact is that different vendors have different strengths, and this causes their products to differ. Buyers of these products should focus on their requirements and needs, not the product’s functions and features. Somehow this type of logic seems to escape otherwise smart business people. A good decision can deliver enormous benefits to a company; a bad decision can deliver enormous benefits to a company’s competitors.
What other reason would there be for someone saying, “We’re a [vendor name here] shop?” Examples abound of vendors abandoning products. IBM’s Intelligent Miner data mining tool, OS/2, the Apple Newton, Microsoft Money are but a few of the many examples.
Working with a reputable vendor is smart. Gathering requirements, reviewing product features, and determining the best match creates the opportunity for developing a client/vendor partnership. So why would anyone throw all of that out and just decide to pick a vendor? I guess lots of folks thought that Bernie Madoff was their partner. Need I say more?
photo by xJasonRogersx via Flickr (Creative Common License)
I’ve been making the point in the past several years that master data management (MDM) development
projects are different, and are accompanied by unique challenges. Because of the “newness” of MDM and its unique value proposition, MDM development can challenge traditional IT development assumptions.
MDM is very much a transactional processing system; it receives application requests, processes them, and returns a result. The complexities of transaction management, near real-time processing, and the details associated security, logging, and application interfaces are a handful. Most OLTP applications assume that the provided data is usable; if the data is unacceptable, the application simply returns an error. Most OLTP developers are accustomed to addressing these types of functional requirements. Dealing with imperfect data has traditionally been unacceptable because it slowed down processing; ignoring it or returning an error was a best practice.
The difference about MDM development is the focus on data content (and value-based) processing. The whole purpose MDM is to deal with all data, including the unacceptable stuff. It assumes that the data is good enough. MDM code assumes the data is complex and “unacceptable” and focuses on figuring out the values. The development methods associated with deciphering, interpreting, or decoding unacceptable data to make it usable is very different. It requires a deep understanding of a different type of business rule – those associated with data content. Because most business processes have data inputs and data outputs, there can be dozens of data content rules associated with each business process. Traditionally, OLTP developers didn’t focus on the business content rules; they were focused on automating business processes.
MDM developers need to be comfortable with addressing the various data content processing issues (identification, matching, survivorship, etc.) along with the well understood issues of OLTP development (transaction management, high performance, etc.) We’ve learned that the best MDM development environments invest heavily in data analysis and data management during the initial design and development stages. They invest in profiling and analyzing each system of creation. They also differentiate hub development from source on-boarding and hub administration. The team that focuses on application interfaces, CRUD processing, and transaction & bulk processing requires different skills from those developers focused on match processing rules, application on-boarding, and hub administration. The developers focused on hub construction are different than those team members focused on the data changes and value questions coming from data stewards and application developers. This isn’t about differentiating development from maintenance; this is about differentiating the skills associated with the various development activities.
If the MDM team does its job right it can dramatically reduce the data errors that cause application processing and reporting problems. They can identify and quantify data problems so that other development teams can recognize them, too. This is why MDM development is critical to creating the single version of truth.
Image via cafepress.com.
In the motion picture industry, studios separate responsibilities for creating content from responsibilities for distributing content. The people who make the movies option the scripts, hire the talent, and film the scenes. The distributors of the films, on the other hand, figure out how to package and deploy the films. They need to know which theaters require 30 millimeter versus 70 millimeter formats, or even IMAX. They also deal with DVD packaging, including different international DVD formats. The industry understands the importance of having a supply chain that differentiates between the roles of content creation, content packaging, and distribution.
In IT we’re very quick to point to our operational systems as creators and owners of data. But maybe the solution is that IT establishes a functional team that’s responsible for data packaging and distribution, just like the movie industry.
Traditionally data formats and standards have fallen into the realm of the architecture team. Unfortunately this is typically a paper-only activity without teeth. A data distribution team wouldn’t focus on paperwork. They would be focused on data logistics, receiving content from the various source systems and packaging the data for consumption by other systems. This isn’t about implementing a specific platform to store or move data. It’s about active management of corporate data content.
One of the biggest development challenges is the hunting expedition that developers go on to find and acquire the data they need. Most aren’t aware of all their choices, let alone the optimal systems of record.
Currently every application, data mart, data warehouse, reporting system that needs data from another system follows a specific set of procedures to obtain that data. Each system requests different data formats, different delivery schedules, and different content. Everything is custom, there are few if any standards, and there are no economies of scale.
This will also unburden the various application teams from building and maintaining the never ending volume of custom extract requests. The only way to stop the madness is to compartmentalize content creation from data packaging and distribution. This means establishing a data supply chain that separates data creators from data distribution from consumers. Who knew IT infrastructure was just like the movies?
I frequently describe MDM as subject area data integration. The whole point of mastering and managing data is to simplify data sharing, since confusion only occurs when you have two or more instances of data and it doesn’t match. It’s important to realize that mastering data isn’t really necessary if you only have a single system that contains one copy of data. After all, how much confusion or misunderstanding can occur when there’s only one copy of data? The challenge in making data more usable and easy to understand exists because most companies have multiple application systems each with their own copy of data (and their own “version of truth”). MDM’s promise is to deliver a single view of subject area data. In our book, Customer Data Integration: Reaching a Single Version of the Truth (John Wiley & Sons, 2006), Jill Dyché and I defined MDM as:
“The set of disciplines and methods to ensure the currency, meaning, and quality of a company’s reference data that is shared across various systems and organizations.”
As companies have grown, so to have the number of systems that require access to each other’s data. This is why data integration has become one of the largest custom development activities undertaken within an IT organization. It’s rare that all systems (and their developers) integrate data the same way. While there may be rigor within an individual application or system, it’s highly unlikely that all systems manipulate an individual subject area in a consistent fashion. This lack of integrity and consistency becomes visible when information on two different systems conflict. MDM isn’t a silver bullet to address this problem. It is a method to address data problems one subject area at a time.
The reason for establishing a boundary around subject area is because the complexity, rules, and usage of data within most organization tend to differ by subject area. Examples of subject areas include customer, product, and supplier. There can be literally dozens if not hundreds subject areas within any given company.
Figure 1: Different Data Subject Areas
Do you need to master every subject area? Probably not. MDM projects focus on subject areas that suffer the most from inaccuracies, mistakes, and misunderstandings, for instance, customers with inaccurate identification numbers, products missing descriptive information, or an employee with an inaccurate start date. The idea behind master data management is to establish rules, guidelines, and rigor for subject areas data.
The rules associated with identifying a customer are typically well defined within a company. The rules associated with adding a new product to the sales catalog are also well defined. The thing to keep in mind is that the rules associated with product will have nothing to do with customers. Additionally, most companies have rules that limit what customer data can be modified. They also have rules that restrict how product information can be manipulated.. The idea behind MDM is to manage these rules and methods in a manner where all application systems manipulate reference data in a consistent way.
Implementing MDM isn’t just about building and deploying a server that contains the “master list” of reference data; that’s the easy part. MDM’s real challenge is integrating the functionality into the multitude of application systems that exist within a company. The idea is that when a new customer is added, all systems are aware of the change and have equal access to that data.
For instance, one of the most universal challenges in business today is managing a customer’s marketing preferences. When a customer asks to opt out of all marketing communications, it’s important that all systems are aware of this choice. Problems typically occur when a particular data element can be modified from multiple different locations (e.g., a web page, an 800 number, or even the US Postal Service). MDM provides the solution for ensuring that the master data is managed correctly and that all systems become aware of the change (and the new data) in a manner that supports the businesses needs.