This blog is the 2nd in a series focused on reviewing the individual Components of a Data Strategy. This edition discusses the concept of data provisioning and the various details of making data sharable.
The definition of Provision is:
“Supplying data in a sharable form while respecting all rules and access guidelines”
One of the biggest frustrations that I have in the world of data is that few organizations have established data sharing as a responsibility. Even fewer have setup the data to be ready to share and use by others. It’s not uncommon for a database programmer or report developer to have to retrieve data from a dozen different systems to obtain the data they need. And, the data arrives in different formats and files that change regularly. This lack of consistency generates large ongoing maintenance costs and requires an inordinate amount of developer time to re-transform, prepare, fix data to be used (numerous studies have found that ongoing source data maintenance can take as much of 50% of the database developers time after the initial programming effort is completed).
Should a user have to know the details (or idiosyncrasies) of the application system that created the data to use the data? (That’s like expecting someone to understand the farming of tomatoes and manufacturing process of ketchup in order to be able to put ketchup on their hamburger). The idea of Provision is to establish the necessary rigor to simplify the sharing of data.
I’ve identified 5 of the most common facets of data sharing in the illustration above – there are others. As a reminder (from last week’s blog), each facet should be considered individually. And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely to want to review the different options for each facet. Each facet can target a small organization’s issues or expand to address a diverse enterprise’s needs.
This is the most obvious aspect of provisioning: structuring and formatting the data in a clear and understandable manner to the data consumer. All too often data is packaged at the convenience of the developer instead of the convenience of the user. So, instead of sharing data as a backup file generated by an application utility in a proprietary (or binary) format, the data should be formatted so every field is labeled and formatted (text, XML) for a non-technical user to access using easily available tools. The data should also be accompanied with metadata to simplify access.
This facet works with Packaging and addresses the details associated with the data container. Data can be shared via a file, a database table, an API, or one of several other methods. While sharing data in a programmer generated file is better than nothing, a more effective approach would be to deliver data in a well-known file format (such as Excel) or within a table contained in an easily accessible database (e.g. data lake or data warehouse).
Source data stewardship is critical in the sharing of data. In this context, a Source Data Steward is someone that is responsible for supporting and maintaining the shared data content (there several different types of data stewards). In some companies, there’s a data steward responsible for the data originating from an individual source system. Some companies (focused on sharing enterprise-level content) have positioned data stewards to support individual subject areas. Regardless of the model used, the data steward tracks and communicates source data changes, monitors and maintains the shared content, and addresses support needs. This particular role is vital if your organization is undertaking any sort of data self-service initiative.
This item addresses the issues that are common in the world of electronic data sharing: inconsistency, change, and error. Acceptance checking is a quality control process that reviews the data prior to distribution to confirm that it matches a set of criteria to ensure that all downstream users receive content as they expect. This item is likely the easiest of all details to implement given the power of existing data quality and data profiling tools. Unfortunately, it rarely receives attention because of most organization’s limited experience with data quality technology.
In order to succeed in any sort of data sharing initiative, whether in supporting other developers or an enterprise data self-service initiative, it’s important to identify the audience that will be supported. This is often the facet to consider first, and it’s valuable to align the audience with the timeframe of data sharing support. It’s fairly common to focus on delivering data sharing for developers support first followed by technical users and then the large audience of business users.
In the era of “data is a business asset” , data sharing isn’t a courtesy, it’s an obligation. Data sharing shouldn’t occur at the convenience of the data producer, it should be packaged and made available for the ease of the user.
IT organizations have spent enormous sums of money over the past 10-15 years attacking productivity. They’ve acquiring data integration tools, implemented improved development methodologies, and even reengineered requirements gathering methods to ensure business priority alignment. And the result of all of this investment? Today’s data integration developers are easily 10x to 20x more productive than the COBOL programmers of the past. This shouldn’t be a surprise to anyone – writing, compiling, linking, and testing 3rd generation code is much slower than today’s GUI-based, drag-and-drop development tools. The tools work; developers are faster, quicker, and better.
So, why does it still seem to take an eternity and cost a fortune to acquire and integrate new data into an existing report? The bottleneck has moved upstream: finding and extracting source data is complicated and time consuming. We’ve invested in our Integration Competency Centers to create an assembly line to streamline the process of transforming and converting data that is loaded into databases or applications. Unfortunately, we’ve not devoted any effort in simplifying access or understanding the actual raw source data that feeds the assembly line.
Henry Ford didn’t invent the assembly line, he revolutionized it. One of the changes that he introduced to the assembly line was simplifying and standardizing parts and the actual assembly process. Prior to Ford’s assembly line, car assembly was a custom effort that required highly trained craftsmen to shape, tool, and fit parts by hand (in a very time consuming process). The parts weren’t always uniform, so the craftsmen had to spend a significant amount of time fitting the parts together.
In most IT environments, source system access and data content varies across the different application systems dramatically. This forces developers to become data craftsmen in order to deal with the data idiosyncrasies associated with the numerous source systems common to most companies. Every system stores data in a custom and unique manner; it takes a lot of time to search and analyze source system data in order to identify the necessary content. (A popular ERP package stores its details in more than 10,000 tables) So, each new request often requires developers to create “from scratch” code to access and manipulate new data from a source system. If you dig a bit, you’ll probably find that many of your application systems generate dozens or hundreds (yes, hundreds) of custom extracts to deliver data to support the various production business needs within your company.
While most folks might think that custom extracts are a reasonably decent solution, they’re not. In fact, they’re a problem that will only get worse with time. (Remember, every extract requires development time and ongoing support.) You’ll be better off consolidating all of those extracts into a single set that includes all of the data. This will reduce processing time, reduce storage, reduce maintenance, and ultimately save a lot of money. You’ll have to spend some time designing and building these new extracts and getting folks to migrate to using them, but the benefits will be significant. (One of my clients was able to defer a platform upgrade due to the CPU and storage reduction caused by the consolidation and removal of all of the custom extracts).
Standardizing source data to reduce the data craftsmen problem isn’t rocket science, but it’s more than simply creating a data dump or generating a backup file. You need to deliver data in a manner that can be quickly and easily consumed by other systems. This means that the content needs to be reformatted from the unique (sometimes indecipherable) format of the host application into a format that everyone else can use. This can be easily addressed by delivering data into database tables or flat files (I know one client that delivers data in tab delimited spreadsheet format). The data should reflect the values generated by the source system in a format that everyone can understand – the content shouldn’t be modified for cleansed (this is source data, not content ready for business consumption). Delivery should occur in a frequent and regular basis along with a plan for archiving a decent amount of history.
This isn’t a new concept; this was a common approach in the days when custom coded IBM mainframe applications were all the rage. Back then, data sharing was a priority and every application generated standard extracts to reduce I/O and storage costs. There was also an extreme sensitivity to developer time. Requesting a custom extract was frowned upon and rarely approved. Finding and accessing the data was as simple as referencing the extract files that were made available from every application system.
When it comes to improving the delivery speed of new data to business users, maybe we can learn something from Henry Ford and the world of mainframe development.
I’ve always been a little jealous of ERP development teams. They operate on the premise that you have to standardize business processes across the enterprise. Every process feeds another process until the work is done. There are no custom processes: if you suddenly modify a business process there are upstream and downstream dependencies. Things could break.
We don’t have that luxury when we build MDM solutions for our clients. This was on my mind this past week when I was teaching my “Change Management for MDM” class in Las Vegas. The fact is that business people constantly add and modify their data. What’s important is that a consistent method exists for capturing and remediating these changes. The whole premise of MDM is that reference data changes all the time. Values are added, changed, and removed.
Let’s take the poster-child-du-jour, Toyota. Toyota has already announced that it will stop manufacturing its FJ Cruiser model in a few years. In the interest of its dealers, repair facilities, and after-market parts retailers, Toyota will need to get out in front of this change. There are catalogs to be modified, inventories to sell off, and cars to move. Likewise MDM environments can deal with data changes in advance. The hub needs to be prepared to respond to and support data changes at the right time.
We work a retailer that is constantly changing its merchandise with fluctuating purchase patterns and seasons. Adding spring merchandise to the inventory means new SKUs, new prices, and changes in product availability. Not every staff member in every store can anticipate all these new changes. Neither can the developers of the myriad operational systems. But with MDM they don’t have to keep up with all the new merchandise. The half-dozen applications that deal with inventory details can leverage the MDM hub as a clearing house of detailed changes, allowing them to be deployed in a scheduled manner according to the business calendar.
No more developers having to understand the details of hundreds of product categories and subcategories. No more one-off discussions between stores and suppliers. No more intensive manual work to change suppliers or substitute merchandise. No more updating POS systems with custom code. With MDM it’s all transparent to the applications—and to the people who use them.
Our most successful MDM engagements have confirmed what many of our clients already suspected but could never prove: that there are far more consumers of data than they knew. MDM formalizes the processes to ensure that data changes can scale to escalating volumes. It automates the communication of changes to the business areas and individuals who need to know about those changes, without needing to know each individual change.
With spring, shoppers may be thinking about new Easter outfits, gourmet items, or children’s clothes. But suppliers think about trucking capacity. Store managers can anticipate shelf and floor space requirements. Finance staff can prepare for potential product returns. Distribution center staff can allocate warehouse space. You can’t know everyone who needs the information. But the supply chain can become incredibly flexible and streamlined as a result of MDM.
And—okay, this makes me feel much better—it doesn’t even matter whether you have ERP or not!
Note: Evan will be presenting The Five Levels of MDM (and Data Governance!) Maturity next week at TDWI’s Master Data Quality and Governance Solutions Summit in Savannah, Georgia. The event is sold-out, so if you were lucky enough to get in, please stop by and say hello!
Photo by Rennett Stowe via Flickr (Creative Commons License)