This blog is 4th in a series focused on reviewing the individual Components of a Data Strategy. This edition discusses the component Assemble and the numerous details involved with sourcing, cleansing, standardizing, preparing, integrating, and moving the data to make it ready to use.
The definition of Assemble is:
“Cleansing, standardizing, combining, and moving data residing in multiple locations and producing a unified view”
In the Data Strategy context, Assemble includes all of the activities required to transform data from its host-oriented application context to one that is “ready to use” and understandable by other systems, applications, and users.
Most data used within our companies is generated from the applications that run the company (point-of-sale, inventory management, HR systems, accounting) . While these applications generate lots of data, their focus is on executing specific business functions; they don’t exist to provide data to other systems. Consequently, the data that is generated is “raw” in form; the data reflects the specific aspects of the application (or system of origin). This often means that the data hasn’t been standardized, cleansed, or even checked for accuracy. Assemble is all of the work necessary to convert data from a “raw” state to one that is ready for business usage.
I’ve identified 5 facets to consider when developing your Data Strategy that are commonly employed to make data “ready to use”. As a reminder (from the initial Data Strategy Component blog), each facet should be considered individually. And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely want to consider different options for each. Each facet can target a small organization’s issues or expand to focus on a large company’s diverse needs.
Identification and Matching
Data integration is one of the most prevalent data activities occurring within a company; it’s a basic activity employed by developers and users alike. In order to integrate data from multiple sources, it’s necessary to determine the identification values (or keys) from each source (e.g. the employee id in an employee list, the part number in a parts list). The idea of matching is aligning data from different sources with the same identification values. While numeric values are easy to identify and match (using the “=” operator), character-based values can be more complex (due to spelling irregularities, synonyms, and mistakes).
Even though it’s highly tactical, Identification and matching is important to consider within a Data Strategy to ensure that data integration is processed consistently. And one of the (main) reasons that data variances continue to exist within companies (despite their investments in platforms, tools, and repositories) is because the need for standardized Identification and Matching has not been addressed.
Survivorship is a pretty basic concept: the selection of the values to retain (or survive) from the different sources that are merged. Survivorship rules are often unique for each data integration process and typically determined by the developer. In the context of a data strategy, it’s important to identify the “systems of reference” because the identification of these systems provide clarity to developers and users to understand which data elements to retain when integrating data from multiple systems.
Standardize / Cleanse
The premise of data standardization and cleansing is to identify inaccurate data and correct and reformat the data to match the requirements (or the defined standards) for a specific business element. This is likely the single most beneficial process to improve the business value (and the usability) of data. The most common challenge to data standardization and cleansing is that it can be difficult to define the requirements. The other challenge is that most users aren’t aware that their company’s data isn’t standardized and cleansed as a matter of practice. Even though most companies have multiple tools to cleanup addresses, standardize descriptive details, and check the accuracy of values, the use of these tools is not common.
Wikipedia defines reference data as data that is used to classify or categorize other data. In the context of a data strategy, reference data is important because it ensures the consistency of data usage and meaning across different systems and business areas. Successful reference data means that details are consistently identified, represented, and formatted the same way across all aspects of the company (if the color of a widget is “RED”, then the value is represented as “RED” everywhere – not “R” in product information system, 0xFF0000 in inventory system, and 0xED2939 in product catalog). A Reference Data initiative is often aligned with a company’s data strategy initiative because of its impact to data sharing and reuse.
The idea of movement is to record the different systems that a data element touches as it travels (and is processed) after the data element is created. Movement tracking (or data lineage) is quite important when the validity and accuracy of a particular data value is questioned. And in the current era of heightened consumer data privacy and protection, the need for data lineage and tracking of consumer data within a company is becoming a requirement (and it’s the law in California and the European Union).
The dramatic increase in the quantity and diversity of data sources within most companies over the past few years has challenged even the most technology advanced organizations. It’s not uncommon to find one of the most visible areas of user frustration to be associated with accessing new (or additional) data sources. Much of this frustration occurs because of the challenge in sourcing, integrating, cleansing, and standardizing new data content to be shared with users. As is the case with all of the other components, the details are easy to understand, but complex to implement. A company’s data strategy has to evolve and change when data sharing becomes a production business requirement and users want data that is “ready to use”.
This blog is the 2nd in a series focused on reviewing the individual Components of a Data Strategy. This edition discusses the concept of data provisioning and the various details of making data sharable.
The definition of Provision is:
“Supplying data in a sharable form while respecting all rules and access guidelines”
One of the biggest frustrations that I have in the world of data is that few organizations have established data sharing as a responsibility. Even fewer have setup the data to be ready to share and use by others. It’s not uncommon for a database programmer or report developer to have to retrieve data from a dozen different systems to obtain the data they need. And, the data arrives in different formats and files that change regularly. This lack of consistency generates large ongoing maintenance costs and requires an inordinate amount of developer time to re-transform, prepare, fix data to be used (numerous studies have found that ongoing source data maintenance can take as much of 50% of the database developers time after the initial programming effort is completed).
Should a user have to know the details (or idiosyncrasies) of the application system that created the data to use the data? (That’s like expecting someone to understand the farming of tomatoes and manufacturing process of ketchup in order to be able to put ketchup on their hamburger). The idea of Provision is to establish the necessary rigor to simplify the sharing of data.
I’ve identified 5 of the most common facets of data sharing in the illustration above – there are others. As a reminder (from last week’s blog), each facet should be considered individually. And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely to want to review the different options for each facet. Each facet can target a small organization’s issues or expand to address a diverse enterprise’s needs.
This is the most obvious aspect of provisioning: structuring and formatting the data in a clear and understandable manner to the data consumer. All too often data is packaged at the convenience of the developer instead of the convenience of the user. So, instead of sharing data as a backup file generated by an application utility in a proprietary (or binary) format, the data should be formatted so every field is labeled and formatted (text, XML) for a non-technical user to access using easily available tools. The data should also be accompanied with metadata to simplify access.
This facet works with Packaging and addresses the details associated with the data container. Data can be shared via a file, a database table, an API, or one of several other methods. While sharing data in a programmer generated file is better than nothing, a more effective approach would be to deliver data in a well-known file format (such as Excel) or within a table contained in an easily accessible database (e.g. data lake or data warehouse).
Source data stewardship is critical in the sharing of data. In this context, a Source Data Steward is someone that is responsible for supporting and maintaining the shared data content (there several different types of data stewards). In some companies, there’s a data steward responsible for the data originating from an individual source system. Some companies (focused on sharing enterprise-level content) have positioned data stewards to support individual subject areas. Regardless of the model used, the data steward tracks and communicates source data changes, monitors and maintains the shared content, and addresses support needs. This particular role is vital if your organization is undertaking any sort of data self-service initiative.
This item addresses the issues that are common in the world of electronic data sharing: inconsistency, change, and error. Acceptance checking is a quality control process that reviews the data prior to distribution to confirm that it matches a set of criteria to ensure that all downstream users receive content as they expect. This item is likely the easiest of all details to implement given the power of existing data quality and data profiling tools. Unfortunately, it rarely receives attention because of most organization’s limited experience with data quality technology.
In order to succeed in any sort of data sharing initiative, whether in supporting other developers or an enterprise data self-service initiative, it’s important to identify the audience that will be supported. This is often the facet to consider first, and it’s valuable to align the audience with the timeframe of data sharing support. It’s fairly common to focus on delivering data sharing for developers support first followed by technical users and then the large audience of business users.
In the era of “data is a business asset” , data sharing isn’t a courtesy, it’s an obligation. Data sharing shouldn’t occur at the convenience of the data producer, it should be packaged and made available for the ease of the user.
Because the idea of building a data strategy is a fairly new concept in the world of business and information technology (IT), there’s a fair amount of discussion about the pieces and parts that comprise a Data Strategy. Most IT organizations have invested heavily in developing plans to address platforms, tools, and even storage. Those IT plans are critical in managing systems and capturing and retaining content generated by a company’s production applications. Unfortunately, those details don’t typically address all of the data activities that occur after an application has created and processed data from the initial business process. The reasons that folks take on the task of developing a Data Strategy is because of the challenges in finding, identifying, sharing, and using data. In any company, there are numerous roles and activities involved in delivering data to support business processing and analysis. A successful Data Strategy must support the breadth of activities necessary to ensure that data is “ready to use”.
There are five core components in a data strategy that work together as building blocks to address the various details necessary to comprehensively support the management and usage of data.
Identify The ability to identify data and understand its meaning regardless of its structure, origin, or location.
This concept is pretty obvious, but it’s likely one of the biggest obstacles in data usage and sharing. All too often, companies have multiple and different terms for specific business details (customer: account, client, patron; income: earnings, margin, profit). In order to analyze, report, or use data, people need to understand what it’s called and how to identify it. Another aspect of Identify is establishing the representation of the data’s value (Are the company’s geographic locations represented by name, number, or an abbreviation?) A successful Data Strategy would identify the gaps and needs in this area and identify the necessary activities and artifacts required to standardize data identification and representation.
Provision Enabling data to be packaged and made available while respecting all rules and access guidelines.
Data is often shared or made available to others at the convenience of the source system’s developers. The data is often accessible via database queries or as a series of files. There’s rarely any uniformity across systems or subject areas, and usage requires programming level skills to analyze and inventory the contents of the various tables or files. Unfortunately, the typical business person requiring data is unlikely to possess sophisticated programming and data manipulation skills. They don’t want raw data (that reflects source system formats and inaccuracies), they want data that is uniformly formatted and documented that is ready to be added to their analysis activities.
The idea of Provision is to package and provide data that is “ready to use”. A successful Data Strategy would identify the various data sharing needs and identify the necessary methods, practices, and tooling required to standardize data packaging and sharing.
Store Persisting data in a structure and location that supports access and processing across the enterprise.
Most IT organizations have solid plans for addressing this area of a Data Strategy. It’s fairly common for most companies to have a well-defined set of methods to determine the platform where online data is stored and processed, how data is archived for disaster recovery, and all of the other details such as protection, retention, and monitoring.
As the technology world has evolved, there are other facets of this area that require attention. The considerations include managing data distributed across multiple locations (the cloud, premise systems, and even multiple desktops), privacy and protection, and managing the proliferation of copies. With the emergence of new consumer privacy laws, it’s risky to store multiple copies of data, and it’s become necessary to track all existing copies of content. A successful Data Strategy ensures that any created data is always available for future access without requiring everyone to create their own copy.
Assemble Standardizing, combining, and moving data residing in multiple locations and providing a unified view.
It’s no secret that data integration is one of the more costly activities occurring within an IT organization; nearly 40% of the cost of new development is consumed by data integration activities. And Assemble isn’t limited to integration, it also includes correcting, standardizing, and formatting the content to make it “ready to use”.
With the growth of analytics and desktop decisioning making, the need to continually analyze and include new data sets into the decision-making process has exploded. Processing (or preparing or wrangling) data is no longer confined to the domain of the IT organization, it has become an end user activity. A successful Data Strategy had to ensure that all users can be self-sufficient in their abilities to process data.
Govern Establishing and communicating information rules, policies, and mechanisms to ensure effective data usage.
While most organizations are quick to identify their data as a core business asset, few have put the necessary rigor in place to effectively manage data. Data Governance is about establishing rules, policies, and decision mechanisms to allow individuals to share and use data in a manner that respects the various (legal and usage) guidelines associated with that data. The inevitable challenge with Data Governance is adoption by the entire data supply chain – from application developers to report developers to end users. Data Governance isn’t a user-oriented concept, it’s a data-oriented concept. A successful Data Strategy identifies the rigor necessary to ensure a core business asset is managed and used correctly.
The 5 Components of a Data Strategy is a framework to ensure that all of a company’s data usage details are captured and organized and that nothing is unknowingly overlooked. A successful Data Strategy isn’t about identifying every potential activity across the 5 different components. It’s about making sure that all of the identified solutions to the problems in accessing, sharing, and using data are reviewed and addressed in a thorough manner.
I read an interesting tidbit about data the other day: the United States Postal Service processed more than 47 million changes of addresses in the last year. That’s nearly 1 in 6 people. In the world of data, that factoid is a simple example of the challenge of addressing stale data and data quality. The idea of stale data is that as data ages, its accuracy and associated business rules can change.
There’s lots of examples of how data in your data warehouse can age and degrade in accuracy and quality: people move, area codes change, postal/zip codes change, product descriptions change, and even product SKUs can change. Data isn’t clean and accurate forever; it requires constant review and maintenance. This shouldn’t be much of a surprise for folks that view data as a corporate asset; any asset requires ongoing maintenance in order to retain and ensure its value. The challenge with maintaining any asset is establishing a reasonable maintenance plan.
Unfortunately, while IT teams are exceptionally strong in planning and carrying out application maintenance, it’s quite rare that data maintenance gets any attention. In the data warehousing world, data maintenance is typically handled in a reactive, project-centric manner. Nearly every data warehouse (or reporting) team has to deal with data maintenance issues whenever a company changes major business processes or modifies customer or product groupings (e.g. new sales territories, new product categories, etc.) This happens so often, most data warehouse folks have even given it a name: Recasting History. Regardless of what you call it, it’s a common occurrence and there are steps that can be taken to simplify the ongoing effort of data maintenance.
- Establish a regularly scheduled data maintenance window. Just like the application maintenance world, identify a window of time when data maintenance can be applied without impacting application processing or end user access
- Collect and publish data quality details. Profile and track the content of the major subject area tables within your data warehouse environment. Any significant shift in domain values, relationship details, or data demographics can be discovered prior to a user calling to report an undetected data problem
- Keep the original data. Most data quality processing overwrites original content with new details. Instead, keep the cleansed data and place the original values at the end of your table records. While this may require a bit more storage, it will dramatically simplify maintenance when rule changes occur in the future
- Add source system identification and creation date/time details to every record. While this may seem tedious and unnecessary, these two fields can dramatically simplify maintenance and trouble shooting in the future
- Schedule a regular data change control meeting. This too is similar in concept to the change control meeting associated with IT operations teams. This is a forum for discussing data content issues and changes
Unfortunately, I often find that data maintenance is completely ignored. The problem is that fixing broken or inaccurate data isn’t sexy; developing a data maintenance plan isn’t always fun. Most data warehouse development teams are buried with building new reports, loading new data, or supporting the ongoing ETL jobs; they haven’t given any attention to the quality or accuracy of the actual content they’re moving and reporting. They simply don’t have the resources or time to address data maintenance as a proactive activity.
Business users clamor for new data and new reports; new funding is always tied to new business capabilities. Support costs are budgeted, but they’re focused on software and hardware maintenance activities. No one ever considers data maintenance; it’s simply ignored and forgotten.
Interesting that we view data as a corporate asset – a strategic corporate asset – and there’s universal agreement that hardware and software are simply tools to support enablement. And where are we investing in maintenance? The commodity tools, not the strategic corporate asset.
Photo courtesy of DesignzillasFlickr via Flickr (Creative Commons license).
At the recent Gartner MDM Summit in Las Vegas I was approached at least a half a dozen times by people wondering what MDM vendor to choose. I gave my usual response, which was, “What are you trying to accomplish?”
Normally a (short) conversation ensues of functions, feeds and speeds, which then leads to my next question, “So, what are your priorities and decision criteria? The responses were all the same, and I have to admit that they surprised me.
“We know we need MDM, but our company hasn’t really decided what MDM is. Since we’re already a [Microsoft / IBM / SAP / Oracle / SAS] shop, we just thought we’d buy their product…so what do you think of their product?”
I find this type of question interesting and puzzling. Why would anyone blindly purchase a product because of the vendor, rather than focusing on needs, priorities, and cost metrics? Unless a decision has absolutely no risk or cost, I’m not clear how identifying a vendor before identifying the requirements could possibly have a successful outcome.
If I look in my refrigerator, not all my products have the same brand label. My taste, interests, and price tolerance vary based upon the product. My catsup comes from one company, my salad dressing comes from another, and I have about seven different types of mustard (long story). Likewise, my TV, DVD player, surround sound system, DVR, and even my remote control are all different brands. Despite the advertisers’ claims, no single company has the best feature set across all products. For those of you who are loyal to a single brand, you can stop reading now. I’m sure you think I’m nuts.
The fact is that different vendors have different strengths, and this causes their products to differ. Buyers of these products should focus on their requirements and needs, not the product’s functions and features. Somehow this type of logic seems to escape otherwise smart business people. A good decision can deliver enormous benefits to a company; a bad decision can deliver enormous benefits to a company’s competitors.
What other reason would there be for someone saying, “We’re a [vendor name here] shop?” Examples abound of vendors abandoning products. IBM’s Intelligent Miner data mining tool, OS/2, the Apple Newton, Microsoft Money are but a few of the many examples.
Working with a reputable vendor is smart. Gathering requirements, reviewing product features, and determining the best match creates the opportunity for developing a client/vendor partnership. So why would anyone throw all of that out and just decide to pick a vendor? I guess lots of folks thought that Bernie Madoff was their partner. Need I say more?
photo by xJasonRogersx via Flickr (Creative Common License)
I’ve been making the point in the past several years that master data management (MDM) development
projects are different, and are accompanied by unique challenges. Because of the “newness” of MDM and its unique value proposition, MDM development can challenge traditional IT development assumptions.
MDM is very much a transactional processing system; it receives application requests, processes them, and returns a result. The complexities of transaction management, near real-time processing, and the details associated security, logging, and application interfaces are a handful. Most OLTP applications assume that the provided data is usable; if the data is unacceptable, the application simply returns an error. Most OLTP developers are accustomed to addressing these types of functional requirements. Dealing with imperfect data has traditionally been unacceptable because it slowed down processing; ignoring it or returning an error was a best practice.
The difference about MDM development is the focus on data content (and value-based) processing. The whole purpose MDM is to deal with all data, including the unacceptable stuff. It assumes that the data is good enough. MDM code assumes the data is complex and “unacceptable” and focuses on figuring out the values. The development methods associated with deciphering, interpreting, or decoding unacceptable data to make it usable is very different. It requires a deep understanding of a different type of business rule – those associated with data content. Because most business processes have data inputs and data outputs, there can be dozens of data content rules associated with each business process. Traditionally, OLTP developers didn’t focus on the business content rules; they were focused on automating business processes.
MDM developers need to be comfortable with addressing the various data content processing issues (identification, matching, survivorship, etc.) along with the well understood issues of OLTP development (transaction management, high performance, etc.) We’ve learned that the best MDM development environments invest heavily in data analysis and data management during the initial design and development stages. They invest in profiling and analyzing each system of creation. They also differentiate hub development from source on-boarding and hub administration. The team that focuses on application interfaces, CRUD processing, and transaction & bulk processing requires different skills from those developers focused on match processing rules, application on-boarding, and hub administration. The developers focused on hub construction are different than those team members focused on the data changes and value questions coming from data stewards and application developers. This isn’t about differentiating development from maintenance; this is about differentiating the skills associated with the various development activities.
If the MDM team does its job right it can dramatically reduce the data errors that cause application processing and reporting problems. They can identify and quantify data problems so that other development teams can recognize them, too. This is why MDM development is critical to creating the single version of truth.
Image via cafepress.com.
One of many discussions I heard over Thanksgiving turkey was, “How could the government have let the financial crisis happen?” To which the most frequent response was that regulators were asleep at the wheel. True or not, one could legitimately ask why we have problems with our business intelligence reports. The data is bad and the report is meaningless—who’s asleep at the wheel?
Everyone’s talking about the single version of the truth, but how often are our reports reviewed for accuracy? Several of our financial services clients demand that their BI reports are audited back to the source systems and that numbers are reconciled.
Unfortunately, this isn’t common practice across industries. When we work with new clients we ask about data reconciliation, but most of our new clients don’t have the methods or processes in place. It makes me wonder how engaged business users are in establishing audit and reconciliation rules for their BI capabilities.
No, data perfection isn’t practical. But we should be able to guard against lost data and protect our users from formulas and equations that change. All too often these issues are thrown into the “post development” bucket or relegated to User Acceptance. By then reports aren’t always corrected and data isn’t always fixed.
A robust development process should ensure that data accuracy should be established and measured throughout development. This means that design reviews are necessary before, during, and after development. Design reviews ensure that the data is continually being processed accurately. Many believe that it’s ten or more times more expensive to fix broken code (or data) after development than it is during development. And, as we’ve all seen, often the data doesn’t get fixed at all.
When you’re building a report or delivering data, ask two questions: 1) whether the numbers reflect business expectations, and 2) if they reconcile back to their system of origin. Design review processes should be instituted (or, in many cases, re-instituted) to ensure functional accuracy long before the user every sees the data on her desktop.
When it comes to bad data, a lot of the problem stems from companies letting their developers off the hook. That’s right. When it comes to delivering, maintaining, and justifying their code, developers are given a lot of rope. When projects start, everyone nods their head in agreement when data quality comes up. But then there’s scope creep and sizing mistakes, and projects run long.
People start looking for things to remove. And writing error detection and correction code is not only complicated, it’s not sexy. It’s like writing documentation; no one wants to do it because it’s detailed and time consuming. This is the finish work: it’s the fancy veneer, the polished trim, and the paint color. Software vendors get this. If a data entry error shows up in a demo or a software review, it could make or break that product’s reputation. When was the last time any Windows product let you save a file with an invalid name? It doesn’t happen. The last thing a Word user needs is to sweat blood over a document and then never be able to open it again because it was named with an untypeable character.
Error detection and correction code are core aspects of development and require rigorous review. Accurate data isn’t just a business requirement—it’s common sense. Users shouldn’t have to explain to developers why inaccurate values aren’t allowed. Do you think that the business users at Amazon.com had to tell their developers that “The Moon” was an invalid delivery address? But all too often developers don’t think they have any responsibility for data entry errors.
When a system creates data, and when that data leaves that system, the data should be checked and corrected. Bad data should be viewed as a hazardous material that should not be transported. The moment you generate data, you have the implicit responsibility to establish its accuracy and integrity. Distributing good data to your competitors is unacceptable; distributing bad data to your team is irresponsible. And when bad data is ignored, it’s negligence.
While everyone—my staff members, included—wants to talk about data governance, policy-making, and executive councils, it all starts with bad data being input into systems in the first place. So, what if we fixed it at the beginning?
Photo by Random J via Flickr (Creative Commons License)
A recent client experience reminds me what I’ve always said about data quality: it isn’t the same as data perfection. After all, how could it be? A lot of people think that correcting data is a post-facto activity based on opinion and anecdotal problems. But it should be an entrenched process.
One drop of motor oil can pollute 25 quarts of drinking water. But it’s not the same with data. On the other hand, an average of less than 75 insect fragments per 50 grams of wheat flour is acceptable. (Jill says this is “apocryphal,” but you get my point.)
People forget that the definition of data quality is data that’s fit for purpose. It conforms to requirements. You only have to look back at the work of Philip Crosby and W. Edwards Demming to understand that quality is about conformance to requirements. We need to understand the variance between the data as it exists and its acceptability, not its perfection.
The reason data quality gets so much attention is when bad data gets in the way of getting the job done. If I want to send an e-mail to 10,000 customers and one customer’s zip code is unknown, it doesn’t prevent me from contacting the other 9999 customers. That can amount to what in any CMO’s estimation is a very successful marketing campaign. The question should be: What data helps us get the job done?
Our client is a regional bank that has retained Baseline to work with its call center staff. Customer service reps (CSRs) have been frustrated that they get multiple records for the same customer. They had to jump through hoops to find the right data, often while the customer waited on the phone, or on-line. The problem wasn’t that the data was “bad”—it was that the CSRs could only use the customer’s phone number to look up the record. If the phone number was incorrect, the CSR can’t do her job. And as a result, her compensation suffers. So data quality is very important to her. And to the bank at large.
Users are all too accustomed to complaining about data. The goal of data quality should be continuous improvement, ensuring a process is available to fix data when it’s broken. If you want to address data quality, focus energy on the repair process. As long as your business is changing—and I hope it is—its data will continue to change. Data requirements, measurements, and the reference points for acceptability will keep changing too. If you’re involved in a data quality program, think of it as job security.