I read an interesting tidbit about data the other day: the United States Postal Service processed more than 47 million changes of address in the last year. That’s nearly 1 in 6 people. In the world of data, that factoid is a simple example of the challenge of addressing stale data and data quality. The idea of stale data is that as data ages, its accuracy and associated business rules can change.
There are lots of examples of how data in your data warehouse can age and degrade in accuracy and quality: people move, area codes change, postal/zip codes change, product descriptions change, and even product SKUs can change. Data isn’t clean and accurate forever; it requires constant review and maintenance. This shouldn’t be much of a surprise for folks who view data as a corporate asset; any asset requires ongoing maintenance in order to retain and ensure its value. The challenge with maintaining any asset is establishing a reasonable maintenance plan.
Unfortunately, while IT teams are exceptionally strong in planning and carrying out application maintenance, it’s quite rare that data maintenance gets any attention. In the data warehousing world, data maintenance is typically handled in a reactive, project-centric manner. Nearly every data warehouse (or reporting) team has to deal with data maintenance issues whenever a company changes major business processes or modifies customer or product groupings (e.g. new sales territories, new product categories, etc.). This happens so often that most data warehouse folks have even given it a name: Recasting History. Regardless of what you call it, it’s a common occurrence and there are steps that can be taken to simplify the ongoing effort of data maintenance.
- Establish a regularly scheduled data maintenance window. Just like the application maintenance world, identify a window of time when data maintenance can be applied without impacting application processing or end user access
- Collect and publish data quality details. Profile and track the content of the major subject area tables within your data warehouse environment. Any significant shift in domain values, relationship details, or data demographics can be discovered prior to a user calling to report an undetected data problem
- Keep the original data. Most data quality processing overwrites original content with new details. Instead, keep the cleansed data and place the original values at the end of your table records. While this may require a bit more storage, it will dramatically simplify maintenance when rule changes occur in the future
- Add source system identification and creation date/time details to every record. While this may seem tedious and unnecessary, these two fields can dramatically simplify maintenance and troubleshooting in the future
- Schedule a regular data change control meeting. This too is similar in concept to the change control meeting associated with IT operations teams. This is a forum for discussing data content issues and changes
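To make a few of these steps concrete, here is a minimal Python sketch of keeping the original values, stamping every record with its source system and creation time, and spotting a shift in domain values between two profiling runs. The function names, the `orig_` prefix, and the sample fields are hypothetical choices for illustration; a real warehouse would do this in its own ETL tooling.

```python
from datetime import datetime, timezone


def cleanse_record(raw, source_system, cleanse):
    """Return a cleansed copy of `raw` that keeps originals and provenance."""
    record = {field: cleanse(value) for field, value in raw.items()}
    # Original values are appended after the cleansed ones (hypothetical "orig_" prefix).
    record.update({f"orig_{field}": value for field, value in raw.items()})
    # The two provenance fields: where the record came from and when it was created.
    record["source_system"] = source_system
    record["created_at"] = datetime.now(timezone.utc).isoformat()
    return record


def domain_shift(previous_values, current_values):
    """Report values that appeared or disappeared between two profiling runs."""
    before, after = set(previous_values), set(current_values)
    return {
        "new_values": sorted(after - before),
        "missing_values": sorted(before - after),
    }


row = cleanse_record(
    {"city": " boston ", "state": "ma"},
    source_system="crm",
    cleanse=lambda value: value.strip().upper(),
)
shift = domain_shift(["MA", "NY", "MA"], ["MA", "NY", "ZZ"])
```

Because the uncleansed values travel with the record, a future rule change can be replayed against `orig_city` and `orig_state` without going back to the source system. The unexpected `ZZ` state code shows up in `shift["new_values"]` before a user has to call about it.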
Unfortunately, I often find that data maintenance is completely ignored. The problem is that fixing broken or inaccurate data isn’t sexy; developing a data maintenance plan isn’t always fun. Most data warehouse development teams are buried with building new reports, loading new data, or supporting the ongoing ETL jobs; they haven’t given any attention to the quality or accuracy of the actual content they’re moving and reporting. They simply don’t have the resources or time to address data maintenance as a proactive activity.
Business users clamor for new data and new reports; new funding is always tied to new business capabilities. Support costs are budgeted, but they’re focused on software and hardware maintenance activities. No one ever considers data maintenance; it’s simply ignored and forgotten.
Interesting that we view data as a corporate asset – a strategic corporate asset – and there’s universal agreement that hardware and software are simply tools to support enablement. And where are we investing in maintenance? The commodity tools, not the strategic corporate asset.
Photo courtesy of DesignzillasFlickr via Flickr (Creative Commons license).
At the recent Gartner MDM Summit in Las Vegas I was approached at least a half a dozen times by people wondering what MDM vendor to choose. I gave my usual response, which was, “What are you trying to accomplish?”
Normally a (short) conversation ensues about functions, feeds, and speeds, which then leads to my next question, “So, what are your priorities and decision criteria?” The responses were all the same, and I have to admit that they surprised me.
“We know we need MDM, but our company hasn’t really decided what MDM is. Since we’re already a [Microsoft / IBM / SAP / Oracle / SAS] shop, we just thought we’d buy their product…so what do you think of their product?”
I find this type of question interesting and puzzling. Why would anyone blindly purchase a product because of the vendor, rather than focusing on needs, priorities, and cost metrics? Unless a decision has absolutely no risk or cost, I’m not clear how identifying a vendor before identifying the requirements could possibly have a successful outcome.
If I look in my refrigerator, not all my products have the same brand label. My taste, interests, and price tolerance vary based upon the product. My catsup comes from one company, my salad dressing comes from another, and I have about seven different types of mustard (long story). Likewise, my TV, DVD player, surround sound system, DVR, and even my remote control are all different brands. Despite the advertisers’ claims, no single company has the best feature set across all products. For those of you who are loyal to a single brand, you can stop reading now. I’m sure you think I’m nuts.
The fact is that different vendors have different strengths, and this causes their products to differ. Buyers of these products should focus on their requirements and needs, not the product’s functions and features. Somehow this type of logic seems to escape otherwise smart business people. A good decision can deliver enormous benefits to a company; a bad decision can deliver enormous benefits to a company’s competitors.
What other reason would there be for someone saying, “We’re a [vendor name here] shop?” Examples abound of vendors abandoning products. IBM’s Intelligent Miner data mining tool, OS/2, the Apple Newton, and Microsoft Money are but a few of the many examples.
Working with a reputable vendor is smart. Gathering requirements, reviewing product features, and determining the best match creates the opportunity for developing a client/vendor partnership. So why would anyone throw all of that out and just decide to pick a vendor? I guess lots of folks thought that Bernie Madoff was their partner. Need I say more?
Photo by xJasonRogersx via Flickr (Creative Commons License)
I’ve been making the point in the past several years that master data management (MDM) development projects are different, and are accompanied by unique challenges. Because of the “newness” of MDM and its unique value proposition, MDM development can challenge traditional IT development assumptions.
MDM is very much a transactional processing system; it receives application requests, processes them, and returns a result. The complexities of transaction management, near real-time processing, and the details associated with security, logging, and application interfaces are a handful. Most OLTP applications assume that the provided data is usable; if the data is unacceptable, the application simply returns an error. Most OLTP developers are accustomed to addressing these types of functional requirements. Dealing with imperfect data has traditionally been unacceptable because it slowed down processing; ignoring it or returning an error was a best practice.
The difference about MDM development is the focus on data content (and value-based) processing. The whole purpose of MDM is to deal with all data, including the unacceptable stuff. It assumes that the data is good enough. MDM code assumes the data is complex and “unacceptable” and focuses on figuring out the values. The development methods associated with deciphering, interpreting, or decoding unacceptable data to make it usable are very different. They require a deep understanding of a different type of business rule: those associated with data content. Because most business processes have data inputs and data outputs, there can be dozens of data content rules associated with each business process. Traditionally, OLTP developers didn’t focus on the business content rules; they were focused on automating business processes.
MDM developers need to be comfortable with addressing the various data content processing issues (identification, matching, survivorship, etc.) along with the well understood issues of OLTP development (transaction management, high performance, etc.). We’ve learned that the best MDM development environments invest heavily in data analysis and data management during the initial design and development stages. They invest in profiling and analyzing each system of creation. They also differentiate hub development from source on-boarding and hub administration. The team that focuses on application interfaces, CRUD processing, and transaction & bulk processing requires different skills from those developers focused on match processing rules, application on-boarding, and hub administration. The developers focused on hub construction are different than those team members focused on the data changes and value questions coming from data stewards and application developers. This isn’t about differentiating development from maintenance; this is about differentiating the skills associated with the various development activities.
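For readers who haven’t written data content rules before, here is a rough Python sketch of what matching and survivorship can look like. The rules themselves (match on a normalized email address, let the newest non-empty value win) and every name in the code are assumptions made for illustration; real hubs use far richer, product-specific rules.

```python
from collections import defaultdict


def match_key(record):
    """Hypothetical matching rule: the same normalized email is the same customer."""
    return record["email"].strip().lower()


def survive(records):
    """Hypothetical survivorship rule: the newest non-empty value wins per field."""
    golden = {}
    # ISO-formatted dates sort correctly as strings, so later records overwrite earlier ones.
    for record in sorted(records, key=lambda r: r["updated"]):
        for field, value in record.items():
            if value not in ("", None):
                golden[field] = value
    return golden


def build_golden_records(records):
    """Group source records by match key, then merge each group into one record."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [survive(group) for group in groups.values()]


sources = [
    {"email": "Pat@Example.com ", "phone": "555-0100", "updated": "2009-01-05"},
    {"email": "pat@example.com", "phone": "", "updated": "2009-06-01"},
    {"email": "lee@example.com", "phone": "555-0199", "updated": "2009-03-12"},
]
golden = build_golden_records(sources)
```

Notice that nothing is rejected. The messy email and the blank phone number are interpreted rather than bounced back with an error, which is the shift in mindset from traditional OLTP development.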
If the MDM team does its job right it can dramatically reduce the data errors that cause application processing and reporting problems. They can identify and quantify data problems so that other development teams can recognize them, too. This is why MDM development is critical to creating the single version of truth.
Image via cafepress.com.
One of many discussions I heard over Thanksgiving turkey was, “How could the government have let the financial crisis happen?” To which the most frequent response was that regulators were asleep at the wheel. True or not, one could legitimately ask why we have problems with our business intelligence reports. The data is bad and the report is meaningless—who’s asleep at the wheel?
Everyone’s talking about the single version of the truth, but how often are our reports reviewed for accuracy? Several of our financial services clients demand that their BI reports are audited back to the source systems and that numbers are reconciled.
Unfortunately, this isn’t common practice across industries. When we work with new clients we ask about data reconciliation, but most of our new clients don’t have the methods or processes in place. It makes me wonder how engaged business users are in establishing audit and reconciliation rules for their BI capabilities.
No, data perfection isn’t practical. But we should be able to guard against lost data and protect our users from formulas and equations that change. All too often these issues are thrown into the “post development” bucket or relegated to User Acceptance. By then reports aren’t always corrected and data isn’t always fixed.
A robust development process should ensure that data accuracy is established and measured throughout development. This means that design reviews are necessary before, during, and after development. Design reviews ensure that the data is continually being processed accurately. Many believe that it’s ten or more times more expensive to fix broken code (or data) after development than it is during development. And, as we’ve all seen, often the data doesn’t get fixed at all.
When you’re building a report or delivering data, ask two questions: 1) whether the numbers reflect business expectations, and 2) if they reconcile back to their system of origin. Design review processes should be instituted (or, in many cases, re-instituted) to ensure functional accuracy long before the user ever sees the data on her desktop.
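The second question, reconciling back to the system of origin, can be as simple as comparing row counts and control totals. Here is a minimal sketch; the `reconcile` function, the `amount` field, and the tolerance are all hypothetical, and a real audit would normally run against the source and warehouse databases rather than in-memory lists.

```python
def reconcile(source_rows, report_rows, amount_field="amount", tolerance=0.005):
    """Compare row counts and control totals between a source extract and a report."""
    source_total = sum(row[amount_field] for row in source_rows)
    report_total = sum(row[amount_field] for row in report_rows)
    difference = source_total - report_total
    return {
        "row_count_matches": len(source_rows) == len(report_rows),
        "total_matches": abs(difference) <= tolerance,
        "difference": round(difference, 2),
    }


source = [{"amount": 100.00}, {"amount": 250.50}, {"amount": 75.25}]
report = [{"amount": 100.00}, {"amount": 250.50}]  # one row was lost in the load

result = reconcile(source, report)
```

A check like this guards against lost data, and running it on every load means a dropped row is caught during development rather than on the user’s desktop.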
When it comes to bad data, a lot of the problem stems from companies letting their developers off the hook. That’s right. When it comes to delivering, maintaining, and justifying their code, developers are given a lot of rope. When projects start, everyone nods their head in agreement when data quality comes up. But then there’s scope creep and sizing mistakes, and projects run long.
People start looking for things to remove. And writing error detection and correction code is not only complicated, it’s not sexy. It’s like writing documentation; no one wants to do it because it’s detailed and time consuming. This is the finish work: it’s the fancy veneer, the polished trim, and the paint color. Software vendors get this. If a data entry error shows up in a demo or a software review, it could make or break that product’s reputation. When was the last time any Windows product let you save a file with an invalid name? It doesn’t happen. The last thing a Word user needs is to sweat blood over a document and then never be able to open it again because it was named with an untypeable character.
Error detection and correction code are core aspects of development and require rigorous review. Accurate data isn’t just a business requirement—it’s common sense. Users shouldn’t have to explain to developers why inaccurate values aren’t allowed. Do you think that the business users at Amazon.com had to tell their developers that “The Moon” was an invalid delivery address? But all too often developers don’t think they have any responsibility for data entry errors.
When a system creates data, and when that data leaves that system, the data should be checked and corrected. Bad data should be viewed as a hazardous material that should not be transported. The moment you generate data, you have the implicit responsibility to establish its accuracy and integrity. Distributing good data to your competitors is unacceptable; distributing bad data to your team is irresponsible. And when bad data is ignored, it’s negligence.
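Here is what checking data before it leaves a system might look like in a few lines of Python. The address fields, the US ZIP code rule, and the function names are assumptions for the sake of the example, not a prescription for any particular system.

```python
import re

# Assumed rule: a US ZIP code is five digits, optionally followed by a four-digit extension.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def check_address(record):
    """Return a list of problems; an empty list means the record may be shipped."""
    problems = []
    if not record.get("street", "").strip():
        problems.append("street is missing")
    if not ZIP_PATTERN.fullmatch(record.get("zip", "")):
        problems.append("zip is not a valid US ZIP code")
    return problems


def export(records):
    """Split records into those safe to send and those held back for correction."""
    clean, quarantined = [], []
    for record in records:
        problems = check_address(record)
        if problems:
            quarantined.append((record, problems))
        else:
            clean.append(record)
    return clean, quarantined


clean, quarantined = export([
    {"street": "1 Main St", "zip": "02134"},
    {"street": "The Moon", "zip": ""},
])
```

The bad record is held back along with the reasons it failed, so it can be corrected at the source instead of being transported downstream.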
While everyone—my staff members, included—wants to talk about data governance, policy-making, and executive councils, it all starts with bad data being input into systems in the first place. So, what if we fixed it at the beginning?
Photo by Random J via Flickr (Creative Commons License)
A recent client experience reminds me what I’ve always said about data quality: it isn’t the same as data perfection. After all, how could it be? A lot of people think that correcting data is a post-facto activity based on opinion and anecdotal problems. But it should be an entrenched process.
One drop of motor oil can pollute 25 quarts of drinking water. But it’s not the same with data. On the other hand, an average of less than 75 insect fragments per 50 grams of wheat flour is acceptable. (Jill says this is “apocryphal,” but you get my point.)
People forget that the definition of data quality is data that’s fit for purpose. It conforms to requirements. You only have to look back at the work of Philip Crosby and W. Edwards Deming to understand that quality is about conformance to requirements. We need to understand the variance between the data as it exists and its acceptability, not its perfection.
The reason data quality gets so much attention is that bad data gets in the way of getting the job done. If I want to send an e-mail to 10,000 customers and one customer’s zip code is unknown, it doesn’t prevent me from contacting the other 9,999 customers. That can amount to what in any CMO’s estimation is a very successful marketing campaign. The question should be: What data helps us get the job done?
Our client is a regional bank that has retained Baseline to work with its call center staff. Customer service reps (CSRs) have been frustrated that they get multiple records for the same customer. They have had to jump through hoops to find the right data, often while the customer waits on the phone or online. The problem wasn’t that the data was “bad”; it was that the CSRs could only use the customer’s phone number to look up the record. If the phone number is incorrect, the CSR can’t do her job. And as a result, her compensation suffers. So data quality is very important to her. And to the bank at large.
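One way to put “fit for purpose” into practice is to measure the data against the requirements of a specific job rather than against perfection. The sketch below is illustrative only: the `fit_for_purpose` function, the field names, and the thresholds are hypothetical.

```python
def fit_for_purpose(records, required_fields, threshold):
    """Return the share of records usable for a purpose and whether it meets the bar."""
    if not records:
        raise ValueError("no records to measure")
    usable = sum(
        1 for record in records
        if all(record.get(field) for field in required_fields)
    )
    rate = usable / len(records)
    return rate, rate >= threshold


# 10,000 customers, one of whom has an unknown zip code.
customers = [{"email": f"c{i}@example.com", "zip": "02134"} for i in range(10_000)]
customers[0]["zip"] = None

# An e-mail campaign only needs an e-mail address (assumed bar: 99% usable).
email_rate, email_ok = fit_for_purpose(customers, ["email"], threshold=0.99)

# A purpose that demands a zip code on every record fails on the same data.
strict_rate, strict_ok = fit_for_purpose(customers, ["email", "zip"], threshold=1.0)
```

The same 10,000 records pass one test and fail the other, which is the point: acceptability depends on the requirement, not on the data alone.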
Users are all too accustomed to complaining about data. The goal of data quality should be continuous improvement, ensuring a process is available to fix data when it’s broken. If you want to address data quality, focus energy on the repair process. As long as your business is changing—and I hope it is—its data will continue to change. Data requirements, measurements, and the reference points for acceptability will keep changing too. If you’re involved in a data quality program, think of it as job security.