One of the challenges in delivering successful data-centric projects (e.g. analytics, BI, or reporting) is realizing that the definition of project success differs from traditional IT application projects. Success for a traditional application (or operational) project is often described in terms of transaction volumes, functional capabilities, processing conformance, and response time; data project success is often described in terms of business process analysis, decision enablement, or business situation measurement. To a business user, the success of a data-centric project is simple: data usability.
It seems that most folks respond to data usability issues by gravitating towards a discussion about data accuracy or data quality; I actually think the more appropriate discussion is data knowledge. I don’t think anyone would argue that to make data-enabled decisions, you need to have knowledge about the underlying data. The challenge is understanding what level of knowledge is necessary. If you ask a BI or Data Warehouse person, their answer almost always includes metadata, data lineage, and a data dictionary. If you ask a data mining person, they often just want specific attributes and their descriptions — they don’t care about anything else. All of these folks have different views of data usability and varying levels (and needs) for data knowledge.
One way to improve data usability is to target and differentiate the user audience based on their data knowledge needs. There are certainly lots of different approaches to categorizing users; in fact, every analyst firm and vendor has their own model to describe different audience segments. One of the problems with these types of models is that they tend to focus heavily on the tools or analytical methods (canned reports, drill down, etc.) and ignore the details of data content and complexity. The knowledge required to manipulate a single subject area (revenue or customer or usage) is significantly less than the knowledge required to manipulate data across three subject areas (revenue, customer, and usage). And what adds to the data knowledge required is the inevitable plethora of value gaps, inaccuracies, and inconsistencies associated with the data. Data knowledge isn’t just limited to understanding the data; it includes understanding how to work around all of the imperfections.
Here’s a model that categorizes and describes business users based on their views of data usability and their data knowledge needs:
Level 1: “Can you explain these numbers to me?”
This person is the casual data user. They have access to a zillion reports that have been identified by their predecessors and they focus their effort on acting on the numbers they get. They’re not a data analyst – their focus is to understand the meaning of the details so they can do their job. They assume that the data has been checked, rechecked, and vetted by lots of folks in advance of their receiving the content. They believe the numbers and they act on what they see.
Level 2: “Give me the details”
This person has been using canned reports, understands all the basic details, and has graduated to using data to answer new questions that weren’t identified by their predecessors. They need detailed data and they want to reorganize the details to suit their specific needs (“I don’t want weekly revenue breakdowns – I want to compare weekday revenue to weekend revenue”). They realize the data is imperfect (and in most instances, they’ll live with it). They want the detail.
Level 3: “I don’t believe the data — please fix it”
These folks know their area of the business inside/out and they know the data. They scour and review the details to diagnose the business problems they’re analyzing. And when they find a data mistake or inaccuracy, they aren’t shy about raising their hand. Whether they’re a data analyst that uses SQL or a statistician with their favorite advanced analytics algorithms, they focus on identifying business anomalies. These folks are the power users that are incredibly valuable and often the most difficult for IT to please.
Level 4: “Give me more data”
This is subject area graduation. At this point, the user has become self-sufficient with their data and needs more content to address a new or more complex set of business analysis needs. Asking for more data – whether a new source or more detail – indicates that the person has exhausted their options in using the data they have available. When someone has the capacity to learn a new subject area or take on more detailed content, they’re illustrating a higher level of data knowledge.
One thing to consider about the above model is that a user will have varying data knowledge based on the individual subject area. A marketing person may be completely self-sufficient on revenue data but be a newbie with usage details. A customer support person may be an expert on customer data but only have limited knowledge of product data. You wouldn’t expect many folks (outside of IT) to be experts on all of the existing data subject areas. Their knowledge is going to reflect the breadth of their job responsibilities.
As someone grows and evolves in business expertise and influence, it’s only natural that their business information needs would grow and evolve too. In order to address data usability (and project success), maybe it makes sense to reconsider the various user audience categories and how they are defined. Growing data knowledge isn’t about making everyone data gurus; it’s about enabling staff members to become self-sufficient in their use of corporate data to do their jobs.
Photo “Ladder of Knowledge” courtesy of degreezero2000 via Flickr (Creative Commons license).
Companies spend a small fortune continually investing and reinvesting in making their business analysts self-sufficient with the latest and greatest analytical tools. Most companies have multiple project teams focused on delivering tools to simplify and improve business decision making. There are likely several standard tools deployed to support the various data analysis functions required across the enterprise: canned/batch reports, desktop ad hoc data analysis, and advanced analytics. There’s never a shortage of new and improved tools that guarantee simplified data exploration, quick response time, and greater data visualization options. Projects inevitably include the creation of dozens of prebuilt screens along with a training workshop to ensure that the users understand all of the new whiz bang features associated with the latest analytic tool incarnation. Unfortunately, the biggest challenge within any project isn’t getting users to master the various analytical functions; it’s ensuring the users understand the underlying data they’re analyzing.
The most prevalent issue with the adoption of a new business analysis tool is the users’ knowledge of the underlying data. This issue becomes visible with a number of common problems: the misuse of report data, the misunderstanding of business terminology, and/or the exaggeration of inaccurate data. Once the credibility or usability of the data comes under scrutiny, the project typically goes into “red alert” and requires immediate attention. If ignored, the business tool quickly becomes shelfware because no one is willing to take a chance on making business decisions based on risky information.
All too often the focus on end user training is tool training, not data training. What typically happens is that an analyst is introduced to the company’s standard analytics tool through a “drink from a fire hose” training workshop. All of the examples use generic sales or HR data to illustrate the tool’s strengths in folding, spindling, and manipulating the data. And this is where the problem begins: the vendor’s workshop data is perfect. There’s no missing or inaccurate data and all of the data is clearly labeled and defined; classes run smoothly, but it just isn’t reality. Somehow the person with no hands-on data experience is supposed to figure out how to use their own (imperfect) data. It’s like giving someone their first ski lesson on a cleanly groomed beginner hill and then taking them to the top of a black diamond (advanced) run with steep hills and moguls. The person works hard but isn’t equipped to deal with the challenges of the real world. So, they give up on the tool and tell others that the solution isn’t usable.
All of the advanced tools and manipulation capabilities don’t do any good if the users don’t understand the data. There are lots of approaches to educating users on data. Some prefer to take a bottom-up approach (reviewing individual table and column names, meanings, and values) while others want to take a top-down approach (reviewing subject area details, the associated reports, and then getting into the data details). There are certainly benefits of one approach over the other (depending on your audience); however, it’s important not to lose sight of the ultimate goal: giving the users the fundamental data knowledge they need to make decisions. The fundamentals that most users need to understand their data include a review of:
- the business subject area associated with their data
- business terms, definitions, and their associated data attributes
- data values and their representations
- business rules and calculations associated with the individual values
- the data’s origin (a summary of the business processes and source system)
The above details may seem a bit overwhelming if you consider that most companies have mature reporting environments and multi-terabyte data warehouses. However, we’re not talking about training someone to be an expert on 1000 data attributes contained within your data warehouse; we’re talking about ensuring someone’s ability to use an initial set of reports or a new tool without requiring 1-on-1 training. It’s important to realize that the folks with the greatest need for support and data knowledge are the newbies, not the experienced folks.
There are lots of options for imparting data knowledge to business users: a hands-on data workshop, a set of screen videos showing data usage examples, or a simple set of web pages containing definitions, textual descriptions, and screen shots. Don’t get wrapped up in the complexities of creating the perfect solution – keep it simple. I worked with a client that deployed their information using a set of pages constructed with PowerPoint that folks could reference on the company’s intranet. If your users have nothing – don’t worry about the perfect solution – give them something to start with that’s easy to use.
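As one way to keep it simple, the fundamentals listed earlier (subject area, business term and definition, data attribute, values, business rules, and origin) could be captured as one small record per business term and rendered as plain text for a reference page. The sketch below is only an illustration under assumed names: the `DataTermEntry` class, its fields, and the sample term, column, and rule are all hypothetical and not taken from any particular tool or client.

```python
from dataclasses import dataclass, field


@dataclass
class DataTermEntry:
    """One reference entry for a business term. All field names are illustrative."""
    subject_area: str
    business_term: str
    definition: str
    data_attribute: str                          # table.column the term maps to
    values: dict = field(default_factory=dict)   # stored value -> business meaning
    business_rule: str = ""
    origin: str = ""                             # business process and source system

    def to_text(self) -> str:
        """Render the entry as plain text suitable for a simple reference page."""
        lines = [
            f"{self.business_term} ({self.subject_area})",
            f"  Definition: {self.definition}",
            f"  Attribute:  {self.data_attribute}",
        ]
        for stored, meaning in self.values.items():
            lines.append(f"  Value {stored!r} means {meaning}")
        if self.business_rule:
            lines.append(f"  Rule:       {self.business_rule}")
        if self.origin:
            lines.append(f"  Origin:     {self.origin}")
        return "\n".join(lines)


# Hypothetical example entry; the term, column, and systems are made up.
entry = DataTermEntry(
    subject_area="Revenue",
    business_term="Net Revenue",
    definition="Billed amount after discounts and credits.",
    data_attribute="invoice.net_amt",
    values={"0": "fully credited invoice"},
    business_rule="net_amt = gross_amt - discount_amt - credit_amt",
    origin="Monthly billing run in the billing system",
)
print(entry.to_text())
```

A handful of entries like this, covering only the attributes behind an initial set of reports, is usually enough to start with; the structure can grow later if users ask for more.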
Remember that the goal is to build users’ data knowledge that is sufficient to get them to adopt and use the company’s analysis tools. We’re not attempting to convert everyone into data scientists; we just want them to use the tools without requiring 1-on-1 training to explain every report or data element.
Photo courtesy of NASA. NASA Ames Research Center engineer H. Julian “Harvey” Allen illustrating data knowledge (relating to capsule design for the Mercury program).
I always find it interesting when people pile onto the company’s latest and most popular project or initiative. People love to gravitate to whatever is new and sexy within the company, regardless of what they’re working on or their current responsibilities. There never seems to be a shortage of the “bright shiny object” syndrome – you know, organizational ADHD. This desire to jump on the bandwagon often positions individuals with limited experience to own and drive activities they don’t fully understand. The world of data governance is rife with supporters and promoters that are thrilled to be involved, but a bit unprepared to participate and execute. It’s like loading a gun and pulling the trigger before aiming – you’ll make a lot of noise and likely miss the target. If only folks spent a bit of time educating others about the meaning and purpose of data governance before they got started.
Let me first offer up some definitions from a few reputable sources…
“Data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise” (Wikipedia)
“The process by which an organization formalizes the ‘fiduciary duty’ for the management of data assets” (Forrester Research)
“…the overall management of the availability, usability, integrity, and security of the data employed in an enterprise” (TechTarget)
For those of you that have experience with data governance, the above definitions are unlikely to be much of a surprise. For the other 99%, there’s likely to be some head scratching. I actually think most folks that haven’t been indoctrinated to the religion of data have just assumed that data governance is simply a new incarnation of yesterday’s data quality or metadata discussion. That probably shouldn’t be much of a surprise; the discussion of data inaccuracy and data dictionaries has gotten so much air time over the past 30 years, the typical business user probably feels brainwashed when they hear anything with “data” in the title. I actually think that Data Governance may win the prize for being among the most misunderstood concepts within Information Technology.
Data governance is a very simple concept. Data Governance is about establishing the processes for accessing and sharing data and resolving conflict when the processes don’t work.
A Data Governance initiative is really about instilling the concept of managing data as a corporate asset. Companies have standard methods and processes for asset management: your Procurement group has a slew of rules and processes to support the purchasing of office supplies; the HR organization has rules and guidelines for hiring and managing staff; and the finance organization follows “generally accepted accounting principles” to handle managing the company’s fixed and financial assets. Unfortunately, what we don’t have is a set of generally accepted principles for data. This is what data governance establishes.
The reason that you see the term process in nearly every definition of data governance is that until you establish and standardize data related processes, you’ll never get any of the work done. Getting started with data governance isn’t about establishing a committee – it’s about identifying the goals and identifying the policies and processes that will direct the work activities. You can’t be successful in managing an asset if everyone has their own rules and methods for accessing, manipulating, and using the asset. This isn’t rocket science – geez – the world of ERP implementations and even business reengineering projects learned this concept more than 10 years ago.
The reason to manage data as a corporate asset is to ensure that business activities that require data are able to use and access data in a simple, uniform, consistent manner. Unfortunately, in the era of search engines, content indexing, data warehouses, and the Cloud, finding and acquiring data to support a new business need can be painful, time consuming, and expensive. Everyone has their own terms, their own private data stash, and their own rules dictating who is and isn’t allowed to access data. This isn’t corporate asset management – this is corporate asset chaos. A data governance initiative is one of the best ways to get started in managing data as a corporate asset.
Back when I was applying to college, I’d read over college catalogs. Inevitably, each university would mention the number of books it had in its library. When I finally went to college, I realized that this metric was fairly meaningless. A dozen volumes on Grecian pottery did me no good when I was in search of a book on polymers for my mechanical engineering class.
Clients will often ask us to scope a “data inventory” project, inevitably focused on identifying and describing all the data elements contained across their different application systems. Recently a new CIO asked us to head up a “tiger team” to inventory his company’s data. He was surprised at the quantity of information needs that had been sent his way. As expected, he inquired about systems of record and data dictionaries. As you can imagine, he received multiple and conflicting answers which only exacerbated his confusion.
As a point of reference, well-known ERP systems can have in excess of 50,000 discrete data elements in their databases (never mind that some aren’t in English). As I’ve written in the past, many of these data elements have no use outside of the application itself.
Having terabyte upon terabyte of information is equally irrelevant if that data is unrelated to current business issues. The problem with a data inventory activity is that identifying and counting data elements in different systems and applications won’t necessarily solve any problems. Why? Because data across applications and packages is inconsistent: there are different names, definitions, and values, and there is no practical means of determining which data they actually have in common. This is like going to the hardware store and looking for a specific screw, but all the different screws are in one big barrel, so you end up having to pick through each screw, one at a time. When you find the screw, you just throw all the other screws back into the barrel.
The point of a data inventory isn’t to pick through data because it exists, but to inventory the data people actually need. If you’re going to undertake a data inventory, your output should be structured so that the next person doesn’t have to repeat your work. Identify the data that is moving across various systems, as this indicates key information that’s being shared. Categorize this data by subject area. You’ll inevitably find that there are inconsistent versions of the data, enabling you to identify data disparities. You can then begin to develop a catalog of key corporate data that will form the basis of your data dictionary.
Inventorying the data that moves between systems accomplishes two things: it identifies the most valuable data elements in use, and it will also help identify data that’s not high-value, as it’s not being shared or used. This approach also provides a way to tackle initial data quality efforts by identifying the most “active” data used by the business. It ultimately helps the data management team understand where to focus its efforts, and prioritize accordingly.
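To make the approach concrete, here is a minimal sketch of inventorying only the data that moves between systems. It assumes you can list each interface as a (source system, target system, data element, subject area) record; the system names, element names, and the two helper functions are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical interface records:
# (source system, target system, data element, subject area)
feeds = [
    ("billing", "warehouse", "customer_id", "customer"),
    ("crm", "warehouse", "customer_id", "customer"),
    ("crm", "billing", "customer_name", "customer"),
    ("billing", "warehouse", "invoice_amount", "revenue"),
]


def inventory_shared_elements(feed_records):
    """Catalog shared data elements by subject area, with the systems that touch each."""
    catalog = defaultdict(lambda: defaultdict(set))
    for source, target, element, subject_area in feed_records:
        catalog[subject_area][element].update((source, target))
    return {
        area: {element: sorted(systems) for element, systems in elements.items()}
        for area, elements in catalog.items()
    }


def rank_by_activity(catalog):
    """Return (element, system count) pairs, most widely shared elements first."""
    ranked = [
        (element, len(systems))
        for elements in catalog.values()
        for element, systems in elements.items()
    ]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


catalog = inventory_shared_elements(feeds)
for element, system_count in rank_by_activity(catalog):
    print(f"{element}: shared by {system_count} systems")
```

Elements that never appear in any feed simply don't show up in the catalog, which is the point: the output is a short, prioritized list rather than a count of everything that exists.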
So next time someone suggests a data inventory without context or objectives, consider sending them to college to study Grecian urns.
I’ve been making the point in the past several years that master data management (MDM) development projects are different, and are accompanied by unique challenges. Because of the “newness” of MDM and its unique value proposition, MDM development can challenge traditional IT development assumptions.
MDM is very much a transactional processing system; it receives application requests, processes them, and returns a result. The complexities of transaction management, near real-time processing, and the details associated with security, logging, and application interfaces are a handful. Most OLTP applications assume that the provided data is usable; if the data is unacceptable, the application simply returns an error. Most OLTP developers are accustomed to addressing these types of functional requirements. Dealing with imperfect data has traditionally been unacceptable because it slowed down processing; ignoring it or returning an error was a best practice.
What’s different about MDM development is the focus on data content (and value-based) processing. The whole purpose of MDM is to deal with all data, including the unacceptable stuff. It assumes that the data is good enough. MDM code assumes the data is complex and “unacceptable” and focuses on figuring out the values. The development methods associated with deciphering, interpreting, or decoding unacceptable data to make it usable are very different. They require a deep understanding of a different type of business rule – those associated with data content. Because most business processes have data inputs and data outputs, there can be dozens of data content rules associated with each business process. Traditionally, OLTP developers didn’t focus on the business content rules; they were focused on automating business processes.
MDM developers need to be comfortable with addressing the various data content processing issues (identification, matching, survivorship, etc.) along with the well understood issues of OLTP development (transaction management, high performance, etc.). We’ve learned that the best MDM development environments invest heavily in data analysis and data management during the initial design and development stages. They invest in profiling and analyzing each system of creation. They also differentiate hub development from source on-boarding and hub administration. The team that focuses on application interfaces, CRUD processing, and transaction & bulk processing requires different skills from those developers focused on match processing rules, application on-boarding, and hub administration. The developers focused on hub construction are different than those team members focused on the data changes and value questions coming from data stewards and application developers. This isn’t about differentiating development from maintenance; this is about differentiating the skills associated with the various development activities.
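For readers unfamiliar with data content rules, the sketch below shows what matching and survivorship can look like in their simplest form. It is an assumption-laden illustration, not how any MDM product works: the records are made up, the match rule (a normalized email address as identity) is deliberately naive, and the survivorship rule (most recently updated non-empty value wins) is just one of many possible policies.

```python
from datetime import date

# Hypothetical customer records from two source systems.
records = [
    {"source": "crm", "name": " Jane  Doe ", "email": "JANE.DOE@EXAMPLE.COM",
     "phone": "", "updated": date(2023, 3, 1)},
    {"source": "billing", "name": "Jane Doe", "email": "jane.doe@example.com",
     "phone": "555-0100", "updated": date(2022, 11, 15)},
    {"source": "billing", "name": "John Roe", "email": "john.roe@example.com",
     "phone": "555-0199", "updated": date(2023, 1, 10)},
]


def match_key(record):
    """Matching: a naive rule that treats a trimmed, lowercased email as identity."""
    return record["email"].strip().lower()


def survive(group):
    """Survivorship: per attribute, keep the most recently updated non-empty value."""
    newest_first = sorted(group, key=lambda r: r["updated"], reverse=True)
    master = {}
    for attr in ("name", "email", "phone"):
        # Collapse stray whitespace and skip empty values instead of rejecting the record.
        candidates = [" ".join(r[attr].split()) for r in newest_first if r[attr].strip()]
        master[attr] = candidates[0] if candidates else None
    if master["email"]:
        master["email"] = master["email"].lower()
    master["sources"] = sorted({r["source"] for r in group})
    return master


def build_master_records(source_records):
    """Group source records by match key, then build one master record per group."""
    groups = {}
    for record in source_records:
        groups.setdefault(match_key(record), []).append(record)
    return {key: survive(group) for key, group in groups.items()}


masters = build_master_records(records)
for key, master in masters.items():
    print(key, master)
```

Note the contrast with typical OLTP handling: the record with the blank phone number and messy name is not rejected with an error; its usable values are combined with values from the other source.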
If the MDM team does its job right it can dramatically reduce the data errors that cause application processing and reporting problems. They can identify and quantify data problems so that other development teams can recognize them, too. This is why MDM development is critical to creating the single version of truth.
Image via cafepress.com.
photo by BotheredByBees
At Baseline Consulting we've been talking for several years about the concept of a data supply chain. But IT executives are only now starting to catch on to its importance.
Over the past 15 years there has been a big push to standardize on off-the-shelf software. This allowed IT organizations to buy instead of build. We've migrated from proprietary architectures to Windows and Linux standards. We've gone from custom-built applications to packaged CRM and ERP applications. IT adopted this approach because its value is automating business processes and supporting analysis – not inventing new technologies. The problem is that moving data between all of these "packaged systems" still requires custom code.
There's no question that middleware provides value: it delivers the pre-built data pipes. Unfortunately, these are toolkits requiring developers to write code to connect their packages to the pipes. Most CIOs are blissfully unaware of the amount of custom coding middleware requires. Trust me: IT spends an enormous amount of money on supporting such data migration solutions. Many IT shops still view middleware as sacred ground.
The data warehousing world has enthusiastically adopted ETL tools to reduce custom coding so they can focus on the issues of data accuracy and usability. One fact lost in translation is that ETL integrates data – it's more than just a pipe. The application world has adopted EAI, ESB, and orchestration to move data more quickly. However, there's no integration. Each application is responsible for integrating the data it receives.
So, there's even more custom code. Code to connect an application to the pipes. Code to integrate and clean up the data they receive from the pipes.
Custom code to move data around isn't the answer. Orchestration, message passing, and data movement just create a labyrinth of pipes. There are no economies of scale. The data doesn't get better.
Walmart learned years ago that it was impractical to have a custom (and separate) distribution system for every supplier. They knew the cost benefits of a standard distribution system; this meant they needed to standardize the size of the trailers, the size of the boxes, and the way the boxes were packed and shipped. The benefit of a supply chain is that standardization occurs at the most cost-effective point: the source. Walmart's distribution success was measured by its ability to accept new suppliers and manage more shipments.
Most CIOs don't recognize that they have a data supply chain. Instead of building a custom distribution system for each supplier (each business application), they should be focused on a single data supply chain. Middleware supports the creation of custom distribution solutions, but not the standardization of data. A data supply chain can only be successful if the data is standardized. Otherwise everyone is forced to write custom code to standardize, clean, and integrate the data.