The Power of Data Virtualization


I was participating in a discussion about Data Virtualization (DV) the other day and was intrigued by the different views everyone had about a technology that's been around for more than 10 years. For those of you who don't participate in IT-centric, geekfest discussions on a regular basis, Data Virtualization software is middleware that allows various disparate data sources to look like a single relational database. Some folks characterize Data Virtualization as a software abstraction layer that removes the storage location and format complexities associated with manipulating data. The bottom line is that Data Virtualization software can make a BI (or any SQL) tool see data as though it's contained within a single database even though it may be spread across multiple databases, XML files, and even Hadoop systems.

What intrigued me about the conversation is that most of the folks had been introduced to Data Virtualization not as an infrastructure tool that simplifies specific disparate data problems, but as the secret sauce or silver bullet for a specific application. They had all inherited an application that had been built outside of IT to address a business problem that required data to be integrated from a multitude of sources. And in each instance, the applications were able to capitalize on Data Virtualization as a more cost-effective solution for integrating detailed data. Instead of building a new platform to store and process another copy of the data, they used Data Virtualization software to query and integrate data from the individual source systems. And each "solution" utilized a different combination of functions and capabilities.

As with any technology discussion, there's always someone who believes that their favorite technology is the best thing since sliced bread – and they want to apply their solution to every problem. Data Virtualization is an incredibly powerful technology with a broad array of functions that enable multi-source query processing. Given its relative obscurity as a data management technology, I thought I'd review some of the more basic capabilities it supports.

Multi-Source Query Processing. Often referred to as Query Federation: the ability to have a single query process data across multiple data stores.

Simplify Data Access and Navigation. Exposes data as a single (virtual) data source drawn from numerous component sources. The DV system handles the various network, SQL dialect, and/or data conversion issues.

Integrate Data “On the Fly”.  This is referred to as Data Federation. The DV server retrieves and integrates source data to support each individual query. 

Access to Non-Relational Data. The DV server is able to portray non-relational data (e.g. XML data, flat files, Hadoop, etc.) as structured, relational tables.  

Standardize and Transform Data. Once the data is retrieved from the origin, the DV server will convert the data (if necessary) into a format to support matching and integration.

Integrate Relational and Non-Relational Data. Because DV can make any data source (well, almost any) look like a relational table, this capability is implicit. Keep in mind that the data (or a subset of it) must have some sort of inherent structure.

Expose a Data Services Interface. Exposes a web service attached to a predefined query that the DV server processes on demand.

Govern Ad Hoc Queries. The DV Server can monitor query submissions, run time, and even complexity – and terminate or prevent processing under specific rule-based situations.

Improve Data Security.  As a common point of access, the DV Server can support another level of data access security to address the likely inconsistencies that exist across multiple data store environments.
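To make a couple of these capabilities concrete, here's a minimal sketch of query federation and on-the-fly integration. It's illustrative only: Python with SQLite and a CSV standing in for real source systems, and pandas standing in for the DV server's integration engine; the table and column names are invented for the example.

```python
# Toy illustration of query federation / data federation (not a real DV product).
# Source 1: a relational database (SQLite standing in for the CRM system).
# Source 2: a flat file (a CSV standing in for a billing extract).
import io
import sqlite3
import pandas as pd

# --- Source 1: relational data ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme Corp"), (2, "Globex")])
conn.commit()
customers = pd.read_sql_query("SELECT customer_id, name FROM customers", conn)

# --- Source 2: a flat file exposed as a relational table ---
billing_csv = io.StringIO("customer_id,amount\n1,100.50\n1,25.00\n2,310.75\n")
billing = pd.read_csv(billing_csv)

# --- "Federation": integrate the two sources on the fly, per query ---
result = (customers.merge(billing, on="customer_id")   # join across sources
                   .groupby("name", as_index=False)["amount"].sum())
print(result)  # one result set, even though the data never lived in one database
```

The point isn't the tooling; it's that the requesting query sees one relational answer while the data stays in its original locations and formats.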

As many folks have learned, Data Virtualization is not a substitute for a data warehouse or a data mart. In order for a DV Server to process data, the data must be retrieved from the origin; consequently, running a query that joins tables spread across multiple systems containing millions of records isn't practical. An Ethernet network is no substitute for the high-speed interconnect linking a computer's processor and memory to online storage. However, when the data is spread across multiple systems and there's no other query alternative, Data Virtualization is certainly worth investigating.

Data Scientist Team: Q & A


I presented a webinar a few weeks back that challenged the popular opinion that the only way to be successful with data science is to hire an individual who has a Swiss Army knife of data skills and business acumen. (The archived webinar link is http://goo.gl/Ka1H2I )

While I can't argue with the value of such abilities, my belief is that these types of individuals are very rare, and the benefits of data science are something every company can realize. Consequently, my belief is that you can approach data science successfully by building a team of focused staff members, provided they cover 5 role areas: Data Services, Data Engineer, Data Manager, Production Development, and the Data Scientist.

I received quite a few questions during and after the August 12th webinar,  so I thought I would devote this week’s blog to those questions (and answers).  As is always the case with a blog, feel free to comment, respond, or disagree – I’ll gladly post the feedback below.

Q: In terms of benefits and costs, do you have any words of wisdom in building a business case that can be taken to business leadership for funding?

A: Business case constructs vary by company. What I encourage folks to focus on is the opportunity value in supporting a new initiative. Justifying an initial data science initiative shouldn't be difficult if your company already supports individuals analyzing data on their desktops. We often find that collecting the existing investment numbers and the results of your advanced analytics team (SAS, R, SPSS, etc.) justifies delving into the world of Data Science.

Q: One problem is that many business leaders do not have a concept of what goes into a scientific discovery process. They are not schooled as scientists.

A: You're absolutely correct. Most managers are focused on establishing business process, measuring progress, and delivering results. Discovery and exploration aren't always predictable processes. We often find that initial Data Science initiatives are more likely to be successful if the environment has already embraced the value of reporting and advanced analytics (numerical analysis, data mining, prediction, etc.). If your organization hasn't fully adopted business intelligence and desktop analysis, you may not be ready for Data Science. If your organization already understands the value of detailed data and analysis, you might want to begin with a more focused analytic effort (e.g. identifying trend indicators, predictive details, or other modeling activities). We've seen data science deliver significant business value, but it also requires a manager that understands the complexities and issues of data exploration and discovery.

Q: One of the challenges that we've seen in my company is the desire to force-fit Data Science into a traditional IT waterfall development method instead of realizing the benefits of taking a more iterative or agile approach. Is there danger in this approach?

A: We find that when organizations already have an existing (and robust) business intelligence and analytics environment, there's a tendency to follow the tried and true practices of defined requirements, documented project plans, managed development, and scheduled delivery. One thing to keep in mind is that the whole premise of Data Science is analyzing data to uncover new patterns or knowledge. When you first undertake a Data Science initiative, it's about exploration and discovery, not structured deliverables. It's reasonable to spin up a project team (preferably using an iterative or agile methodology) once the discovery has been identified and there's tangible business value in building and deploying a solution using the discovery. However, it's important to allow the discovery to happen first.

You might consider reading an article from DJ Patil ("Building Data Science Teams") that discusses the importance of the Production Development role I mentioned. This is the role that takes on the creation of a production deliverable from the raw artifacts and discoveries made by the Data Science team.

Q: It seems like your Data Engineer has a similar role and responsibility set as a Data Warehouse architect or ETL developer.

A: The Data Engineers are a hybrid of sorts. They handle all of the data transformation and integration activities, and they are also deeply knowledgeable about the underlying data sources and their content. We often find that the Data Warehouse Architect and ETL Developer are very knowledgeable about the data structures of source and target systems, but they aren't typically knowledgeable about social media content, external sources, unstructured data, and the lower-level details of specific data attributes. Obviously, these skills vary from organization to organization. If the individuals in your organization have this level of knowledge, they may be able to cover the activities associated with a Data Engineer.

Q: What is the difference between the Data Engineers and Data Management team members?

A: Data Engineers focus on retrieving and manipulating data from the various data stores (external and internal). They deal with data transformation, correction, and integration. The Data Management folks support the Data Engineers (thus the skill overlap) but focus more on managing and tracking the actual data assets that are going to be used by data scientists and other analysts within the company (understanding the content, the values, the formats, and the idiosyncrasies).

Q: Isn't there a risk in building a team of folks with specialized skills (instead of having individuals with a broad set of knowledge)? With specialists, don't we risk freezing the current state of the art, making the organization inflexible to change? Doesn't it also reduce everyone's overall understanding of the goal (e.g. the technicians focus on their tools' functions, not the actual results they're expected to deliver)?

A: While I see your perspective, I’d suggest a slightly different view.  The premise of defining the various roles is to identify the work activities (and skills) necessary to complete a body of work.  Each role should still evolve with skill growth — to ensure individuals can handle more and more complex activities.   There will continue to be enormous growth and evolution in the world of Data Science in the variety of external data sources, number of data interfaces, and the variety of data integration tools.   Establishing different roles ensures there’s an awareness of the breadth of skills required to complete the body of work.  It’s entirely reasonable for an individual to cover multiple roles; however, as the workload increases, it’s very likely that specialization will be necessary to support the added work effort.   Henry Ford used the assembly line to revolutionize manufacturing.  He was able to utilize less skilled workers to handle the less sophisticated tasks so he could ensure his craftsmen continued to focus on more and more specialized (and complex) activities.  Data integration and management activities support (and enable) Data Science.  Specialization should be focused on the less complex (and more easily staffed) roles that will free up the Data Scientist’s time to allow them to focus on their core strengths.

Q: Is this intended to be an enterprise-wide team?

A: We've seen Data Science teams positioned as an organizational resource (e.g. dedicated to supporting marketing analytics); we've also seen teams set up as an enterprise resource. The decision is typically driven by the culture and needs of your company.

Q: Where is the business orientation in the data team? Do you need someone who knows what questions to ask and can then take all of the data and distill it down to insights that a CEO can implement?

A: The "business orientation" usually resides with the Data Scientist role. The Data Science team isn't typically set up to respond to business user requests (like a traditional BI team); they are usually driven by the Data Scientist, who understands and is tasked with addressing the priority needs of the company. The Data Scientist doesn't work in a vacuum; they have to interact with key business stakeholders on a regular basis. However, Data Science shouldn't be structured like a traditional applications development team either. The team is focused on discovery and exploration – not core IT development. Take a look at one of the more popular articles on the topic, "Data Scientist: The Sexiest Job of the 21st Century" by Tom Davenport and DJ Patil: http://goo.gl/CmCtv9

Photo courtesy of National Archive via Flickr (Creative Commons license).

 

The Data Scientist Team


I've been intrigued by all of the attention that the world of Data Science has received. It seems that every popular business magazine has published several articles, and it's become a mainstream topic at most industry conferences. One of the things that struck me as odd is that there's a group of folks who actually believe that all of the activities necessary to deliver new business discoveries with data science can be reasonably addressed by finding individuals that have a cornucopia of technical and business skills. One popular belief is that a Data Scientist should be able to address all of the business and technical activities necessary to identify, qualify, prove, and explain a business idea with detailed data.

If you can find individuals who comprehend the peculiarities of source data extraction, have mastered data integration techniques, understand parallel algorithms to process tens of billions of records, have worked with specialized data preparation tools, and can debate your company's business strategy and priorities – Cool! Hire these folks and chain their leg to the desk as soon as possible.

If you can’t, you might consider building a team that can cover the various roles that are necessary to support a Data Science initiative. There’s a lot more to Data Science than simply processing a pile of data with the latest open source framework.  The roles that you should consider include:

Data Services

Manages the various data repositories that feed data to the analytics effort.  This includes understanding the schemas, tracking the data content, and making sure the platforms are maintained. Companies with existing data warehouses, data marts, or reporting systems typically have a group of folks focused on these activities (DBAs, administrators, etc.).

Data Engineer

Responsible for developing and implementing tools to gather, move, process, and manage data. In most analytics environments, these activities are handled by the data integration team.  In the world of Big Data or Data Science, this isn’t just ETL development for batch files; it also includes processing data streams and handling the cleansing and standardization of numerous structured and unstructured data sources.

Data Manager

Handles the traditional data management or source data stewardship role; the focus is supporting developers' access to and manipulation of data content. This includes tracking the available data sources (internal and external), understanding the location and underlying details of specific attributes, and supporting developers' code construction efforts.

Production Development

Responsible for packaging the Data Scientist discoveries into a production ready deliverable. This may include (one or) many components: new data attributes, new algorithms, a new data processing method, or an entirely new end-user tool. The goal is to ensure that the discoveries deliver business value.

Data Scientist

The team leader and the individual that excels at analyzing data to help a business gain a competitive edge. They are adept at technical activities and equally qualified to lead a business discussion as to the benefits of a new business strategy or approach. They can tackle all aspects of a problem and often lead the interdisciplinary team to construct an analytics solution.

There’s no shortage of success stories about the amazing data discoveries uncovered by Data Scientists.  In many of those companies, the Data Scientist didn’t have an incumbent data warehousing or analytics environment; they couldn’t pick up the phone to call a data architect, there wasn’t any metadata documentation, and their company didn’t have a standard set of data management tools.  They were on their own.  So, the Data Scientist became “chief cook and bottle washer” for everything that is big data and analytics.

Most companies today have institutionalized data analysis; there are multiple data warehouses, lots of dashboards, and even a query support desk.  And while there’s a big difference between desktop reporting and processing social media feedback, much of the “behind the scenes” data management and data integration work is the same.  If your company already has an incumbent data and analytics environment, it makes sense to leverage existing methods, practices, and staff skills.  Let the Data Scientists focus on identifying the next big idea and the heavy analytics; let the rest of the team deal with all of the other work.

The Misunderstanding of Master Data Management


Not long ago, I was asked to review a client’s program initiative that was focused on constructing a new customer repository that would establish a single version of truth.  The client was very excited about using Master Data Management (MDM) to deliver their new customer view.  The problem statement was well thought out: their customer data is spread across 11 different systems; users and developers retrieve data from different sources; reports reflect conflicting details; and an enormous amount of manual effort is required to manage the data.  The project’s benefits were also well thought out:  increased data quality, improved reporting accuracy, and improved end user data access.    And, (as you can probably imagine), the crowning objective of the project was going to be creating a Single View of the Customer.  The program’s stakeholders had done a good job of communicating the details:  they reviewed the existing business challenges, identified the goals and objectives, and even provided a summary of high-level requirements.  They were going to house all of their customer data on an MDM hub.  There was only one problem:  they needed a customer data mart, not an MDM hub.  

I hate the idea of discussing technical terms and details with either business or IT staff. It gets particularly uncomfortable when someone was misinformed about a new technology (and this happens all the time when vendors roll out new products to their sales force). I can't count the number of times that I've seen projects implemented with the wrong technology because the organization wanted to get a copy of the latest and greatest technical toy. A few of my colleagues and I used to call this the "bright shiny project syndrome". While it's perfectly acceptable to acquire a new technology to solve a problem, it can be very expensive to purchase a technology and force-fit it onto a problem it doesn't easily address.

Folks frequently confuse the function and purpose of Master Data Management with Data Warehousing. I suspect the core of the problem is that when folks hear about the idea of "reference data" or a "golden record", they have a mental picture of a single platform containing all of the data. While I can't argue with the benefit of having all the data in one place (data warehousing has been around for more than 20 years), that's not what MDM is about. Data Warehousing became popular because of its success in storing a company's historical data to support cross-functional (multi-subject area) analysis. MDM is different; it's focused on reconciling and tracking a single subject area's reference data across the multitude of systems that create that data. Some examples of a subject area include customer, product, and location.

If you look at the single biggest obstacle in data integration, it's dealing with all of the complexity of merging data from different systems. It's fairly common for different application systems to use different reference data (the CRM system, the Sales system, and the Billing system each use different values to identify a single customer). The only way to link data from these different systems is to compare the reference data (names, addresses, phone numbers, etc.) from each system with the hope that there are enough identical values in each to support the match. The problem with this approach is that it simply doesn't work when a single individual may have multiple name variations, multiple addresses, and multiple phone numbers. The only reasonable solution is the use of advanced algorithms that are specially designed to support the processing and matching of specific subject area details. That's the secret sauce of MDM – and that's what's contained within a commercial MDM product.
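To give a rough sense of what that matching work involves, here's a minimal Python sketch: normalize the reference data, then fuzzy-compare it. It uses only the standard library's difflib; the record fields, weights, and threshold are invented for the example, and commercial MDM matching engines are far more sophisticated than this.

```python
# Toy record matching: normalize reference data, then fuzzy-compare it.
# Illustration only; real MDM matching algorithms are far more sophisticated.
from difflib import SequenceMatcher
import re

def normalize(record):
    """Lowercase the name, strip punctuation, and keep only digits for the phone."""
    return {
        "name": re.sub(r"[^a-z ]", "", record["name"].lower()).strip(),
        "phone": re.sub(r"\D", "", record["phone"]),
    }

def match_score(a, b):
    """Blend a fuzzy name comparison with an exact (normalized) phone comparison."""
    a, b = normalize(a), normalize(b)
    name_score = SequenceMatcher(None, a["name"], b["name"]).ratio()
    phone_score = 1.0 if a["phone"] and a["phone"] == b["phone"] else 0.0
    return 0.7 * name_score + 0.3 * phone_score

crm_record = {"name": "Robert Smith", "phone": "(512) 555-0143"}
billing_record = {"name": "ROB SMITH", "phone": "512-555-0143"}

print(f"match score: {match_score(crm_record, billing_record):.2f}")
# A tuning threshold (say 0.80) would decide whether these two records
# link to the same master customer despite the formatting differences.
```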

The MDM hub not only contains master records (the details identifying each individual subject area entry), it also contains a cross-reference list of each individual subject area entry along with the linkage details to every other application system. And it's continually updated as the values change within each individual system. The idea is that an MDM hub is a high-performance, transactional system focused on matching and reconciling subject area reference data. While we've illustrated how this capability simplifies data warehouse development, it also enables individual application systems to move and integrate data between transactional systems more efficiently.
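For a concrete picture of "master record plus cross-reference," here's a hypothetical sketch of the hub's core structures. The system names, identifiers, and fields are all invented for illustration; a real hub stores these in transactional tables and keeps them synchronized as the source systems change.

```python
# Hypothetical shape of an MDM hub's core content (illustration only).
# One master record per real-world customer, plus a cross-reference that maps
# the master ID to each application system's local identifier.
master_records = {
    "MDM-0001": {"name": "Robert Smith", "address": "100 Main St, Austin, TX"},
}

cross_reference = {
    "MDM-0001": {                 # the "golden" customer entry
        "CRM":     "CUST-88231",  # local key in the CRM system
        "SALES":   "A-4417",      # local key in the Sales system
        "BILLING": "0009912",     # local key in the Billing system
    },
}

def lookup(system, local_id):
    """Resolve any system's local ID to the master record it links to."""
    for master_id, links in cross_reference.items():
        if links.get(system) == local_id:
            return master_id, master_records[master_id]
    return None

print(lookup("BILLING", "0009912"))
# As the source systems change (new addresses, merged accounts), the hub updates
# both the master record and the cross-reference so every system stays linked.
```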

The enormous breadth and depth of corporate data makes it impractical to store all of our data within a single system.  It’s become common practice to prune and trim the contents of our data warehouses to limit the breadth and history of data.   If you consider recent advances with big data, cloud computing, and SaaS, it becomes even more apparent that storing all of a company’s subject area data in a single place isn’t practical.  That’s one of the reasons that most companies have numerous data marts and operational applications integrating and loading their own data to support their highly diverse and unique business needs.  An MDM hub is focused on tracking specific subject area details across multiple systems to allow anyone to find, gather, and integrate the data they need from any system.

I recently crossed paths with the above-mentioned client. Their project was wildly successful – they ended up deploying both an MDM hub and a customer data mart to address their needs. They mentioned that one of the "aha" moments that occurred during our conversation was when they realized that they needed to refocus everyone's attention towards the business value and benefits of the project instead of the details and functions of MDM. While I was thrilled with their program's success, I was even more excited to learn that someone was finally able to compete against the "bright shiny project syndrome" and win.

Photo “Dirt Pile 2” courtesy of CoolValley via Flickr (Creative Commons license).

Role of an Executive Sponsor

It's fairly common for companies to assign Executive Sponsors to their large projects. "Large" typically reflects budget size, the inclusion of cross-functional teams, business impact, and complexity. The Executive Sponsor isn't the person running and directing the project on a day-to-day basis; they provide oversight and direction. They monitor project progress and ensure that tactics are carried out to support the project's goals and objectives. They have the credibility (and authority) to ensure that the appropriate level of attention and resources are available to the project throughout its entire life.

While there's nearly universal agreement on the importance of an Executive Sponsor, there seems to be limited discussion about the specifics of the role. Most remarks seem to dwell on the importance of breaking down barriers, dealing with roadblocks, and effectively reacting to project obstacles. While these details make for good PowerPoint presentations, project success really requires the sponsor to exhibit a combination of skills beyond negotiation and problem resolution. Here's my take on some of the key responsibilities of an Executive Sponsor.

Inspire the Stakeholder Audience

Most executives are exceptional managers who understand the importance of dates and budgets and are successful at leading their staff members towards a common goal. Because project sponsors don't typically have direct management authority over the project team, the methods for leadership are different. The sponsor has to communicate, captivate, and engage with the team members throughout all phases of the project. And it's important to remember that the stakeholders aren't just the individual developers, but the users and their management. In a world where individuals have to juggle multiple priorities and projects, one sure-fire way to maintain enthusiasm (and participation) is to maintain a high level of sponsor engagement.

Understand the Project’s Benefits

Because of the compartmentalized structure of most organizations, many executives aren’t aware of the details of their peer organizations. Enterprise-level projects enlist an Executive Sponsor to ensure that the project respects (and delivers) benefits to all stakeholders. It’s fairly common that any significantly sized project will undergo scope change (due to budget challenges, business risks, or execution problems).  Any change will likely affect the project’s deliverables as well as the perceived benefits to the different stakeholders. Detailed knowledge of project benefits is crucial to ensure that any change doesn’t adversely affect the benefits required by the stakeholders.

Know the Project’s Details

Most executives focus on the high-level details of their organization's projects and delegate the specifics to the individual project manager. When projects cross organizational boundaries, the executive's tactics have to change because of the organizational breadth of the stakeholder community. Executive-level discussions will likely cover a variety of issues (both high-level and detailed). It's important for the Executive Sponsor to be able to discuss the brass tacks with other executives; the lack of this knowledge undermines the sponsor's credibility and the project's ability to succeed.

Hold All Stakeholders Accountable

While most projects begin with everyone aligned towards a common goal and set of tactics, it’s not uncommon for changes to occur. Most problems occur when one or more stakeholders have to adjust their activities because of an external force (new priorities, resource contention, etc.). What’s critical is that all stakeholders participate in resolving the issue; the project team will either succeed together or fail together. The sponsor won’t solve the problem; they will facilitate the process and hold everyone accountable.

Stay Involved, Long Term

The role of the sponsor isn’t limited to supporting the early stages of a project (funding, development, and deployment); it continues throughout the life of the project.  Because most applications have a lifespan of no less than 7 years, business changes will drive new business requirements that will drive new development.  The sponsor’s role doesn’t diminish with time – it typically expands.

The overall responsibility set of an Executive Sponsor will likely vary across projects. The differences in project scope, company culture, business process, and staff resources across individual projects inevitably affect the role of the Executive Sponsor. What's important is that the Executive Sponsor provides both strategic and tactical support to ensure a project is successful. An Executive Sponsor is more than the project's spokesperson; they're the project CEO, with equity in the project's outcome and a legitimate responsibility for seeing the project through to success.

Photo "American Alligator Crossing the Road at Canaveral National Seashore" courtesy of Photomatt28 (Matthew Paulson) via Flickr (Creative Commons license).

Project Success = Data Usability

One of the challenges in delivering successful data-centric projects (e.g. analytics, BI, or reporting) is realizing that the definition of project success differs from traditional IT application projects.  Success for a traditional application (or operational) project is often described in terms of transaction volumes, functional capabilities, processing conformance, and response time; data project success is often described in terms of business process analysis, decision enablement, or business situation measurement.  To a business user, the success of a data-centric project is simple: data usability.

It seems that most folks respond to data usability issues by gravitating towards a discussion about data accuracy or data quality; I actually think the more appropriate discussion is data knowledge.  I don’t think anyone would argue that to make data-enabled decisions, you need to have knowledge about the underlying data.  The challenge is understanding what level of knowledge is necessary.  If you ask a BI or Data Warehouse person, their answer almost always includes metadata, data lineage, and a data dictionary.  If you ask a data mining person, they often just want specific attributes and their descriptions — they don’t care about anything else.  All of these folks have different views of data usability and varying levels (and needs) for data knowledge.

One way to improve data usability is to target and differentiate the user audience based on their data knowledge needs. There are certainly lots of different approaches to categorizing users; in fact, every analyst firm and vendor has their own model to describe different audience segments. One of the problems with these types of models is that they tend to focus heavily on the tools or analytical methods (canned reports, drill down, etc.) and ignore the details of data content and complexity. The knowledge required to manipulate a single subject area (revenue or customer or usage) is significantly less than the skills required to manipulate data across 3 subject areas (revenue, customer, and usage). And what compounds the data knowledge challenge is the inevitable plethora of value gaps, inaccuracies, and inconsistencies associated with the data. Data knowledge isn't just limited to understanding the data; it includes understanding how to work around all of the imperfections.

Here's a model that categorizes and describes business users based on their views of data usability and their data knowledge needs.

Level 1: “Can you explain these numbers to me?”

This person is the casual data user. They have access to a zillion reports that have been identified by their predecessors and they focus their effort on acting on the numbers they get. They’re not a data analyst – their focus is to understand the meaning of the details so they can do their job. They assume that the data has been checked, rechecked, and vetted by lots of folks in advance of their receiving the content. They believe the numbers and they act on what they see.

Level 2: “Give me the details”

This person has been using canned reports, understands all the basic details, and has graduated to using data to answer new questions that weren’t identified by their predecessors.  They need detailed data and they want to reorganize the details to suit their specific needs (“I don’t want weekly revenue breakdowns – I want to compare weekday revenue to weekend revenue”).  They realize the data is imperfect (and in most instances, they’ll live with it).  They want the detail.

Level 3: “I don’t believe the data — please fix it”

These folks know their area of the business inside and out, and they know the data. They scour and review the details to diagnose the business problems they're analyzing. And when they find a data mistake or inaccuracy, they aren't shy about raising their hand. Whether they're a data analyst who uses SQL or a statistician with their favorite advanced analytics algorithms, they focus on identifying business anomalies. These folks are the power users that are incredibly valuable and often the most difficult for IT to please.

Level 4: “Give me more data”

This is subject area graduation.  At this point, the user has become self-sufficient with their data and needs more content to address a new or more complex set of business analysis needs. Asking for more data – whether a new source or more detail – indicates that the person has exhausted their options in using the data they have available.  When someone has the capacity to learn a new subject area or take on more detailed content, they’re illustrating a higher level of data knowledge.

One thing to consider about the above model is that a user will have varying data knowledge based on the individual subject area.  A marketing person may be completely self-sufficient on revenue data but be a newbie with usage details.  A customer support person may be an expert on customer data but only have limited knowledge of product data.  You wouldn’t expect many folks (outside of IT) to be experts on all of the existing data subject areas. Their knowledge is going to reflect the breadth of their job responsibilities.

As someone grows and evolves in business expertise and influence, it’s only natural that their business information needs would grow and evolve too.  In order to address data usability (and project success), maybe it makes sense to reconsider the various user audience categories and how they are defined.  Growing data knowledge isn’t about making everyone data gurus; it’s about enabling staff members to become self-sufficient in their use of corporate data to do their jobs.

Photo “Ladder of Knowledge” courtesy of degreezero2000 via Flickr (Creative Commons license).

The Formula for Analytics Success: Data Knowledge


Companies spend a small fortune continually investing and reinvesting in making their business analysts self-sufficient with the latest and greatest analytical tools. Most companies have multiple project teams focused on delivering tools to simplify and improve business decision making. There are likely several standard tools deployed to support the various data analysis functions required across the enterprise: canned/batch reports, desktop ad hoc data analysis, and advanced analytics. There's never a shortage of new and improved tools that guarantee simplified data exploration, quick response time, and greater data visualization options. Projects inevitably include the creation of dozens of prebuilt screens along with a training workshop to ensure that the users understand all of the new whiz-bang features associated with the latest analytic tool incarnation. Unfortunately, the biggest challenge within any project isn't getting users to master the various analytical functions; it's ensuring the users understand the underlying data they're analyzing.

The most prevalent issue with the adoption of a new business analysis tool is the users' knowledge of the underlying data. This issue becomes visible through a number of common problems: the misuse of report data, the misunderstanding of business terminology, and/or the exaggeration of inaccurate data. Once the credibility or usability of the data comes under scrutiny, the project typically goes into "red alert" and requires immediate attention. If ignored, the business tool quickly becomes shelfware because no one is willing to take a chance on making business decisions based on risky information.

All too often the focus of end user training is tool training, not data training. What typically happens is that an analyst is introduced to the company's standard analytics tool through a "drink from a fire hose" training workshop. All of the examples use generic sales or HR data to illustrate the tool's strengths in folding, spindling, and manipulating the data. And this is where the problem begins: the vendor's workshop data is perfect. There's no missing or inaccurate data and all of the data is clearly labeled and defined; classes run smoothly, but it just isn't reality. Somehow the person with no hands-on data experience is supposed to figure out how to use their own (imperfect) data. It's like giving someone their first ski lesson on a cleanly groomed beginner hill and then taking them to the top of a black diamond (advanced) run with steep hills and moguls. The person works hard but isn't equipped to deal with the challenges of the real world. So, they give up on the tool and tell others that the solution isn't usable.

All of the advanced tools and manipulation capabilities don’t do any good if the users don’t understand the data. There are lots of approaches to educating users on data.  Some prefer to take a bottom-up approach (reviewing individual table and column names, meanings, and values) while others want to take a top-down approach (reviewing subject area details, the associated reports, and then getting into the data details).  There are certainly benefits of one approach over the other (depending on your audience); however, it’s important not to lose sight of the ultimate goal: giving the users the fundamental data knowledge they need to make decisions.  The fundamentals that most users need to understand their data include a review of

The above details may seem a bit overwhelming if you consider that most companies have mature reporting environments and multi-terabyte data warehouses.  However, we’re not talking about training someone to be an expert on 1000 data attributes contained within your data warehouse; we’re talking about ensuring someone’s ability to use an initial set of reports or a new tool without requiring 1-on-1 training.  It’s important to realize that the folks with the greatest need for support and data knowledge are the newbies, not the experienced folks.

There are lots of options for imparting data knowledge to business users: a hands-on data workshop, a set of screen videos showing data usage examples, or a simple set of web pages containing definitions, textual descriptions, and screen shots. Don't get wrapped up in the complexities of creating the perfect solution – keep it simple. I worked with a client that deployed their information using a set of pages constructed with PowerPoint that folks could reference on the company's intranet. If your users have nothing, don't worry about the perfect solution – give them something to start with that's easy to use.

Remember that the goal is to build users’ data knowledge that is sufficient to get them to adopt and use the company’s analysis tools.  We’re not attempting to convert everyone into data scientists; we just want them to use the tools without requiring 1-on-1 training to explain every report or data element.

Photo courtesy of NASA. NASA Ames Research Center engineer H. Julian "Harvey" Allen illustrating data knowledge (relating to capsule design for the Mercury program).
