I just read this article by Ethan Knox, “Is Your Company Too Dumb to be Data Driven” and was intrigued to read what many people have discussed for years. I’ve spent nearly half my career helping clients make the transition from running the business by tribal knowledge and gut instinct to running the business by facts and numbers. It’s a hard transition. One that takes vision, motivation, discipline, and courage to change. It also takes a willingness to learn something new.
While this article covers a lot of ground, I wanted to comment on one of the points made in the article: the mistake of “build it and they will come”. This occurs when an organization is enthusiastic about data and decides to build a data warehouse (or data lake) and load it with all the data from the company’s core application systems (sales, finance, operations, etc.). The whole business case depends on the users flocking to the system, using new business intelligence or reporting tools, and uncovering numerous high-value business insights. All too often, the result is a large monolithic data platform that contains lots of content but hasn’t been designed to support analysis or decision making by the masses.
There are numerous problems with this approach – and the path to data and analytics enlightenment is littered with mistakes where companies took this approach. Don’t assume that successful companies that have embraced data and analytics didn’t make this mistake (it’s a very common mistake). Successful companies were those that were willing to learn from their mistakes – and have a culture where new project efforts are carefully scoped to allow mistakes, learning, and evolution. It’s not that they’re brilliant; successful companies understand that transitioning to being a data-driven company requires building knowledge. And, the process of learning takes time, includes mistakes, requires self-analysis, and must be managed and mentored carefully. They design their projects assuming mistakes and surprises occur, so they fail fast and demand continual measurement and corrective action. It’s not about the methodology or development approach. A fail-fast philosophy can work with any type of development methodology (agile, iterative, waterfall). The path to data enlightenment will include lots of mistakes.
Do you remember high school math? When you were presented with a new concept, you were given homework that allowed you to learn, gain experience, and understand the concept through the act of “doing”. Homework was often graded based on effort, not accuracy (if you did it, you got credit, whether or not it was correct). Where is it written that (upon graduation) learning something new wouldn’t require the act of “doing” and making mistakes to gain enlightenment? By the way, who has ever succeeded without making mistakes?
The point the article frequently references is that business engagement is critical. It’s not about the users participating a few times (requirements gathering and user acceptance testing); it’s about users being engaged to review results and participate in the measurement and corrective action. It’s about evolving from a culture where the relationship is customer/provider to a team where everyone succeeds or fails based on business measurement.
It’s not that a company is too dumb to succeed with data; it’s that they’re often too fearful of mistakes to succeed. And in the world of imperfect data, exploding data volumes, frequent technology changes, and a competitive business environment, mistakes are an indication of learning. Failure isn’t a reflection of mistakes, it’s a reflection of poor planning, lack of measurement, and an inability to take corrective action.
The concept of “Production” in the area of Information Technology is well understood. It means something (usually an application or system) is ready to support business processing in a reliable manner. Production environments undergo thorough testing to ensure that there’s minimal likelihood of a circumstance where business activities are affected. The Production label isn’t thrown around recklessly; if a system is characterized as Production, there are lots of business people dependent on those systems to get their job done.
In order to support Production, most IT organizations have devoted resources focused solely on maintaining Production systems to ensure that any problem is addressed quickly. When user applications are characterized as Production, there are special processes (and manpower) in place to address installation, training, setup, and ongoing support. Production systems are business critical to a company.
One of the challenges in the world of data is that most IT organizations view their managed assets as storage, systems, and applications. Data is treated not as an asset, but as a byproduct of an application. Data storage is managed based on application needs (online storage, archival, backup, etc.) and data sharing is handled as a one-off activity. This might have made sense in the 70’s and 80’s when most systems were vendor specific and sharing data was rare; however, in today’s world of analytics and data-driven decision making, data sharing has become a necessity. We know that every time data is created, there are likely 10-12 business activities requiring access to that data.
Data sharing is a production business need.
Unfortunately, data sharing in most companies is handled as a one-off, custom event. Getting a copy of data often requires tribal knowledge, relationships, and a personal request. While there’s no arguing that many companies have data warehouses (or data marts, data lakes, etc.), adding new data to those systems is where I’m focused. Adding new data or integrating 3rd party content into a report takes a long time because data sharing is always an afterthought.
Think I’m exaggerating or incorrect? Ask yourself the following questions…
- Is there a documented list of data sources, their content, and a description of the content at your company?
- Do your source systems generate standard extracts, or do they generate 100s (or 1000s) of nightly files that have been custom built to support data sharing?
- How long does it take to get a copy of data (that isn’t already loaded on the data warehouse)?
- Is there anyone to contact if you want to get a new copy of data?
- Is anyone responsible for ensuring that the data feeds (or extracts) that currently exist are monitored and maintained?
While most IT organizations have focused their code development efforts on reuse, economies-of-scale, and reliability, they haven’t focused their data development efforts in that manner. And one of the most visible challenges is that many IT organizations don’t have a funding model to support data development and data sharing as a separate discipline. They’re focused on building and delivering applications, not building and delivering data. Supporting data sharing as a production business need means adjusting IT responsibilities and priorities to reflect data sharing as a responsibility. This means making sure there are standard extracts (or data interfaces) that everyone can access, data catalogs available containing source system information, and staff resources devoted to sharing and supporting data in a scalable, reliable, and cost-efficient manner. It’s about having an efficient data supply chain to share data within your company. It’s because data sharing is a production business need.
Or, you could continue building everything in a one-off custom manner.
This blog is the final installment in a series focused on reviewing the individual Components of a Data Strategy. This edition discusses the component Govern and the details associated with supporting a Data Governance initiative as part of an overall Data Strategy.
The definition of Govern is:
“Establishing, communicating and monitoring information practices to ensure effective data sharing, usage, and protection”
As you’re likely aware, Data Governance is about establishing (and following) policies, rules, and all of the associated rigor necessary to ensure that data is usable, sharable, and that all of the associated business and legal details are respected. Data Governance exists because data sharing and usage is necessary for decision making. And, the reason that Data Governance is necessary is because the data is often being used for a purpose outside of why it was collected.
I’ve identified 5 facets about Data Governance to consider when developing your Data Strategy. As a reminder (from the initial Data Strategy Component blog), each facet should be considered individually. And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely want to consider different options for each. Each facet can target a small organization’s issues or expand to focus on a large company’s diverse needs.
Information policies are high level information-oriented objectives that your company (or organization, or “governing body”) identifies. Information policies act as boundaries or guard rails to guide all of the detailed (and often tactical) rules to identify required and acceptable data-oriented behavior. To offer context, some examples of the information policies that I’ve seen include:
- “All customer data will be protected from unauthorized use”.
- “User data access should be limited to ‘systems of record’(when available)”.
- “All data shipped into and out of the company must be processed by the IT Data Onboarding team”.
It’s very common for Data Governance initiatives to begin with focusing on formalizing and communicating a company’s information policies.
Business Data Rules
Rules are specific lower-level details that explain what a data user (or developer) is and isn’t allowed to do. Business data rules (also referred to as “business rules”) can be categorized into one of four types:
- Terms. These are the “things” that represent the business details that we measure, track, and analyze (e.g. a customer, a purchase, a product).
- Facts. The details that describe how the terms relate to one another within a business (e.g. The customer purchases a product, Products are sold at a store location).
- Constraints. These are the restrictions associated with the various items and actions within a company (e.g. The company can only sell a product that is in inventory).
- Derivations. The distillation or generation of new rules based on other rules (e.g. Rule: A product can be purchased or returned by a customer. Derivation: A product cannot be returned unless it was purchased from the company).
While the implementation of rules is often the domain of a data administration (or a logical data modeling) team, data governance is often responsible for establishing and managing the process for introducing, communicating, and updating rules.
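To make the constraint and derivation examples above concrete, here is a minimal Python sketch. The `Store` class, its fields, and its method names are hypothetical and exist only for illustration; a real implementation would usually live in the application or data model rather than in a standalone class.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical store state used to illustrate two business data rules."""
    inventory: dict[str, int] = field(default_factory=dict)        # product -> units on hand
    purchases: list[tuple[str, str]] = field(default_factory=list)  # (customer, product)

    def sell(self, customer: str, product: str) -> None:
        # Constraint: the company can only sell a product that is in inventory.
        if self.inventory.get(product, 0) <= 0:
            raise ValueError(f"{product} is not in inventory")
        self.inventory[product] -= 1
        self.purchases.append((customer, product))

    def accept_return(self, customer: str, product: str) -> None:
        # Derivation: a product cannot be returned unless it was purchased
        # from the company.
        if (customer, product) not in self.purchases:
            raise ValueError(f"{customer} never purchased {product} here")
        self.purchases.remove((customer, product))
        self.inventory[product] = self.inventory.get(product, 0) + 1
```

Writing the rule as an explicit check is what makes it possible to communicate, review, and update it, which is the part data governance is responsible for.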
The term quality is often referred to as “conformance to requirements”. Data Acceptance is a similar concept: the details (or rules) and process applied against data to ensure it is suitable for the use intended. The premise of data acceptance is identifying the minimum details necessary to ensure that data can be used or processed to support the associated business activities. Some examples of data acceptance criteria include:
- All data values must be non-null.
- All fields within a record must reflect a value within a defined range of values for that field (or business term).
- The product’s price must be a numeric value that is non-zero and non-negative.
- All addresses must be valid mailable addresses.
In order to correct, standardize, or cleanse data, data acceptance for a specific business value (or term) must be identified.
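The first three acceptance criteria listed above can be sketched as simple checks in Python. This is only an illustration: the field names (`color`, `price`), the `ALLOWED_VALUES` table, and the error messages are assumptions, and validating mailable addresses is left out because it typically depends on an external address verification tool.

```python
from numbers import Real

# Hypothetical defined range of values per field (or business term).
ALLOWED_VALUES = {"color": {"RED", "GREEN", "BLUE"}}


def acceptance_errors(record: dict) -> list[str]:
    """Return the acceptance criteria that a record fails (empty list = accepted)."""
    errors = []
    for name, value in record.items():
        # Criterion: all data values must be non-null.
        if value is None:
            errors.append(f"{name}: value is null")
            continue
        # Criterion: the value must fall within the defined range for that field.
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            errors.append(f"{name}: {value!r} is outside the defined range")

    # Criterion: the price must be numeric, non-zero, and non-negative.
    price = record.get("price")
    if price is not None:
        is_number = isinstance(price, Real) and not isinstance(price, bool)
        if not is_number or price <= 0:
            errors.append("price: must be a numeric value greater than zero")
    return errors
```

Returning the list of failed criteria, rather than a single pass/fail flag, gives whoever cleanses the data the detail needed to correct it.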
A Data Governance Mechanism is the method (or process) to identify a new rule, process, or detail to support Data Governance. The components of a mechanism may include the process definition (or flow), the actors, and their decision rights.
This is an area where many Data Governance initiatives fail. While most Governance teams are very good at building new policies, rules, processes, and the associated rigor, they often forget to establish the mechanisms to allow all of the Governance details to be managed, maintained, and updated. This is critically important because as an organization evolves and matures with Data Governance, it may outgrow many of the initial rules and practices. Establishing a set of mechanisms to support modifying and updating existing rules and practices is important to supporting the growth and evolution of a Data Governance environment.
The strength and success of Data Governance shouldn’t be measured by the quantity of rules or policies. The success of Data Governance is reflected by the adoption of the rules and processes that are established. Consequently, it’s important for the Data Governance team to continually measure and report adoption levels to ensure the Data Governance details are applied and followed. And where there are challenges in adoption, mechanisms should exist to allow stakeholders to adjust and update the various aspects of Data Governance to support the needs of the business and the users.
Data Governance will always be a polarizing concept. Whether introduced as part of a development methodology, included within a new data initiative, required to address a business compliance need, or positioned within a Data Strategy, Data Governance is always going to ruffle feathers.
That’s because folks are busy and they don’t want to be told that they need to have their work reviewed, modified, or approved. Data Governance is an approach (and arguably a method, practice, and process) to ensure that data usage and sharing aligns with policy, business rules, and the law. Data Governance is the “rules of the road” for data.
This blog is the 4th in a series focused on reviewing the individual Components of a Data Strategy. This edition discusses the component Assemble and the numerous details involved with sourcing, cleansing, standardizing, preparing, integrating, and moving the data to make it ready to use.
The definition of Assemble is:
“Cleansing, standardizing, combining, and moving data residing in multiple locations and producing a unified view”
In the Data Strategy context, Assemble includes all of the activities required to transform data from its host-oriented application context to one that is “ready to use” and understandable by other systems, applications, and users.
Most data used within our companies is generated from the applications that run the company (point-of-sale, inventory management, HR systems, accounting). While these applications generate lots of data, their focus is on executing specific business functions; they don’t exist to provide data to other systems. Consequently, the data that is generated is “raw” in form; the data reflects the specific aspects of the application (or system of origin). This often means that the data hasn’t been standardized, cleansed, or even checked for accuracy. Assemble is all of the work necessary to convert data from a “raw” state to one that is ready for business usage.
I’ve identified 5 facets to consider when developing your Data Strategy that are commonly employed to make data “ready to use”. As a reminder (from the initial Data Strategy Component blog), each facet should be considered individually. And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely want to consider different options for each. Each facet can target a small organization’s issues or expand to focus on a large company’s diverse needs.
Identification and Matching
Data integration is one of the most prevalent data activities occurring within a company; it’s a basic activity employed by developers and users alike. In order to integrate data from multiple sources, it’s necessary to determine the identification values (or keys) from each source (e.g. the employee id in an employee list, the part number in a parts list). The idea of matching is aligning data from different sources with the same identification values. While numeric values are easy to identify and match (using the “=” operator), character-based values can be more complex (due to spelling irregularities, synonyms, and mistakes).
Even though it’s highly tactical, Identification and matching is important to consider within a Data Strategy to ensure that data integration is processed consistently. And one of the (main) reasons that data variances continue to exist within companies (despite their investments in platforms, tools, and repositories) is because the need for standardized Identification and Matching has not been addressed.
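As a rough illustration of the difference between matching on identification values and matching on character-based values, here is a small Python sketch. The function names, the field names, and the 0.85 threshold are assumptions chosen for the example; it uses the standard library’s `difflib.SequenceMatcher`, which is only one of many ways to score string similarity.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Score two character-based values between 0.0 (no overlap) and 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_by_key(left: list[dict], right: list[dict], key: str) -> list[tuple[dict, dict]]:
    """Exact match on an identification value (the "=" case)."""
    index = {row[key]: row for row in right}
    return [(row, index[row[key]]) for row in left if row[key] in index]


def match_by_name(left: list[dict], right: list[dict], name_field: str,
                  threshold: float = 0.85) -> list[tuple[dict, dict]]:
    """Approximate match for values with spelling irregularities or mistakes."""
    pairs = []
    for row in left:
        best = max(right, key=lambda r: similarity(row[name_field], r[name_field]),
                   default=None)
        if best is not None and similarity(row[name_field], best[name_field]) >= threshold:
            pairs.append((row, best))
    return pairs
```

Whatever scoring method is chosen, the point for a Data Strategy is that the normalization rules and the threshold are agreed once and reused, rather than reinvented in every integration job.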
Survivorship is a pretty basic concept: the selection of the values to retain (or survive) from the different sources that are merged. Survivorship rules are often unique for each data integration process and typically determined by the developer. In the context of a data strategy, it’s important to identify the “systems of reference” because the identification of these systems provides clarity to developers and users to understand which data elements to retain when integrating data from multiple systems.
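One way to picture survivorship is as a per-field priority list of systems of reference. The sketch below is an assumption-laden example: the system names (`crm`, `billing`), the fields, and the `SYSTEM_PRIORITY` table are hypothetical, and real survivorship rules are often richer (most recent value, most complete value, and so on).

```python
# Hypothetical systems of reference, listed per field in order of preference.
SYSTEM_PRIORITY = {
    "email": ["crm", "billing"],
    "address": ["billing", "crm"],
}


def survive(records_by_system: dict[str, dict],
            priority: dict[str, list[str]] = SYSTEM_PRIORITY) -> dict:
    """Merge one entity's records, keeping each field from its preferred system."""
    merged = {}
    for field_name, systems in priority.items():
        for system in systems:
            value = records_by_system.get(system, {}).get(field_name)
            # Skip missing or empty values and fall through to the next system.
            if value not in (None, ""):
                merged[field_name] = value
                break
    return merged
```

Keeping the priority table outside the merge logic means the choice of systems of reference is documented in one place instead of being buried in each developer’s code.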
Standardize / Cleanse
The premise of data standardization and cleansing is to identify inaccurate data and correct and reformat the data to match the requirements (or the defined standards) for a specific business element. This is likely the single most beneficial process to improve the business value (and the usability) of data. The most common challenge to data standardization and cleansing is that it can be difficult to define the requirements. The other challenge is that most users aren’t aware that their company’s data isn’t standardized and cleansed as a matter of practice. Even though most companies have multiple tools to clean up addresses, standardize descriptive details, and check the accuracy of values, the use of these tools is not common.
Wikipedia defines reference data as data that is used to classify or categorize other data. In the context of a data strategy, reference data is important because it ensures the consistency of data usage and meaning across different systems and business areas. Successful reference data means that details are consistently identified, represented, and formatted the same way across all aspects of the company (if the color of a widget is “RED”, then the value is represented as “RED” everywhere – not “R” in the product information system, 0xFF0000 in the inventory system, and 0xED2939 in the product catalog). A Reference Data initiative is often aligned with a company’s data strategy initiative because of its impact on data sharing and reuse.
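The “RED” example can be sketched as a small lookup that maps each system’s local representation to a single reference value. The table contents and the function name below are illustrative assumptions; in practice the mapping would be maintained as shared reference data rather than hard-coded.

```python
# Hypothetical reference values and the local codes that different systems use for them.
REFERENCE_COLORS = {"RED"}
COLOR_ALIASES = {
    "R": "RED",          # product information system
    "0XFF0000": "RED",   # inventory system
    "0XED2939": "RED",   # product catalog
}


def to_reference_color(raw) -> str:
    """Translate a system-specific color code into the shared reference value."""
    if isinstance(raw, int):
        key = f"0X{raw:06X}"           # e.g. 0xFF0000 -> "0XFF0000"
    else:
        key = str(raw).strip().upper()
    if key in REFERENCE_COLORS:
        return key
    if key in COLOR_ALIASES:
        return COLOR_ALIASES[key]
    # Unknown codes are rejected rather than guessed at.
    raise KeyError(f"no reference value defined for {raw!r}")
```

Rejecting unknown codes is a deliberate choice in this sketch: a new code should trigger an update to the reference data, not a silent guess.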
The idea of movement is to record the different systems that a data element touches as it travels (and is processed) after the data element is created. Movement tracking (or data lineage) is quite important when the validity and accuracy of a particular data value is questioned. And in the current era of heightened consumer data privacy and protection, the need for data lineage and tracking of consumer data within a company is becoming a requirement (and it’s the law in California and the European Union).
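At its simplest, movement tracking is a log of the systems a data element has touched. The following Python sketch assumes hypothetical class, field, and system names; dedicated lineage tools capture far more detail (transformations, job identifiers, owners), so treat this as a picture of the idea rather than a design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    system: str      # system that touched the data element
    action: str      # what happened there (created, copied, transformed, ...)
    at: datetime     # when it happened (UTC)


@dataclass
class TrackedElement:
    """A data element together with the record of where it has traveled."""
    name: str
    events: list[LineageEvent] = field(default_factory=list)

    def record(self, system: str, action: str) -> None:
        self.events.append(LineageEvent(system, action, datetime.now(timezone.utc)))

    def systems_touched(self) -> list[str]:
        """Systems in the order the element first reached them."""
        seen: list[str] = []
        for event in self.events:
            if event.system not in seen:
                seen.append(event.system)
        return seen
```

With a record like this, a question about the validity of a value (or a consumer privacy request) can start from a list of systems instead of a hunt through tribal knowledge.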
The dramatic increase in the quantity and diversity of data sources within most companies over the past few years has challenged even the most technologically advanced organizations. It’s not uncommon to find one of the most visible areas of user frustration to be associated with accessing new (or additional) data sources. Much of this frustration occurs because of the challenge in sourcing, integrating, cleansing, and standardizing new data content to be shared with users. As is the case with all of the other components, the details are easy to understand, but complex to implement. A company’s data strategy has to evolve and change when data sharing becomes a production business requirement and users want data that is “ready to use”.
During my time teaching Data Strategy in the classroom, I’m frequently asked the question, “how do I know if I need a data strategy?” For those of you that are deep thinkers, business strategists, or even data architects, I suspect your answer is either “yes!” or “why not?”.
When I’m asked that question, I actually think there’s a different question at hand, “Should I invest the time in developing a data strategy instead of something else?”
In today’s business world, there’s not a shortage of “to do list” items. So, prioritizing the development of a Data Strategy means deprioritizing some other item. In order to understand the relative priority and benefit of a Data Strategy initiative, take a look at the need, pain, or problem you’re addressing along with the quantity of people affected. Your focus should be understanding how a Data Strategy initiative will benefit the team members’ ability to do their job.
To get started, I usually spend time up front interviewing folks to understand the strengths, weaknesses, challenges, and opportunities that exist with data within a company (or organization). Let me share 5 questions that I always ask.
- Is the number of users (or organizations) building queries/reports to analyze data growing?
- Are there multiple reports containing conflicting information?
- Can a new staff member find and use data on their own, or does it require weeks or months of staff mentoring?
- Is data systematically inspected for accuracy (and corrected)? Is anyone responsible for fixing “broken data”?
- Is anyone responsible for data sharing?
While you might think these questions are a bit esoteric, each one has a specific purpose. I’m a big fan of positioning any new strategy initiative to clearly identify the problems that are going to be solved. If you’re going to undertake the development of a Data Strategy, you want to make certain that you will improve staff members’ ability to make decisions and be more effective at their jobs. These questions will help you identify where people struggle getting the job done, or where there’s an unquantified risk with using data to make decisions.
So, let me offer an explanation of each question.
- “Is the number of users (or organizations) building queries/reports to analyze data growing?”
The value of a strategy is directly proportional to the number of people that are going to be affected. In the instance of a data strategy, it’s valuable to understand the number of people that use data (hands-on) to make decisions or do their jobs. If the number is small or decreasing, a strategy initiative may not be worth the investment in time and effort. The larger the number, the greater the impact to the effectiveness (and productivity) to the various staff members.
- “Are there multiple reports containing conflicting information?”
If you have conflicting details within your company that means decisions are made with inaccurate data. That also means that there’s mistrust of information and team members are spending time confirming details. That’s business risk and a tremendous waste of time.
- “Can a new staff member find and use data…”
If a new staff member can’t be self-sufficient after a week or two on the job (when it comes to data access and usage), you have a problem. That’s like someone joining the company and not having access to office supplies, a parking space, and email. And, if the only way to learn is to beg for time from other team members, you’re spending time with two people not doing their jobs. It’s a problem that’s being ignored.
- “Is data systematically inspected for accuracy (and corrected)? …”
This item is screaming for attention. If you’re in a company that uses data to make decisions, and no one is responsible for inspecting the content, you have a problem. Think about this issue another way: would you purchase hamburger at the grocery store if there was a sign that stated “Never inspected. May be spoiled. Not our responsibility”?
- “Is anyone responsible for data sharing?”
This item gets little attention in most companies and is likely the most important of all the questions. If data is a necessary ingredient in decision making and there isn’t anyone actively responsible for ensuring that new data assets are captured, stored, tracked, managed, and shared, you’re saying that data isn’t a business asset. (How many assets in the company aren’t tied to someone’s responsibilities?)
If none of your answers points to a problem – great. You’re in an environment where data is likely managed in a manner that supports a multitude of team members’ needs across different organizations. If a single answer reveals a problem, it’s likely that an incremental investment in a tactical data management effort would be helpful. If more than 1 answer reveals a problem, your company (and the team) will benefit from a Data Strategy initiative.
I’ve been consulting in the data management space for quite a few years, and I’m often asked about the importance and need for a Data Strategy.
All too often, the idea of “strategy” brings to mind images of piles of paper, academic-style charts, and a list of unachievable goals identifying the topic at hand, but not reflecting reality. Developing a strategy isn’t about identifying perfection – it’s about identifying a set of goals that address problems and needs that require attention. A solid data strategy sets goals that are achievable and good enough to improve your data environment. A data strategy is also about identifying the tasks and activities necessary to achieve those goals. A data strategy is more than the finish line; it’s about the path of the journey. And, it’s about making sure the journey and goal are possible.
Companies spend a fortune on data. They purchase servers and storage farms to store the data, database management systems to manage the data, transformation tools to convert and transform the data, data quality tools to fix and standardize the content, and a treasure trove of analytical tools to present content that can be understood by business people. Given all of the activities, the players, and the content, why would you not want a plan?
Unfortunately, few organizations have a Data Strategy. They have lots of technology plans and roadmaps. They have platform and server plans; they have DBMS standards; they have storage strategies; they likely have analytical tool plans. While these are valuable, they are typically focused on an organization or function with minimal concern for all of the related upstream and downstream activities (how usable is a data warehouse if the data exists as multiple copies with different names and different formats, and hasn’t been checked/fixed for accuracy?) A data strategy is a plan that ensures that data is easy to find, easy to identify, easy to use, and easy to share across the company and across multiple functions.
Information technologists are exceptionally strong in the world of applications, tools, and platforms. They understand the importance of ensuring “reusability” and the benefit of an “economies-of-scale” approach. These are both just nice sound bites focused on making sure that new development work doesn’t always require reinvention. Application strategies include identifying standards (tools, platforms, storage locations, etc.) and repeatable methods to ensure efficient construction and delivery of data that can be serviced, maintained, and upgraded. An assembly line of sorts.
The challenge with most data environments is that a data strategy rarely exists; there are no repeatable methods and practices. Every new request requires building data and the associated deliverables from scratch. And, once delivered, there’s a huge testing and confirmation effort to ensure that the data is accurate. If you had a data strategy, you’d have reusable data, repeatable methods, and the details would be referenceable online instead of through tribal knowledge. And delivery efficiency and cost would improve over time.
Why do you need a data strategy? Because the cost of data is growing – and it should be shrinking. The cost of data processing has shrunk, the cost of data storage has decreased dramatically, but the cost of data delivery continues to grow. A data strategy focuses on delivering data that is easy to find, easy to use, and easy to share.
I’m a bit surprised with all of the recent discussion and debate about Shadow IT. For those of you not familiar with the term, Shadow IT refers to software development and data processing activities that occur within business unit organizations without the blessing of the Central IT organization. The idea of individual business organizations purchasing technology, hiring staff members, and taking on software development to address specific business priorities isn’t a new concept; it’s been around for 30 years.
When it comes to the introduction of technology to address or improve business process, communications, or decision making, Central IT has traditionally not been the starting point. It’s almost always been the business organization. Central IT has never been in the position of reengineering business processes or insisting that business users adopt new technologies; that’s always been the role of business management. Central IT is in the business of automating defined business processes and reducing technology costs (through the use of standard tools, economies-of-scale methods, commodity technologies). It’s not as though Shadow IT came into existence to usurp the authority or responsibilities of the IT organization. Shadow IT came into existence to address new, specialized business needs that the Central IT organization was not responsible for addressing.
Here are a few examples of information technologies that were introduced and managed by Shadow IT organizations to address specialized departmental needs.
- Word Processing. Possibly the first “end user system” (Wang, IBM DisplayWrite, etc.). This solution was revolutionary in reducing the cost of documentation.
- The minicomputer. This technology revolution of the 70’s and 80’s delivered packaged, departmental application systems (DEC, Data General, Prime, etc.). The most popular were HR, accounting, and manufacturing applications.
- The personal computer. Many companies created PC support teams (in Finance) because they required unique skills that didn’t exist within most companies.
- Email, File Servers, and Ethernet (remember Banyan, Novell, 3Com?). These tools worked outside the mainframe OLTP environment and required specialized skills.
- Data Marts and Data Warehouses. Unless you purchased a product from IBM, the early products were often purchased and managed by marketing and finance.
- Business Intelligence tools. Many companies still manage analytics and report development outside of Central IT.
- CRM and ERP systems. While both of these packages required Central IT hardware platforms, the actual application systems are often supported by separate teams positioned within their respective business areas.
The success of Shadow IT is based on its ability to respond to specialized business needs with innovative solutions. The technologies above were all introduced to address specific departmental needs; they evolved to deliver more generalized capabilities that could be valued by the larger corporate audience. The larger audience required the technology’s ownership and support to migrate from the Shadow IT organization to Central IT. Unfortunately, most companies were ill prepared to support the transition of technology between the two different technology teams.
Most Central IT teams bristle at the idea of inheriting a Shadow IT project. There are significant costs associated with transitioning a project to a different team and a larger user audience. This is why many Central IT teams push for Shadow IT to adopt their standard tools and methods (or for the outright dissolution of Shadow IT). Unfortunately, applying low-cost, standardized methods to deploy and support a specialized, high-value solution doesn’t work (if it did, it would have been used in the first place). You can’t expect to solve specialized needs with a one-size-fits-all approach.
A Shadow IT team delivers dozens of specialized solutions to their business user audience; the likelihood that any solution will be deployed to a larger audience is very small. While it’s certainly feasible to modify the charter, responsibilities, and success metrics of a Centralized IT organization to support both specialized unique and generalized high volume needs, I think there’s a better alternative: establish a set of methods and practices to address the infrequent transition of Shadow IT projects to Central IT. Both organizations should be obligated to work with and respond to the needs and responsibilities of the other technology team.
Most companies have multiple organizations with specific roles to address a variety of different activities. And organizations are expected to cooperate and work together to support the needs of the company. Why is it unrealistic to have Central IT and Shadow IT organizations with different roles to address the variety of (common and specialized) needs across a company?