This blog is 3rd in a series focused on reviewing the individual Components of a Data Strategy. This edition discusses storage and the details involved with determining the most effective method for persisting data and ensuring that it can be found, accessed, and used.
The definition of Store is:
“Persisting data in a structure and location that supports access and processing across the user audience”
Information storage is one of the most basic responsibilities of an Information Technology organization – and it’s an activity that nearly every company addresses effectively. On its surface, the idea of storage seems like a pretty simple concept: setup and install servers with sufficient storage (disk, solid state, optical, etc.) to persist and retain information for a defined period of time. And while this description is accurate, it’s incomplete. In the era of exploding data volumes, unstructured content, 3rd party data, and need to share information, the actual media that contains the content is the tip of the iceberg. The challenges with this Data Strategy Component are addressing all of the associated details involved with ensuring the data is accessible and usable.
In most companies, the options of where data is stored is overwhelming. The core application systems use special technology to provide fast, highly reliable, and efficiently positioned data. The analytics world has numerous databases and platforms to support the loading and analyzing of a seemingly endless variety of content that spans the entirety of a company’s digital existence. Most team members’ desktops can expand their storage to handle 4 terabytes of data for less than a $100. And there’s the cloud options that provide a nearly endless set of alternatives for small and large data content and processing needs. Unfortunately, this high degree of flexibility has introduced a whole slew of challenges when it comes to managing storage: finding the data, determining if the data has changed, navigating and accessing the details, and knowing the origin (or lineage).
I’ve identified 5 facets to consider when developing your Data Strategy and analyzing data storage and retention. As a reminder (from the initial Data Strategy Component blog), each facet should be considered individually. And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely to want to consider the different options for each. Each facet can target a small organization’s issues or expand to focus on a large company’s diverse needs.
The most basic facet of storing data is to identify the type of content that will be stored: raw application data, rationalized business content, or something in between. It’s fairly common for companies to store the raw data from an application system (frequently in a data lake) as well as the cooked data (in a data warehouse). The concept of “cooked” data refers to data that’s been standardized, cleaned, and stored in a state that’s “ready-to-use”. It’s likely that your company also has numerous backup copies of the various images to support the recovery from a catastrophic situation. The rigor of the content is independent of the platform where the data is stored.
There’s a bunch of work involved with acquiring and gathering data to store it and make it “ready-to-use”. One of the challenges of having a diverse set of data from numerous sources is tracking what you have and knowing where it’s located. Any type of inventory requires that the “stuff” get tracked from the moment of creation. The idea of Onboarding Content is to centrally manage and track all data that is coming into and distributed within your company (in much the same way that a receiving area works within a warehouse). The core benefit of establishing Onboarding as a single point of data reception (or gathering) is that it ensures that there’s a single place to record (and track) all acquired data. The secondary set of benefits are significant: it prevents unnecessary duplicate acquisition, provides a starting point for cataloging, and allows for the checking and acceptance of any purchased content (which is always an issue).
Navigation / Access
All too often, business people know the data want and may even know where the data is located; unfortunately, the problem is that they don’t know how to navigate and access the data where it’s stored (or created). To be fair, most operational application systems were never designed for data sharing; they were configured to process data and support a specific set of business functions. Consequently, accessing the data requires a significant level of system knowledge to navigate the associated repository to retrieve the data. In developing a Data Strategy, it’s important to identify the skills, tools, and knowledge required for a user to access the data they require. Will you require someone to have application interface and programming skills? SQL skills and relational database knowledge? Or, spreadsheet skills to access a flat file, or some other variation?
Change control is a very simple concept: plan and schedule maintenance activities, identify outages, and communicate those details to everyone. This is something that most technologists understand. In fact, most Information Technology organizations do a great job of production change control for their application environments. Unfortunately, few if any organizations have implemented data change control. The concept for data is just as simple: plan and schedule maintenance activities, identify outages (data corruption, load problems, etc.), and communicate those details to everyone. If you’re going to focus any energy on a data strategy, data change control should be considered in the top 5 items to be included as a goal and objective.
As I’ve already mentioned, most companies have lots of different options for housing data. Unfortunately, the criteria for determining the actual resting place for data often comes down to convenience and availability. While many companies have architecture standards and recommendations for where applications and data are positioned, all too often the selection is based on either programmer convenience or resource availability. The point of this area isn’t to argue what the selection criteria are, but to identify them based on core strategic (and business operation) priorities.
In your Data Strategy effort, you may find the need to include other facets in your analysis. Some of the additional details that I’ve used in the past include metadata, security, retention, lineage, and archive access. While simple in concept, this particular component continues to evolve and expand as the need for data access and sharing grows within the business world.
This blog is the 2nd in a series focused on reviewing the individual Components of a Data Strategy. This edition discusses the concept of data provisioning and the various details of making data sharable.
The definition of Provision is:
“Supplying data in a sharable form while respecting all rules and access guidelines”
One of the biggest frustrations that I have in the world of data is that few organizations have established data sharing as a responsibility. Even fewer have setup the data to be ready to share and use by others. It’s not uncommon for a database programmer or report developer to have to retrieve data from a dozen different systems to obtain the data they need. And, the data arrives in different formats and files that change regularly. This lack of consistency generates large ongoing maintenance costs and requires an inordinate amount of developer time to re-transform, prepare, fix data to be used (numerous studies have found that ongoing source data maintenance can take as much of 50% of the database developers time after the initial programming effort is completed).
Should a user have to know the details (or idiosyncrasies) of the application system that created the data to use the data? (That’s like expecting someone to understand the farming of tomatoes and manufacturing process of ketchup in order to be able to put ketchup on their hamburger). The idea of Provision is to establish the necessary rigor to simplify the sharing of data.
I’ve identified 5 of the most common facets of data sharing in the illustration above – there are others. As a reminder (from last week’s blog), each facet should be considered individually. And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely to want to review the different options for each facet. Each facet can target a small organization’s issues or expand to address a diverse enterprise’s needs.
This is the most obvious aspect of provisioning: structuring and formatting the data in a clear and understandable manner to the data consumer. All too often data is packaged at the convenience of the developer instead of the convenience of the user. So, instead of sharing data as a backup file generated by an application utility in a proprietary (or binary) format, the data should be formatted so every field is labeled and formatted (text, XML) for a non-technical user to access using easily available tools. The data should also be accompanied with metadata to simplify access.
This facet works with Packaging and addresses the details associated with the data container. Data can be shared via a file, a database table, an API, or one of several other methods. While sharing data in a programmer generated file is better than nothing, a more effective approach would be to deliver data in a well-known file format (such as Excel) or within a table contained in an easily accessible database (e.g. data lake or data warehouse).
Source data stewardship is critical in the sharing of data. In this context, a Source Data Steward is someone that is responsible for supporting and maintaining the shared data content (there several different types of data stewards). In some companies, there’s a data steward responsible for the data originating from an individual source system. Some companies (focused on sharing enterprise-level content) have positioned data stewards to support individual subject areas. Regardless of the model used, the data steward tracks and communicates source data changes, monitors and maintains the shared content, and addresses support needs. This particular role is vital if your organization is undertaking any sort of data self-service initiative.
This item addresses the issues that are common in the world of electronic data sharing: inconsistency, change, and error. Acceptance checking is a quality control process that reviews the data prior to distribution to confirm that it matches a set of criteria to ensure that all downstream users receive content as they expect. This item is likely the easiest of all details to implement given the power of existing data quality and data profiling tools. Unfortunately, it rarely receives attention because of most organization’s limited experience with data quality technology.
In order to succeed in any sort of data sharing initiative, whether in supporting other developers or an enterprise data self-service initiative, it’s important to identify the audience that will be supported. This is often the facet to consider first, and it’s valuable to align the audience with the timeframe of data sharing support. It’s fairly common to focus on delivering data sharing for developers support first followed by technical users and then the large audience of business users.
In the era of “data is a business asset” , data sharing isn’t a courtesy, it’s an obligation. Data sharing shouldn’t occur at the convenience of the data producer, it should be packaged and made available for the ease of the user.
Because the idea of building a data strategy is a fairly new concept in the world of business and information technology (IT), there’s a fair amount of discussion about the pieces and parts that comprise a Data Strategy. Most IT organizations have invested heavily in developing plans to address platforms, tools, and even storage. Those IT plans are critical in managing systems and capturing and retaining content generated by a company’s production applications. Unfortunately, those details don’t typically address all of the data activities that occur after an application has created and processed data from the initial business process. The reasons that folks take on the task of developing a Data Strategy is because of the challenges in finding, identifying, sharing, and using data. In any company, there are numerous roles and activities involved in delivering data to support business processing and analysis. A successful Data Strategy must support the breadth of activities necessary to ensure that data is “ready to use”.
There are five core components in a data strategy that work together as building blocks to address the various details necessary to comprehensively support the management and usage of data.
Identify The ability to identify data and understand its meaning regardless of its structure, origin, or location.
This concept is pretty obvious, but it’s likely one of the biggest obstacles in data usage and sharing. All too often, companies have multiple and different terms for specific business details (customer: account, client, patron; income: earnings, margin, profit). In order to analyze, report, or use data, people need to understand what it’s called and how to identify it. Another aspect of Identify is establishing the representation of the data’s value (Are the company’s geographic locations represented by name, number, or an abbreviation?) A successful Data Strategy would identify the gaps and needs in this area and identify the necessary activities and artifacts required to standardize data identification and representation.
Provision Enabling data to be packaged and made available while respecting all rules and access guidelines.
Data is often shared or made available to others at the convenience of the source system’s developers. The data is often accessible via database queries or as a series of files. There’s rarely any uniformity across systems or subject areas, and usage requires programming level skills to analyze and inventory the contents of the various tables or files. Unfortunately, the typical business person requiring data is unlikely to possess sophisticated programming and data manipulation skills. They don’t want raw data (that reflects source system formats and inaccuracies), they want data that is uniformly formatted and documented that is ready to be added to their analysis activities.
The idea of Provision is to package and provide data that is “ready to use”. A successful Data Strategy would identify the various data sharing needs and identify the necessary methods, practices, and tooling required to standardize data packaging and sharing.
Store Persisting data in a structure and location that supports access and processing across the enterprise.
Most IT organizations have solid plans for addressing this area of a Data Strategy. It’s fairly common for most companies to have a well-defined set of methods to determine the platform where online data is stored and processed, how data is archived for disaster recovery, and all of the other details such as protection, retention, and monitoring.
As the technology world has evolved, there are other facets of this area that require attention. The considerations include managing data distributed across multiple locations (the cloud, premise systems, and even multiple desktops), privacy and protection, and managing the proliferation of copies. With the emergence of new consumer privacy laws, it’s risky to store multiple copies of data, and it’s become necessary to track all existing copies of content. A successful Data Strategy ensures that any created data is always available for future access without requiring everyone to create their own copy.
Process Standardizing, combining, and moving data residing in multiple locations and providing a unified view.
It’s no secret that data integration is one of the more costly activities occurring within an IT organization; nearly 40% of the cost of new development is consumed by data integration activities. And Process isn’t limited to integration, it also includes correcting, standardizing, and formatting the content to make it “ready to use”.
With the growth of analytics and desktop decisioning making, the need to continually analyze and include new data sets into the decision-making process has exploded. Processing (or preparing or wrangling) data is no longer confined to the domain of the IT organization, it has become an end user activity. A successful Data Strategy had to ensure that all users can be self-sufficient in their abilities to process data.
Govern Establishing and communicating information rules, policies, and mechanisms to ensure effective data usage.
While most organizations are quick to identify their data as a core business asset, few have put the necessary rigor in place to effectively manage data. Data Governance is about establishing rules, policies, and decision mechanisms to allow individuals to share and use data in a manner that respects the various (legal and usage) guidelines associated with that data. The inevitable challenge with Data Governance is adoption by the entire data supply chain – from application developers to report developers to end users. Data Governance isn’t a user-oriented concept, it’s a data-oriented concept. A successful Data Strategy identifies the rigor necessary to ensure a core business asset is managed and used correctly.
The 5 Components of a Data Strategy is a framework to ensure that all of a company’s data usage details are captured and organized and that nothing is unknowingly overlooked. A successful Data Strategy isn’t about identifying every potential activity across the 5 different components. It’s about making sure that all of the identified solutions to the problems in accessing, sharing, and using data are reviewed and addressed in a thorough manner.
During my time teaching Data Strategy in the class room, I’m frequently asked the question, “how do I know if I need a data strategy?” For those of you that are deep thinkers, business strategists, or even data architects, I suspect your answer is either “yes!” or “why not?”.
When I’m asked that question, I actually think there’s a different question at hand, “Should I invest the time in developing a data strategy instead of something else?”
In today’s business world, there’s not a shortage of “to do list” items. So, prioritizing the development of a Data Strategy means deprioritizing some other item. In order to understand the relative priority and benefit of a Data Strategy initiative, take a look at the need, pain, or problem you’re addressing along with the quantity of people affected. Your focus should be understanding how a Data Strategy initiative will benefit the team members’ ability to do their job.
To get started, I usually spend time up front interviewing folks to understand the strengths, weaknesses, challenges, and opportunities that exist with data within a company (or organization). Let me share 5 questions that I always ask.
- Is the number of users (or organizations) building queries/reports to analyze data growing?
- Are there multiple reports containing conflicting information?
- Can a new staff member find and use data on their own, or does it require weeks or months of staff mentoring?
- Is data systematically inspected for accuracy (and corrected)? Is anyone responsible for fixing “broken data”?
- Is anyone responsible for data sharing?
While you might think these questions are a bit esoteric, each one has a specific purpose. I’m a big fan of positioning any new strategy initiative to clearly identify the problems that are going to be solved. If you’re going to undertake the development of a Data Strategy, you want to make certain that you will improve staff members’ ability to make decisions and be more effective at their jobs. These questions will help you identify where people struggle getting the job done, or where there’s an unquantified risk with using data to make decisions.
So, let me offer an explanation of each question.
- “Is the number of users (or organizations) building queries/reports to analyze data growing”
The value of a strategy is directly proportional to the number of people that are going to be affected. In the instance of a data strategy, it’s valuable to understand the number of people that use data (hands-on) to make decisions or do their jobs. If the number is small or decreasing, a strategy initiative may not be worth the investment in time and effort. The larger the number, the greater the impact to the effectiveness (and productivity) to the various staff members.
- “Are there multiple reports containing conflicting information? “
If you have conflicting details within your company that means decisions are made with inaccurate data. That also means that there’s mistrust of information and team members are spending time confirming details. That’s business risk and a tremendous waste of time.
- “Can a new staff member find and use data…”
If a new staff member can’t be self-sufficient after a week or two on the job (when it comes to data access and usage), you have a problem. That’s like someone joining the company and not having access to office supplies, a parking space, and email. And, if the only way to learn is to beg for time for other team members – your spending time with two people not doing their job. It’s a problem that’s being ignored.
- “Is data systematically inspected for accuracy (and corrected)? …”
This item is screaming for attention. If you’re in a company that uses data to make decisions, and no one is responsible for inspecting the content, you have a problem. Think about this issue another way: would you purchase hamburger at the grocery store if there was a sign that stated “Never inspected. May be spoiled. Not our responsibility”?
- Is anyone responsible for data sharing?
This item gets little attention in most companies and is likely the most important of all the questions. If data is a necessary ingredient in decision making and there isn’t anyone actively responsible for ensuring that new data assets are captured, stored, tracked, managed, and shared, you’re saying that data isn’t a business asset. (How many assets in the company aren’t tied to someone’s responsibilities?)
If the answer to all of the questions is “no” – great. You’re in an environment where data is likely managed in a manner that supports a multitude of team members’ needs across different organizations. If you answered “yes” to a single question, it’s likely that an incremental investment in a tactical data management effort would be helpful. If more than 1 question is answered “yes”, your company (and the team) will benefit from a Data Strategy initiative.
I’ve been consulting in the data management space for quite a few years, and I’m often asked about the importance and need for a Data Strategy.
All too often, the idea of “strategy” brings the images of piles of papers, academics-styled charts, and a list of unachievable goals identifying the topic at hand, but not reflecting reality. Developing a strategy isn’t about identifying perfection – it’s about identifying a set of goals that address problems and needs that require attention. A solid data strategy isn’t about identifying perfection, it’s about identifying a set of goals that are achievable and good enough to improve your data environment. A data strategy is also about identifying the tasks and activities necessary to achieve those goals. A data strategy is more than the finish line, it’s about the path of the journey. And, it’s about making sure the journey and goal are possible.
Companies spend a fortune on data. They purchase servers and storage farms to store the data, database management systems to manage the data, transformation tools to convert and transform the data, data quality tools to fix and standardize the content, and treasure trove of analytical tools to present content that can be understood by business people. Given all of the activities, the players, and the content, why would you not want a plan?
Unfortunately, few organizations have a Data Strategy. They have lots of technology plans and roadmaps. They have platform and server plans; they have DBMS standards; they have storage strategies; they likely have analytical tool plans. While these are valuable, they are typically focused on an organization or function with minimal concern for all of the related upstream and downstream activities (how usable is a data warehouse if the data exists as multiple copies with different names and different formats, and hasn’t been checked/fixed for accuracy?) A data strategy is a plan that ensures that data is easy to find, easy to identify, easy to use, and easy to share across the company and across multiple functions.
Information technologists are exceptionally strong in the world of applications, tools, and platforms. They understand the importance of ensuring “reusability” and the benefit of an “economies-of-scale” approach. These are both just nice sound bites focused on making sure that new development work doesn’t always require reinvention. Application strategies include identifying standards (tools, platforms, storage locations, etc.) and repeatable methods to ensure efficient construction and delivery of data that can be serviced, maintained, and upgraded. An assembly line of sorts.
The challenge with most data environments is that a data strategy rarely exists; there is no repeatable methods and practices. Every new request requires building data and the associated deliverables from scratch. And, once delivered, there’s a huge testing and confirmation effort to ensure that the data is accurate. If you had a data strategy, you’d have reusable data, repeatable methods, and the details would be referenceable online instead of through tribal knowledge. And delivery efficiency and cost would improve over time.
Why do you need a data strategy? Because the cost of data is growing –and it should be shrinking. The cost of data processing has shrunk, the cost of data storage has decreased dramatically, but the cost of data delivery continues to grow. A data strategy focuses on delivering data that is easy to find, easy to use, and easy to share.
A simple definition of Data Strategy is
“ A plan designed to improve all of the ways you acquire, store, manage, share, and use data”
Over the years, most companies have spent a fortune on their data. They have a bunch of folks that comprise their “center of expertise”, they’ve invested lots of money in various data management tools (ETL-extract/transformation/load, metadata, data catalogs, data quality, etc.), and they’ve spent bazillions on storage and server systems to retain their terabytes or petabytes of data. And what you often find is a lot of disparate (or independent) projects building specific deliverables for individual groups of users. What you rarely find is a plan that addresses all of the disparate user needs that to support their ongoing access, sharing, use of data.
While most companies have solid platform strategies, storage strategies, tool strategies, and even development strategies, few companies have a data strategy. The company has technology standards to ensure that every project uses a specific brand of server, a specific set of application development tools, a well-defined development method, and specific deliverables (requirements, code, test plan, etc.) You rarely find data standards: naming conventions and value standards, data hygiene and correction, source documentation and attribute definitions, or even data sharing and packaging conventions. The benefit of a Data Strategy is that data development becomes reusable, repeatable, more reliable, faster. Without a data strategy, the data activities within every project are always invented from scratch. Developers continually search and analyze data sources, create new transformation and cleansing code, and retest the same data, again, and again, and again.
The value of a Data Strategy is that it provides a roadmap of tasks and activities to make data easier to access, share, and use. A Data Strategy identifies the problems and challenges across multiple projects, multiple teams, and multiple business functions. A Data Strategy identifies the different data needs across different projects, teams, and business functions. A Data Strategy identifies the various activities and tasks that will deliver artifacts and methods that will benefit multiple projects, teams and business functions. A Data Strategy delivers a plan and roadmap of deliverables that ensures that data across different projects, multiple teams, and business functions are reusable, repeatable, more reliable, and delivered faster.
A Data Strategy is a common thread across both disparate and related company projects to ensure that data is managed like a business asset, not an application byproduct. It ensures that data is usable and reusable across a company. A Data Strategy is a plan and road map for ensuring that data is simple to acquire, store, manage, share, and use.