Like many of you, I’m a big believer that data is a valuable business asset. Most business leaders understand the value of data and are prepared to make decisions, adjust their direction, or consider new ideas if the data exists to support the idea. However, while most folks agree that data is valuable, few have really changed their company’s culture or behavior when it comes to treating data as an asset.
The reality is that most corporate data is not treated as an asset. In fact, most companies’ data management practices are rooted in methods and practices that are more than 30 years old. Treating data as a business asset is more than investing in storage and data transformation tools. Treating data like a valuable asset means managing, fixing, and maintaining content to ensure it’s ready and reliable to support business activities. If you disagree, let’s take a look at how companies treat other valuable business assets.
Consider a well-understood asset that exists within numerous companies: the automobile fleet. Companies that invest in automobile fleets do so because the productivity of their team members depends on having this reliable business tool. Automobile fleets exist because staff members require reliable transportation to fulfill their job responsibilities.
The company identifies and tracks the physical cars. They assign cars to individuals, and there’s a slew of rules and responsibilities associated with their use. Preventative maintenance and repairs are handled regularly to maintain the cars’ value, reliability and readiness for use. Depending on the size of the fleet, the company may have staff members (equipped with the necessary tools) to handle the ongoing maintenance. The cars are also inspected on a regular basis to ensure that any problems are identified and resolved (again, to maintain their useful life and reliability). There are also criteria for disposing of cars at their end-of-life (which is predetermined based on when the costs and liabilities exceed their value). These activities aren’t discretionary; they are necessary to protect the company’s investment in its valuable business assets.
Now, consider applying the same set of concepts to your company’s data assets.
- Is someone responsible for tracking the data assets? (Is there a list of data sources? Are they updated/maintained? Is the list published?)
- Are the responsibilities and rules for data usage identified and documented? (Does this occur for all data assets, or is it specific to individual platforms?)
- Is there a team that is responsible for monitoring and inspecting data for problems? (Are they equipped with the necessary tools to accomplish such a task?)
- Is there anyone responsible for maintaining and/or fixing inaccurate data?
- Are there details reflecting the end-of-life criteria for your data assets when the liability and costs of the data exceed their value?
If you answered no to any of these questions, it’s likely that your company views data as a tool or a commodity, but not a valuable business asset.
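To make the checklist concrete, here is a minimal sketch of an inventory entry that records an answer to each of the five questions. The class name, field names, and the sample values are all hypothetical; they illustrate the idea and don’t describe any particular tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataAsset:
    """One entry in a hypothetical data asset inventory."""
    name: str
    tracked_by: Optional[str] = None        # who keeps this inventory entry current
    usage_rules_doc: Optional[str] = None   # where the usage rules are documented
    monitoring_team: Optional[str] = None   # who inspects the data for problems
    repair_owner: Optional[str] = None      # who fixes inaccurate data
    end_of_life_rule: Optional[str] = None  # when the data should be retired


def unanswered_questions(asset: DataAsset) -> List[str]:
    """Return the checklist questions that have no answer for this asset."""
    checks = {
        "Who tracks this asset?": asset.tracked_by,
        "Where are the usage rules documented?": asset.usage_rules_doc,
        "Who monitors and inspects it?": asset.monitoring_team,
        "Who fixes inaccurate data?": asset.repair_owner,
        "What are the end-of-life criteria?": asset.end_of_life_rule,
    }
    return [question for question, answer in checks.items() if not answer]


# Hypothetical example: an asset with an owner but nothing else defined.
customer_master = DataAsset(name="customer_master", tracked_by="data office")
for question in unanswered_questions(customer_master):
    print(f"{customer_master.name}: {question}")
```

Any question that comes back unanswered is a gap in the same sense as a fleet car with no maintenance schedule.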
So, what do you do?
I certainly wouldn’t grab this list and run around the office claiming that the company isn’t treating data as an asset. Nor would I suggest that you state that your company likely spends more money maintaining its automobile fleet than its business data. (I once accused a company of spending more on landscaping than data management. It wasn’t well received.)
Instead, raise the idea of data investment as a means to increase the value and usefulness of data within the company. Conduct an informal survey of a handful of business users and ask them how much time they lose looking for their data. Ask your ETL developers to estimate the time they spend fixing broken data instead of handling their core job responsibilities. You’ll find the staff time lost because data isn’t managed and maintained as a business asset vastly exceeds the investment in preventative maintenance, tools, and repairs. You have to educate people about a problem before you can expect them to act to resolve the problem.
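The survey answers can be turned into a rough annual figure with simple arithmetic. The sketch below is only an illustration: the function name is hypothetical, and every number in it (headcounts, hours, hourly costs, working weeks) is a placeholder to be replaced with your own survey results.

```python
def annual_cost_of_lost_time(people: int,
                             hours_lost_per_week: float,
                             hourly_cost: float,
                             working_weeks: int = 48) -> float:
    """Estimate the yearly cost of staff time lost to unmanaged data."""
    return people * hours_lost_per_week * hourly_cost * working_weeks


# Placeholder inputs only; substitute the answers from your own informal survey.
business_users = annual_cost_of_lost_time(people=50, hours_lost_per_week=2.0, hourly_cost=60.0)
etl_developers = annual_cost_of_lost_time(people=5, hours_lost_per_week=10.0, hourly_cost=80.0)

total = business_users + etl_developers
print(f"Estimated annual cost of lost time: ${total:,.0f}")
```

Putting this number next to the current spend on data maintenance is usually a more persuasive opening than any accusation.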
And if all else fails, find out how much your company spends on its automobile fleet (per user) and compare it to the non-existent resources spent maintaining and fixing your company’s other valuable business asset.
This blog is the 3rd in a series focused on reviewing the individual Components of a Data Strategy. This edition discusses storage and the details involved with determining the most effective method for persisting data and ensuring that it can be found, accessed, and used.
The definition of Store is:
“Persisting data in a structure and location that supports access and processing across the user audience”
Information storage is one of the most basic responsibilities of an Information Technology organization – and it’s an activity that nearly every company addresses effectively. On its surface, the idea of storage seems like a pretty simple concept: set up and install servers with sufficient storage (disk, solid state, optical, etc.) to persist and retain information for a defined period of time. And while this description is accurate, it’s incomplete. In the era of exploding data volumes, unstructured content, 3rd party data, and the need to share information, the actual media that contains the content is the tip of the iceberg. The challenges with this Data Strategy Component are addressing all of the associated details involved with ensuring the data is accessible and usable.
In most companies, the options for where data is stored are overwhelming. The core application systems use special technology to provide fast, highly reliable, and efficiently positioned data. The analytics world has numerous databases and platforms to support the loading and analyzing of a seemingly endless variety of content that spans the entirety of a company’s digital existence. Most team members’ desktops can expand their storage to handle 4 terabytes of data for less than $100. And there are the cloud options that provide a nearly endless set of alternatives for small and large data content and processing needs. Unfortunately, this high degree of flexibility has introduced a whole slew of challenges when it comes to managing storage: finding the data, determining if the data has changed, navigating and accessing the details, and knowing the origin (or lineage).
I’ve identified 5 facets to consider when developing your Data Strategy and analyzing data storage and retention. As a reminder (from the initial Data Strategy Component blog), each facet should be considered individually. And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely want to consider the different options for each. Each facet can target a small organization’s issues or expand to focus on a large company’s diverse needs.
The most basic facet of storing data is to identify the type of content that will be stored: raw application data, rationalized business content, or something in between. It’s fairly common for companies to store the raw data from an application system (frequently in a data lake) as well as the cooked data (in a data warehouse). The concept of “cooked” data refers to data that’s been standardized, cleaned, and stored in a state that’s “ready-to-use”. It’s likely that your company also has numerous backup copies of the various images to support the recovery from a catastrophic situation. The rigor of the content is independent of the platform where the data is stored.
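As a small illustration of the difference between raw and “cooked” data, the sketch below standardizes one raw application record into a ready-to-use form. The field names (`cust_nm`, `ctry`, `rev`) and the cleaning rules are assumptions made up for this example; real rules depend on your sources and business definitions.

```python
def cook(raw: dict) -> dict:
    """Standardize and clean one raw record (hypothetical field names)."""
    return {
        # Trim stray whitespace and normalize capitalization.
        "customer_name": raw["cust_nm"].strip().title(),
        # Store country codes in a single, upper-case form.
        "country": raw["ctry"].strip().upper(),
        # Convert a formatted text amount into a number.
        "revenue_usd": round(float(raw["rev"].replace(",", "")), 2),
    }


# Raw record as it might arrive from an application system (made-up values).
raw_record = {"cust_nm": "  acme corp ", "ctry": "us ", "rev": "1,250.5"}
cooked_record = cook(raw_record)
print(cooked_record)
```

The raw record and the cooked record can both be kept; they simply answer different needs.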
There’s a bunch of work involved with acquiring and gathering data to store it and make it “ready-to-use”. One of the challenges of having a diverse set of data from numerous sources is tracking what you have and knowing where it’s located. Any type of inventory requires that the “stuff” get tracked from the moment of creation. The idea of Onboarding Content is to centrally manage and track all data that is coming into and distributed within your company (in much the same way that a receiving area works within a warehouse). The core benefit of establishing Onboarding as a single point of data reception (or gathering) is that it ensures that there’s a single place to record (and track) all acquired data. The secondary set of benefits is significant: it prevents unnecessary duplicate acquisition, provides a starting point for cataloging, and allows for the checking and acceptance of any purchased content (which is always an issue).
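A minimal sketch of the receiving-area idea might look like the following. The `OnboardingLog` and `Receipt` names, their fields, and the sample dataset are hypothetical; the point is only to show one place that records every arrival, refuses a duplicate acquisition, tracks acceptance, and feeds a catalog.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class Receipt:
    dataset: str
    supplier: str
    received_on: date
    accepted: bool = False  # set once the content has been checked


@dataclass
class OnboardingLog:
    """Single point of reception for all acquired data (illustrative only)."""
    receipts: Dict[str, Receipt] = field(default_factory=dict)

    def receive(self, dataset: str, supplier: str, received_on: date) -> Receipt:
        key = dataset.strip().lower()
        if key in self.receipts:
            # Prevent paying for (or loading) the same content twice.
            existing = self.receipts[key]
            raise ValueError(f"{dataset!r} was already acquired from {existing.supplier}")
        receipt = Receipt(dataset, supplier, received_on)
        self.receipts[key] = receipt
        return receipt

    def accept(self, dataset: str) -> None:
        """Mark purchased content as checked and accepted."""
        self.receipts[dataset.strip().lower()].accepted = True

    def catalog(self) -> List[str]:
        """Starting point for a data catalog: everything received so far."""
        return sorted(r.dataset for r in self.receipts.values())


log = OnboardingLog()
log.receive("Household Demographics", "Vendor A", date(2024, 1, 15))
log.accept("Household Demographics")
print(log.catalog())
```

A spreadsheet could serve the same purpose; what matters is that there is exactly one log and that every acquisition passes through it.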
Navigation / Access
All too often, business people know the data they want and may even know where the data is located; unfortunately, the problem is that they don’t know how to navigate and access the data where it’s stored (or created). To be fair, most operational application systems were never designed for data sharing; they were configured to process data and support a specific set of business functions. Consequently, accessing the data requires a significant level of system knowledge to navigate the associated repository to retrieve the data. In developing a Data Strategy, it’s important to identify the skills, tools, and knowledge required for a user to access the data they require. Will you require someone to have application interface and programming skills? SQL skills and relational database knowledge? Or, spreadsheet skills to access a flat file, or some other variation?
Change control is a very simple concept: plan and schedule maintenance activities, identify outages, and communicate those details to everyone. This is something that most technologists understand. In fact, most Information Technology organizations do a great job of production change control for their application environments. Unfortunately, few if any organizations have implemented data change control. The concept for data is just as simple: plan and schedule maintenance activities, identify outages (data corruption, load problems, etc.), and communicate those details to everyone. If you’re going to focus any energy on a data strategy, data change control should be considered in the top 5 items to be included as a goal and objective.
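Data change control can start as something very small: a record of each planned maintenance activity or outage, plus a notice that goes to everyone who uses the data. The sketch below assumes hypothetical names (`DataChange`, `notice`, `affecting`) and a made-up outage; it isn’t tied to any scheduling or ticketing product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class DataChange:
    dataset: str
    kind: str          # "maintenance" (planned) or "outage" (corruption, load problem, etc.)
    starts: datetime
    ends: datetime
    description: str


def notice(change: DataChange) -> str:
    """Build the one-line message to communicate to data users."""
    window = f"{change.starts:%Y-%m-%d %H:%M} to {change.ends:%Y-%m-%d %H:%M}"
    return f"[{change.kind.upper()}] {change.dataset}: {change.description} ({window})"


def affecting(changes: List[DataChange], dataset: str, at: datetime) -> List[DataChange]:
    """List the recorded changes that touch a dataset at a given moment."""
    return [c for c in changes if c.dataset == dataset and c.starts <= at <= c.ends]


# Made-up example of a load problem recorded as an outage.
changes = [
    DataChange("sales_mart", "outage",
               datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 4, 0),
               "nightly load failed"),
]
print(notice(changes[0]))
```

A user who sees odd numbers in a report can then check whether a known change explains them before opening a support request.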
As I’ve already mentioned, most companies have lots of different options for housing data. Unfortunately, the criteria for determining the actual resting place for data often come down to convenience and availability. While many companies have architecture standards and recommendations for where applications and data are positioned, all too often the selection is based on either programmer convenience or resource availability. The point of this area isn’t to argue what the selection criteria are, but to identify them based on core strategic (and business operation) priorities.
In your Data Strategy effort, you may find the need to include other facets in your analysis. Some of the additional details that I’ve used in the past include metadata, security, retention, lineage, and archive access. While simple in concept, this particular component continues to evolve and expand as the need for data access and sharing grows within the business world.
I received a funny email the other day about excuses that school children use to explain why they haven’t done their homework. The examples were pretty creative: “my mother took it to be framed”, “I got soap in my eyes and was blinded all night”, and (an oldie but a goodie) “my dog ate my homework”. It’s a shame that such a creative approach yielded such a high rate of failure. Most of us learn at an early age that you can’t talk your way out of failure; success requires that you do the work. You’d also think that as people got older and more evolved, they’d realize that there are very few shortcuts in life.
I’m frequently asked to conduct best practice reviews of business intelligence and data warehouse (BI/DW) projects. These activities usually come about because either users or IT management is concerned with development productivity or delivery quality. The review activity is pretty straightforward; interviews are scheduled and artifacts are analyzed to review the various phases, from requirements through construction to deployment. It’s always interesting to look at how different organizations handle architecture, code design, development, and testing. One of the keys to conducting a review effort is to focus on the actual results (or artifacts) that are generated during each stage. It’s foolish to discuss someone’s development method or style prior to reviewing the completeness of the artifacts. It’s not necessary to challenge someone’s approach if their artifacts reflect the details required for the other phases.
And one of the most common problems that I’ve seen with BI/DW development is the lack of documented requirements. Zip – zero – zilch – nothing. While discussions about requirements gathering, interview styles, and even document details occur occasionally, it’s the lack of any documented requirements that’s the norm. I can’t imagine how any company allows development to begin without ensuring that requirements are documented and approved by the stakeholders. Believe it or not, it happens a lot.
So, as a tribute to the creative school children of yesterday and today, I thought I would devote this blog to some of the most creative excuses I’ve heard from development teams to justify their beginning work without having requirements documentation.
- “The project’s schedule was published. We have to deliver something with or without requirements”
- “We use the agile methodology, it doesn’t require written requirements”
- “The users don’t know what they want.”
- “The users are always too busy to meet with us”
- “My bonus is based on the number of new reports I create. We don’t measure our code against requirements”
- “We know what the users want, we just haven’t written it down”
- “We’ll document the requirements once our code is complete and testing finished”
- “We can spend our time writing requirements, or we can spend our time coding”
- “It’s not our responsibility to document requirements; the users need to handle that”
- “I’ve been told not to communicate with the business users”
Many of the above items clearly reflect a broken set of management or communication methods. Expecting a development team to adhere to a project schedule when they don’t have requirements is ridiculous. Forcing a team to commit to deliverables without requirements challenges conventional development methods and financial common sense. It also reflects leadership that focuses on schedules and utilization, not business value.
A development team that is asked to build software without a set of requirements is being set up to fail. I’m always astonished that anyone would think they can argue and justify that the lack of documented requirements is acceptable. I guess there are still some folks that believe they can talk their way out of failure.
As I wrote in last week’s blog post, a data warehouse appliance simplifies platform and system resource administration. It doesn’t simplify the traditional time-intensive efforts of managing and integrating disparate data and addressing performance and tuning of various applications that contend for the same resources.
Many data warehouse appliance vendors offer sophisticated parallel processing environments, query optimization, and specialized storage structures to improve query processing (e.g., columnar-based engines). It’s naïve to think that taking data from an SMP (Symmetric Multi-Processing) relational database and moving it into a parallel processing environment will effectively scale without any adjustments or changes. Moving onto an appliance can be likened to moving into a new house. When you move into a new, larger house, you quickly learn that it’s not as simple as dumping all of your stuff into the new house. The different dimensions of the new rooms cause you to realize that some of your old furniture or rugs simply don’t fit. You inevitably have to make adjustments if you want to truly enjoy your new home. The same goes for a data warehouse appliance: it likely has numerous features to support growth and scalability, but you have to make adjustments to leverage their benefits.
Companies that expect to simply dump their data from a few legacy data marts over to a new appliance should expect to confront some adjustments, or they’re likely to experience some unpleasant surprises. Here are some that we’ve already seen.
Everyone agrees that the biggest cost issue behind building a data warehouse is ETL design and development. Hoping to migrate existing ETL jobs into a new hardware and processing environment without expecting rework is short-sighted. While you can probably force fit your existing job streams, you’ll inevitably misuse the new system, waste system resources, and dramatically reduce the lifespan of the appliance. Each appliance has its own way of handling the intensive resource requirements of data loading – in much the same way that each incumbent database product addresses these same situations. If you’ve justified an appliance through the benefits of consolidating multiple data marts (that contain duplicate data), it only makes sense to consolidate and integrate the ETL processes to prevent processing duplication and waste.
It’s also misguided to assume that you won’t have to review the underlying ETL architecture just because you built it with the latest and greatest ETL software technology. While there’s no question that migrating tool-based ETL jobs to a new platform can be much easier than migrating lower-level code, the issue at hand isn’t the source and destination; it’s the underlying table structures. Not every table will change in definition on a new platform, but the largest (and most used) table content is the most likely candidate for review and redesign. Each appliance handles data distribution and database design differently. Consequently, since the underlying table structures are likely to require adjustment, plan on a redesign of the actual ETL process too.
I’m also surprised by the casual attitude regarding technical training. After all, it’s just a SQL database, right? But application developers and data warehouse development staff need to understand the differences of the appliance product (after all, it’s a different database version or product). While most of this knowledge can be gained through reading the manuals, when was the last time the DBAs or database developers actually had a full set of manuals, much less the time required to read them? The investment in training isn’t significant: usually just a few days of classes. If you’re going to provide your developers with a product that claims to be bigger, better, and faster than its competitors, doesn’t it make sense to prepare them adequately to use it?
There’s also an assumption that, since most data warehouse appliance vendors are software-only, there are no hardware implications. On the contrary, you should expect to change your existing hardware. The way memory and storage are configured on a data warehouse appliance can differ from a general-purpose server, but it’s still rare that the hardware costs are factored into the development plan. And believing that older servers can be re-purposed has turned out to be a myth. If you’re attempting to support more storage, more processing, and more users, how can using older equipment (with the related higher maintenance costs) make financial sense?
You could certainly fork-lift your data, leave all the ETL jobs alone, and not change any processing. Then again, you could save a fortune on a new data warehouse appliance and simply do nothing. After all, no one argues with the savings associated with doing nothing—except, of course, the users that need the data to run your business.
photo by Bien Stephenson via Flickr (Creative Commons License)