Data Strategy Component: Store

Store

This blog is the 3rd in a series focused on reviewing the individual Components of a Data Strategy.  This edition discusses storage and the details involved with determining the most effective method for persisting data and ensuring that it can be found, accessed, and used.

The definition of Store is:

“Persisting data in a structure and location that supports access and processing across the user audience”

Information storage is one of the most basic responsibilities of an Information Technology organization – and it’s an activity that nearly every company addresses effectively.  On its surface, the idea of storage seems like a pretty simple concept:  set up and install servers with sufficient storage (disk, solid state, optical, etc.) to persist and retain information for a defined period of time.  And while this description is accurate, it’s incomplete.  In the era of exploding data volumes, unstructured content, 3rd party data, and the need to share information, the actual media that contains the content is just the tip of the iceberg.  The challenge with this Data Strategy Component is addressing all of the associated details involved with ensuring the data is accessible and usable.

In most companies, the options for where data is stored are overwhelming.  The core application systems use special technology to provide fast, highly reliable, and efficiently positioned data. The analytics world has numerous databases and platforms to support the loading and analyzing of a seemingly endless variety of content that spans the entirety of a company’s digital existence. Most team members’ desktops can expand their storage to handle 4 terabytes of data for less than $100.  And there are the cloud options that provide a nearly endless set of alternatives for small and large data content and processing needs.  Unfortunately, this high degree of flexibility has introduced a whole slew of challenges when it comes to managing storage:  finding the data, determining if the data has changed, navigating and accessing the details, and knowing the origin (or lineage).

I’ve identified 5 facets to consider when developing your Data Strategy and analyzing data storage and retention. As a reminder (from the initial Data Strategy Component blog), each facet should be considered individually.  And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely want to consider the different options for each.  Each facet can target a small organization’s issues or expand to focus on a large company’s diverse needs.

Stored Content

The most basic facet of storing data is to identify the type of content that will be stored:  raw application data, rationalized business content, or something in between.  It’s fairly common for companies to store the raw data from an application system (frequently in a data lake) as well as the cooked data (in a data warehouse).  The concept of “cooked” data refers to data that’s been standardized, cleaned, and stored in a state that’s “ready-to-use”.   It’s likely that your company also has numerous backup copies of the various images to support the recovery from a catastrophic situation.  The rigor of the content is independent of the platform where the data is stored.

Onboarding Content

There’s a bunch of work involved with acquiring and gathering data to store it and make it “ready-to-use”.  One of the challenges of having a diverse set of data from numerous sources is tracking what you have and knowing where it’s located. Any type of inventory requires that the “stuff” get tracked from the moment of creation.  The idea of Onboarding Content is to centrally manage and track all data that is coming into and distributed within your company (in much the same way that a receiving area works within a warehouse).  The core benefit of establishing Onboarding as a single point of data reception (or gathering) is that it ensures that there’s a single place to record (and track) all acquired data.  The secondary set of benefits are significant: it prevents unnecessary duplicate acquisition, provides a starting point for cataloging, and allows for the checking and acceptance of any purchased content (which is always an issue).
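To make the receiving-area idea a bit more concrete, here is a minimal sketch of an onboarding register in Python. Everything in it (the `OnboardingRegister` and `IntakeRecord` names, and the use of a content checksum to spot a duplicate acquisition) is an assumption made for illustration, not a description of any particular product.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IntakeRecord:
    """One entry in the register: what arrived, from where, and when."""
    name: str
    source: str
    checksum: str
    received_at: str


@dataclass
class OnboardingRegister:
    """Hypothetical single point of reception for all incoming data."""
    records: dict = field(default_factory=dict)  # keyed by content checksum

    def receive(self, name, source, content):
        """Record a delivery and return (record, is_new).

        If identical content was already received, the earlier record is
        returned and is_new is False, which flags a duplicate acquisition.
        """
        checksum = hashlib.sha256(content).hexdigest()
        if checksum in self.records:
            return self.records[checksum], False
        record = IntakeRecord(
            name, source, checksum, datetime.now(timezone.utc).isoformat()
        )
        self.records[checksum] = record
        return record, True
```

A real register would also need to catch near-duplicates (the same data delivered in a different format, for example); an exact checksum only catches byte-for-byte copies.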

Navigation / Access

All too often, business people know the data they want and may even know where the data is located; unfortunately, the problem is that they don’t know how to navigate and access the data where it’s stored (or created).  To be fair, most operational application systems were never designed for data sharing; they were configured to process data and support a specific set of business functions.  Consequently, accessing the data requires a significant level of system knowledge to navigate the associated repository to retrieve the data.   In developing a Data Strategy, it’s important to identify the skills, tools, and knowledge required for a user to access the data they require.  Will you require someone to have application interface and programming skills?  SQL skills and relational database knowledge?  Or, spreadsheet skills to access a flat file, or some other variation?

Change Control

Change control is a very simple concept: plan and schedule maintenance activities, identify outages, and communicate those details to everyone.  This is something that most technologists understand. In fact, most Information Technology organizations do a great job of production change control for their application environments.  Unfortunately, few if any organizations have implemented data change control.  The concept for data is just as simple:  plan and schedule maintenance activities, identify outages (data corruption, load problems, etc.), and communicate those details to everyone.  If you’re going to focus any energy on a data strategy, data change control should be considered in the top 5 items to be included as a goal and objective.

Platform Access

As I’ve already mentioned, most companies have lots of different options for housing data.  Unfortunately, the criteria for determining the actual resting place for data often come down to convenience and availability. While many companies have architecture standards and recommendations for where applications and data are positioned, all too often the selection is based on either programmer convenience or resource availability.  The point of this area isn’t to argue what the selection criteria are, but to identify them based on core strategic (and business operation) priorities.

In your Data Strategy effort, you may find the need to include other facets in your analysis.  Some of the additional details that I’ve used in the past include metadata, security, retention, lineage, and archive access.  While simple in concept, this particular component continues to evolve and expand as the need for data access and sharing grows within the business world.

Data Strategy Component: Provision

Provision

This blog is the 2nd in a series focused on reviewing the individual Components of a Data Strategy.  This edition discusses the concept of data provisioning and the various details of making data sharable.

The definition of Provision is:

“Supplying data in a sharable form while respecting all rules and access guidelines”

One of the biggest frustrations that I have in the world of data is that few organizations have established data sharing as a responsibility.  Even fewer have set up the data to be ready to share and use by others.  It’s not uncommon for a database programmer or report developer to have to retrieve data from a dozen different systems to obtain the data they need.  And, the data arrives in different formats and files that change regularly.   This lack of consistency generates large ongoing maintenance costs and requires an inordinate amount of developer time to re-transform, prepare, and fix data to be used (numerous studies have found that ongoing source data maintenance can take as much as 50% of the database developer’s time after the initial programming effort is completed).

Should a user have to know the details (or idiosyncrasies) of the application system that created the data to use the data? (That’s like expecting someone to understand the farming of tomatoes and manufacturing process of ketchup in order to be able to put ketchup on their hamburger).   The idea of Provision is to establish the necessary rigor to simplify the sharing of data.

I’ve identified 5 of the most common facets of data sharing in the illustration above – there are others.   As a reminder (from last week’s blog), each facet should be considered individually.  And because your Data Strategy goals will focus on future aspirational goals as well as current needs, you’ll likely want to review the different options for each facet.  Each facet can target a small organization’s issues or expand to address a diverse enterprise’s needs.

Packaging

This is the most obvious aspect of provisioning: structuring and formatting the data in a clear and understandable manner to the data consumer.  All too often data is packaged at the convenience of the developer instead of the convenience of the user. So, instead of sharing data as a backup file generated by an application utility in a proprietary (or binary) format, the data should be formatted so every field is labeled and formatted (text, XML) for a non-technical user to access using easily available tools. The data should also be accompanied with metadata to simplify access.
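As a rough illustration of packaging for the consumer rather than the developer, the sketch below writes rows as labeled text (CSV with a header row) and produces a small metadata description alongside it. The `package` function, the field names, and the metadata layout are all hypothetical; the point is only that every field carries a label and a description.

```python
import csv
import io
import json


def package(rows, field_descriptions):
    """Return (csv_text, metadata_json) for a list of dict rows.

    field_descriptions maps each column name to a plain-language
    description; its key order also fixes the column order.
    """
    fieldnames = list(field_descriptions)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()      # every column is labeled in the first row
    writer.writerows(rows)
    metadata = {
        "row_count": len(rows),
        "fields": [
            {"name": name, "description": text}
            for name, text in field_descriptions.items()
        ],
    }
    return buffer.getvalue(), json.dumps(metadata, indent=2)
```

The same idea applies to XML or a database table: the consumer should not have to guess what a column means or open a proprietary utility to read it.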

Platform Access

This facet works with Packaging and addresses the details associated with the data container.  Data can be shared via a file, a database table, an API, or one of several other methods.  While sharing data in a programmer generated file is better than nothing, a more effective approach would be to deliver data in a well-known file format (such as Excel) or within a table contained in an easily accessible database (e.g. data lake or data warehouse).

Stewardship

Source data stewardship is critical in the sharing of data.  In this context, a Source Data Steward is someone that is responsible for supporting and maintaining the shared data content (there are several different types of data stewards).  In some companies, there’s a data steward responsible for the data originating from an individual source system.  Some companies (focused on sharing enterprise-level content) have positioned data stewards to support individual subject areas.  Regardless of the model used, the data steward tracks and communicates source data changes, monitors and maintains the shared content, and addresses support needs.   This particular role is vital if your organization is undertaking any sort of data self-service initiative.

Acceptance Checking

This item addresses the issues that are common in the world of electronic data sharing:  inconsistency, change, and error.  Acceptance checking is a quality control process that reviews the data prior to distribution to confirm that it matches a set of criteria to ensure that all downstream users receive content as they expect.  This item is likely the easiest of all details to implement given the power of existing data quality and data profiling tools. Unfortunately, it rarely receives attention because of most organizations’ limited experience with data quality technology.
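Commercial data quality and profiling tools do far more than this, but a minimal sketch shows the shape of an acceptance check: compare the data against agreed criteria before it is distributed, and report every problem found. The `acceptance_check` function and its three criteria (minimum row count, required fields, allowed values) are assumptions chosen for illustration.

```python
def acceptance_check(rows, required_fields, allowed_values=None, min_rows=1):
    """Return a list of human-readable problems; an empty list means accept.

    rows            -- list of dicts, one per record
    required_fields -- field names that must be present and non-empty
    allowed_values  -- optional {field name: set of acceptable values}
    min_rows        -- smallest delivery that is considered plausible
    """
    allowed_values = allowed_values or {}
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for name in required_fields:
            if row.get(name) in (None, ""):
                problems.append(f"row {i}: missing value for '{name}'")
        for name, allowed in allowed_values.items():
            value = row.get(name)
            # Missing values are already reported above; only flag bad ones here.
            if value not in (None, "") and value not in allowed:
                problems.append(f"row {i}: unexpected {name} value '{value}'")
    return problems
```

Returning the full list of problems, rather than stopping at the first one, gives the data steward something complete to send back to the source.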

Data Audience

In order to succeed in any sort of data sharing initiative, whether in supporting other developers or an enterprise data self-service initiative, it’s important to identify the audience that will be supported.  This is often the facet to consider first, and it’s valuable to align the audience with the timeframe of data sharing support. It’s fairly common to focus on delivering data sharing for developer support first, followed by technical users and then the large audience of business users.

In the era of “data is a business asset”, data sharing isn’t a courtesy; it’s an obligation.  Data sharing shouldn’t occur at the convenience of the data producer; the data should be packaged and made available for the ease of the user.

The 5 Components of a Data Strategy

Because the idea of building a data strategy is a fairly new concept in the world of business and information technology (IT), there’s a fair amount of discussion about the pieces and parts that comprise a Data Strategy.   Most IT organizations have invested heavily in developing plans to address platforms, tools, and even storage.   Those IT plans are critical in managing systems and capturing and retaining content generated by a company’s production applications.  Unfortunately, those details don’t typically address all of the data activities that occur after an application has created and processed data from the initial business process. Folks take on the task of developing a Data Strategy because of the challenges in finding, identifying, sharing, and using data.  In any company, there are numerous roles and activities involved in delivering data to support business processing and analysis.  A successful Data Strategy must support the breadth of activities necessary to ensure that data is “ready to use”.

There are five core components in a data strategy that work together as building blocks to address the various details necessary to comprehensively support the management and usage of data.

Identify          The ability to identify data and understand its meaning regardless of its structure, origin, or location.

This concept is pretty obvious, but it’s likely one of the biggest obstacles in data usage and sharing.  All too often, companies have multiple and different terms for specific business details (customer: account, client, patron; income: earnings, margin, profit).  In order to analyze, report, or use data, people need to understand what it’s called and how to identify it.  Another aspect of Identify is establishing the representation of the data’s value (Are the company’s geographic locations represented by name, number, or an abbreviation?)  A successful Data Strategy would identify the gaps and needs in this area and identify the necessary activities and artifacts required to standardize data identification and representation.
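One small artifact that supports Identify is a shared glossary that maps the terms used in individual systems to a single business term. The sketch below is a hypothetical example: the `GLOSSARY` contents come from the customer example above, and the function names are invented for illustration.

```python
# Hypothetical glossary: canonical business term -> terms seen in source systems.
GLOSSARY = {
    "customer": {"account", "client", "patron"},
}


def canonical_term(term):
    """Return the canonical business term for a source term, or None if unknown."""
    key = term.strip().lower()
    for canonical, synonyms in GLOSSARY.items():
        if key == canonical or key in synonyms:
            return canonical
    return None


def standardize_columns(row):
    """Rename known synonyms in a row's keys; leave unknown keys untouched."""
    return {(canonical_term(name) or name): value for name, value in row.items()}
```

For example, `standardize_columns({"Client": "Acme", "region": "West"})` renames the first key to `customer` and leaves `region` alone. Deciding which terms really are synonyms is a business decision, not something code can settle.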

Provision       Enabling data to be packaged and made available while respecting all rules and access guidelines.

Data is often shared or made available to others at the convenience of the source system’s developers. The data is often accessible via database queries or as a series of files.  There’s rarely any uniformity across systems or subject areas, and usage requires programming level skills to analyze and inventory the contents of the various tables or files.  Unfortunately, the typical business person requiring data is unlikely to possess sophisticated programming and data manipulation skills.   They don’t want raw data (that reflects source system formats and inaccuracies), they want data that is uniformly formatted and documented that is ready to be added to their analysis activities.

The idea of Provision is to package and provide data that is “ready to use”.   A successful Data Strategy would identify the various data sharing needs and identify the necessary methods, practices, and tooling required to standardize data packaging and sharing.

Store               Persisting data in a structure and location that supports access and processing across the enterprise.

Most IT organizations have solid plans for addressing this area of a Data Strategy. It’s fairly common for most companies to have a well-defined set of methods to determine the platform where online data is stored and processed, how data is archived for disaster recovery, and all of the other details such as protection, retention, and monitoring.

As the technology world has evolved, there are other facets of this area that require attention.  The considerations include managing data distributed across multiple locations (the cloud, on-premises systems, and even multiple desktops), privacy and protection, and managing the proliferation of copies.   With the emergence of new consumer privacy laws, it’s risky to store multiple copies of data, and it’s become necessary to track all existing copies of content.  A successful Data Strategy ensures that any created data is always available for future access without requiring everyone to create their own copy.

Process           Standardizing, combining, and moving data residing in multiple locations and providing a unified view.

It’s no secret that data integration is one of the more costly activities occurring within an IT organization; nearly 40% of the cost of new development is consumed by data integration activities.  And Process isn’t limited to integration, it also includes correcting, standardizing, and formatting the content to make it “ready to use”.

With the growth of analytics and desktop decision making, the need to continually analyze and include new data sets into the decision-making process has exploded. Processing (or preparing or wrangling) data is no longer confined to the domain of the IT organization; it has become an end user activity.  A successful Data Strategy has to ensure that all users can be self-sufficient in their abilities to process data.

Govern           Establishing and communicating information rules, policies, and mechanisms to ensure effective data usage.

While most organizations are quick to identify their data as a core business asset, few have put the necessary rigor in place to effectively manage data.  Data Governance is about establishing rules, policies, and decision mechanisms to allow individuals to share and use data in a manner that respects the various (legal and usage) guidelines associated with that data.  The inevitable challenge with Data Governance is adoption by the entire data supply chain – from application developers to report developers to end users.  Data Governance isn’t a user-oriented concept, it’s a data-oriented concept.    A successful Data Strategy identifies the rigor necessary to ensure a core business asset is managed and used correctly.

The 5 Components of a Data Strategy is a framework to ensure that all of a company’s data usage details are captured and organized and that nothing is unknowingly overlooked.   A successful Data Strategy isn’t about identifying every potential activity across the 5 different components.  It’s about making sure that all of the identified solutions to the problems in accessing, sharing, and using data are reviewed and addressed in a thorough manner.

What is a Data Strategy?

A simple definition of Data Strategy is

“A plan designed to improve all of the ways you acquire, store, manage, share, and use data”

Over the years, most companies have spent a fortune on their data.  They have a bunch of folks that comprise their “center of expertise”, they’ve invested lots of money in various data management tools (ETL-extract/transformation/load, metadata, data catalogs, data quality, etc.), and they’ve spent bazillions on storage and server systems to retain their terabytes or petabytes of data.  And what you often find is a lot of disparate (or independent) projects building specific deliverables for individual groups of users.   What you rarely find is a plan that addresses all of the disparate user needs to support their ongoing access, sharing, and use of data.

While most companies have solid platform strategies, storage strategies, tool strategies, and even development strategies, few companies have a data strategy.  The company has technology standards to ensure that every project uses a specific brand of server, a specific set of application development tools, a well-defined development method, and specific deliverables (requirements, code, test plan, etc.).  You rarely find data standards:  naming conventions and value standards, data hygiene and correction, source documentation and attribute definitions, or even data sharing and packaging conventions.  The benefit of a Data Strategy is that data development becomes reusable, repeatable, more reliable, and faster.  Without a data strategy, the data activities within every project are always invented from scratch.  Developers continually search and analyze data sources, create new transformation and cleansing code, and retest the same data, again, and again, and again.

The value of a Data Strategy is that it provides a roadmap of tasks and activities to make data easier to access, share, and use.  A Data Strategy identifies the problems and challenges across multiple projects, multiple teams, and multiple business functions.  A Data Strategy identifies the different data needs across different projects, teams, and business functions.   A Data Strategy identifies the various activities and tasks that will deliver artifacts and methods that will benefit multiple projects, teams and business functions.   A Data Strategy delivers a plan and roadmap of deliverables that ensures that data across different projects, multiple teams, and business functions are reusable, repeatable, more reliable, and delivered faster.

A Data Strategy is a common thread across both disparate and related company projects to ensure that data is managed like a business asset, not an application byproduct.  It ensures that data is usable and reusable across a company.  A Data Strategy is a plan and road map for ensuring that data is simple to acquire, store, manage, share, and use.

Who Has My Personal Data?

In order to prepare for the cooking gauntlet that often occurs with the end of year holiday season, I decided to purchase a new rotisserie oven.  The folks at Acme Rotisserie include a large amount of documentation with their rotisserie. I reviewed the entire pile and was a bit surprised by the warranty registration card. The initial few questions made sense: serial number, place of purchase, date of purchase, my home address.  The other questions struck me as a bit too inquisitive: number of household occupants, household income, own/rent my residence, marital status, and education level. Obviously, this card was a Trojan horse of sorts; provide registration details –and all kinds of other personal information.  They wanted me to give away my personal information so they could analyze it, sell it, and make money off of it.

Companies collecting and analyzing consumer data isn’t anything new – it’s been going on for decades.  In fact, there are laws in place to protect consumers’ data in quite a few industries (healthcare, telecommunications, and financial services). Most of the laws focus on protecting the information that companies collect based on their relationship with you.  It’s not just the details that you provide to them directly; it’s the information that they gather about how you behave and what you purchase.  Most folks believe behavioral information is more valuable than the personal descriptive information you provide.  The reason is simple: you can offer creative (and highly inaccurate) details about your income, your education level, and the car you drive.  You can’t really lie about your behavior.

I’m a big fan of sharing my information if it can save me time, save me money, or generate some sort of benefit. I’m willing to share my waist size, shirt size, and color preferences with my personal shopper because I know they’ll contact me when suits or other clothing that I like is available at a good price.  I’m fine with a grocer tracking my purchases because they’ll offer me personalized coupons for those products.  I’m not okay with the grocer selling that information to my health insurer.  Providing my information to a company to enhance our relationship is fine; providing my information to a company so they can share, sell, or otherwise unilaterally benefit from it is not fine.  My data is proprietary and my intellectual property.

Clearly companies view consumer data to be a highly valuable asset.  Unfortunately, we’ve created a situation where there’s little or no cost to retain, use, or abuse that information. As abuse and problems have occurred within certain industries (financial services, healthcare, and others), we’ve created legislation to force companies to responsibly invest in the management and protection of that information. They have to contact you to let you know they have your information and allow you to update communications and marketing options. It’s too bad that every company with your personal information isn’t required to behave in the same way.  If data is so valuable that a company retains it, requiring some level of maintenance (and responsibility) shouldn’t be a big deal.

It’s really too bad that companies with copies of my personal information aren’t required to contact me to update and confirm the accuracy of all of my personal details. That would ensure that all of the specialized big data analytics that are being used to improve my purchase experiences were accurate. If I knew who had my data, I could make sure that my preferences were up to date and that the data was actually accurate.

It’s unfortunate that Acme Rotisserie isn’t required to contact me to confirm that I have 14 children, an advanced degree in swimming pool construction, and a red Ferrari in my garage. It will certainly be interesting to see the personalized offers I receive for the upcoming Christmas shopping season.

Hadoop Replacing Data Warehouse Processing

I was recently asked for my opinion on the potential of Hadoop replacing a company’s data warehouse (DW).  While there’s lots to be excited about when it comes to Hadoop, I’m not currently in the camp of folks that believe it’s practical to use Hadoop to replace a company’s DW.  Most corporate DW systems are based on commercial relational database products and can store and manage multiple terabytes of data and support hundreds (if not thousands) of concurrent users.  It’s fairly common for these systems to handle complex, mixed workloads – queries processing billions of rows across numerous tables along with simple primary key retrieval requests, all while continually loading data.  The challenge today is that Hadoop simply isn’t ready for this level of complexity.

All that being said, I do believe there’s a huge opportunity to use Hadoop to replace a significant amount of processing that is currently being handled by most DWs.  Oh, and data warehouse users won’t be affected at all.

Let’s review a few fundamental details about the DW. There are two basic data processing activities that occur on a DW: query processing and transformation processing. Query processing is servicing the SQL that’s submitted from all of the tools and applications on the users’ desktops, tablets, and phones.  Transformation processing is the workload involved with converting data from its source application formats to the format required by the data warehouse. While the most visible activity to business users is query processing, it is typically the smaller of the two.  Extracting and transforming the dozens (or hundreds) of source data files for the DW is a huge processing activity.  In fact, most DWs are not sized for query processing; they are sized for the daily transformation processing effort.

It’s important to realize that one of the most critical service level agreements (SLAs) of a DW is data delivery.  Business users want their data first thing each morning.  That means the DW has to be sized to deliver data reliably each and every business morning.  Since most platforms are anticipated to have a 3+ year life expectancy, IT has to size the DW system based on the worst case data volume scenario for that entire period (end of quarter, end of year, holidays, etc.) This means the DW is sized to address a maximum load that may only occur a few times during that entire period.

This is where the opportunity for Hadoop seems pretty obvious. Hadoop is a parallel, scalable framework that handles distributed batch processing and large data volumes. It’s really a set of tools and technologies for developers, not end users.  This is probably why so many ETL (extract, transformation, and load) product vendors have ported their products to execute within a Hadoop environment.  It only makes sense to migrate processing from a specialized platform to commodity hardware. Why bog down and over invest in your DW platform if you can handle the heavy lifting of transformation processing on a less expensive platform?

Introducing a new system to your DW environment will inevitably create new work for your DW architects and developers. However, the benefits are likely to be significant.  While some might view such an endeavor as a creative way to justify purchasing new hardware and installing Hadoop, the real reason is to extend the life of the data warehouse (and save your company a bunch of money by deferring a DW upgrade).

Data Quality, Data Maintenance

I read an interesting tidbit about data the other day:  the United States Postal Service processed more than 47 million changes of addresses in the last year.  That’s nearly 1 in 6 people. In the world of data, that factoid is a simple example of the challenge of addressing stale data and data quality.  The idea of stale data is that as data ages, its accuracy and associated business rules can change.

There are lots of examples of how data in your data warehouse can age and degrade in accuracy and quality:  people move, area codes change, postal/zip codes change, product descriptions change, and even product SKUs can change.  Data isn’t clean and accurate forever; it requires constant review and maintenance. This shouldn’t be much of a surprise for folks that view data as a corporate asset; any asset requires ongoing maintenance in order to retain and ensure its value.  The challenge with maintaining any asset is establishing a reasonable maintenance plan.

Unfortunately, while IT teams are exceptionally strong in planning and carrying out application maintenance, it’s quite rare that data maintenance gets any attention.  In the data warehousing world, data maintenance is typically handled in a reactive, project-centric manner.  Nearly every data warehouse (or reporting) team has to deal with data maintenance issues whenever a company changes major business processes or modifies customer or product groupings (e.g. new sales territories, new product categories, etc.)  This happens so often, most data warehouse folks have even given it a name:  Recasting History.   Regardless of what you call it, it’s a common occurrence and there are steps that can be taken to simplify the ongoing effort of data maintenance.

  • Establish a regularly scheduled data maintenance window.  Just like the application maintenance world, identify a window of time when data maintenance can be applied without impacting application processing or end user access.
  • Collect and publish data quality details.  Profile and track the content of the major subject area tables within your data warehouse environment. Any significant shift in domain values, relationship details, or data demographics can be discovered prior to a user calling to report an undetected data problem.
  • Keep the original data.  Most data quality processing overwrites original content with new details.  Instead, keep the cleansed data and place the original values at the end of your table records. While this may require a bit more storage, it will dramatically simplify maintenance when rule changes occur in the future.
  • Add source system identification and creation date/time details to every record.  While this may seem tedious and unnecessary, these two fields can dramatically simplify maintenance and troubleshooting in the future.
  • Schedule a regular data change control meeting.  This too is similar in concept to the change control meeting associated with IT operations teams.  This is a forum for discussing data content issues and changes.
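Two of the steps above (keeping the original data, and adding source system and creation date/time details) are easy to picture in code. The following is a minimal sketch under assumed names: `cleanse_record`, the `_original` suffix, and the `source_system` and `created_at` fields are illustrative choices, not a standard.

```python
from datetime import datetime, timezone


def cleanse_record(record, field_name, clean, source_system):
    """Return a copy of record with one field cleansed and audit details added.

    The cleansed value replaces the field, the original value is kept in
    '<field_name>_original' at the end of the record, and the source system
    and creation timestamp are stamped on the result.
    """
    result = dict(record)                                   # leave the input untouched
    result[field_name] = clean(record[field_name])          # cleansed value in place
    result[f"{field_name}_original"] = record[field_name]   # keep the original data
    result["source_system"] = source_system
    result["created_at"] = datetime.now(timezone.utc).isoformat()
    return result
```

With the original value still on the record, a future change to the cleansing rule can be re-applied without going back to the source system.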

Unfortunately, I often find that data maintenance is completely ignored. The problem is that fixing broken or inaccurate data isn’t sexy; developing a data maintenance plan isn’t always fun.   Most data warehouse development teams are buried with building new reports, loading new data, or supporting the ongoing ETL jobs; they haven’t given any attention to the quality or accuracy of the actual content they’re moving and reporting.   They simply don’t have the resources or time to address data maintenance as a proactive activity.

Business users clamor for new data and new reports; new funding is always tied to new business capabilities.  Support costs are budgeted, but they’re focused on software and hardware maintenance activities.  No one ever considers data maintenance; it’s simply ignored and forgotten.

Interesting that we view data as a corporate asset – a strategic corporate asset – and there’s universal agreement that hardware and software are simply tools to support enablement.  And where are we investing in maintenance?  The commodity tools, not the strategic corporate asset.

Photo courtesy of DesignzillasFlickr via Flickr (Creative Commons license).

Advanced Data Virtualization Capabilities


In one of my previous blogs, I wrote about Data Virtualization technology — one of the more interesting pieces of middleware technology that can simplify data management.   While most of the commercial products in this space share a common set of features and functions, I thought I’d devote this blog to discussing the more advanced features.  There are quite a few competing products; the real challenge in differentiating the products is to understand their more advanced features.

The attraction of data virtualization is that it simplifies data access.  Most IT shops have one of everything – and this includes several different brands of commercial DBMSs, a few open source databases, a slew of BI/reporting tools, and the inevitable list of emerging and specialized tools and technologies (Hadoop, Dremel, Cassandra, etc.).  Supporting all of the client-to-server-to-repository interfaces (and the associated configurations) is both complex and time consuming.  This is why the advanced capabilities of Data Virtualization have become so valuable to the IT world.

The following details aren’t arranged in any particular order.  I’ve identified the ones that I’ve found to be the most valuable (and interesting).  Let me also acknowledge not every DV product supports all of these features.

Intelligent data caching.  Repository-to-DV Server data movement is the biggest obstacle in query response time.  Most DV products are able to support static caching to reduce repetitive data movement (data is copied and persisted in the DV Server).  Unfortunately, this approach has limited success when there are ad hoc users accessing dozens of sources and thousands of tables.  The more effective solution is for the DV Server to monitor all queries and dynamically cache data based on user access, query load, and table (and data) access frequency.
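
As a rough illustration of the difference between static and dynamic caching, here is a small Python sketch in which a table is only cached once it has been requested often enough.  The class name, the `fetch` callable, and the threshold are assumptions made for the example; a real DV server would also weigh query load and handle cache invalidation, which this sketch leaves out.

```python
from collections import Counter


class DynamicTableCache:
    """Cache a table only after it has been requested `threshold` times."""

    def __init__(self, fetch, threshold=3):
        self.fetch = fetch          # hypothetical callable: table name -> rows from the source
        self.threshold = threshold
        self.hits = Counter()       # access frequency per table
        self.cache = {}

    def read(self, table):
        self.hits[table] += 1
        if table in self.cache:
            return self.cache[table]        # served locally, no data movement
        rows = self.fetch(table)            # repository-to-DV-server data movement
        if self.hits[table] >= self.threshold:
            self.cache[table] = rows        # frequently used, so keep a copy
        return rows


source_calls = []


def fetch_from_source(table):
    source_calls.append(table)
    return [(table, 1)]


cache = DynamicTableCache(fetch_from_source, threshold=2)
for _ in range(4):
    cache.read("orders")
cache.read("returns")
```

Here `orders` is read four times but only pulled from its source twice, while the rarely used `returns` table is never cached.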

Query optimization (w/multi-platform execution). While all DV products claim some amount of query optimization, it’s important to know the details. There are lots of tricks and techniques; however, look for optimization that understands source data volumes, data distribution, data movement latency, and is able to process data on any source platform.
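
To give a feel for what "understanding data volumes and movement latency" might mean, here is a deliberately simplified Python sketch that decides where to run a two-table join by estimating how long each table would take to move.  The `SourceTable` fields and the transfer rates are invented for illustration; real optimizers consider far more (data distribution, indexes, platform capabilities).

```python
from dataclasses import dataclass


@dataclass
class SourceTable:
    name: str
    platform: str
    rows: int
    rows_per_second: float  # assumed transfer rate off this platform


def plan_join(a, b):
    """Ship whichever table is cheaper to move and run the join where the other one lives."""
    cost_move_a = a.rows / a.rows_per_second
    cost_move_b = b.rows / b.rows_per_second
    if cost_move_a <= cost_move_b:
        return {"execute_on": b.platform, "ship": a.name, "est_seconds": cost_move_a}
    return {"execute_on": a.platform, "ship": b.name, "est_seconds": cost_move_b}


orders = SourceTable("orders", "warehouse", rows=50_000_000, rows_per_second=100_000)
regions = SourceTable("regions", "crm", rows=200, rows_per_second=1_000)
plan = plan_join(orders, regions)
```

In this example the tiny `regions` table is shipped to the warehouse rather than dragging fifty million order rows across the network.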

Support for multiple client Interfaces.  Since most companies have multiple database products, it can be cumbersome to support and maintain multiple client access configurations.  The DV server can act as a single access point for multiple vendor products (a single ODBC interface can replace drivers for each DBMS brand).  Additionally, most DV Server drivers support multiple different access methods (ODBC, JDBC, XML, and web services).

Attribute level or value specific data security.  This feature supports data security at a much lower granularity than is typically available with most DBMS products.  Data can be protected (or restricted) at the level of individual column values, either for an entire table or for selected rows.
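
Here is one way to picture value-level security, sketched in Python.  The rule structure (`column`, `allowed_roles`, `applies_to`) and the sample roles are assumptions of mine; actual products express these policies in their own configuration languages.

```python
def apply_security(rows, rules, role):
    """Return copies of `rows` with protected values masked for this role."""
    secured = []
    for row in rows:
        visible = dict(row)  # never modify the source row
        for rule in rules:
            restricted = role not in rule["allowed_roles"]
            if restricted and rule["applies_to"](row):
                visible[rule["column"]] = None
        secured.append(visible)
    return secured


rules = [
    # Column-level rule: salary is hidden in every row unless the user is in HR.
    {"column": "salary", "allowed_roles": {"hr"}, "applies_to": lambda row: True},
    # Row-selective rule: phone is hidden only for EU rows.
    {"column": "phone", "allowed_roles": {"hr", "support"},
     "applies_to": lambda row: row["region"] == "EU"},
]
employees = [
    {"name": "Ana", "region": "EU", "salary": 90000, "phone": "555-0101"},
    {"name": "Bo", "region": "US", "salary": 85000, "phone": "555-0102"},
]
analyst_view = apply_security(employees, rules, "analyst")
hr_view = apply_security(employees, rules, "hr")
```

The first rule covers a column for the entire table; the second covers the same kind of value only for selected rows.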

Metadata tracking and management.  Since Data Virtualization is a query-centric middleware environment, it only makes sense to position this server to retrieve, reconcile, and store metadata content from multiple, disparate data repositories.

Data lineage. This item works in tandem with the metadata capability and augments the information by retaining the source details for all data that is retrieved.  This not only includes source id information for individual records but also the origin, creation date, and native attribute details.

Query tracking for usage audit. Because the DV Server can act as a centralized access point for user tool access, there are several DV products that support the capture and tracking of all submitted queries.  This can be used to track, measure, and analyze end user (or repository) access.

Workflow linkage and processing.  This is the ability to execute predefined logic against specific data that is retrieved. While this concept is similar to a macro or stored procedure, it’s much more sophisticated.  It could include the ability to direct job control or specialized processing against an answer set prior to delivery (e.g. data hygiene, external access control, stewardship approval, etc.).

Packaged Application Templates.  Most packaged applications (CRM, ERP, etc.) contain thousands of tables and columns that can be very difficult to understand and query.  Several DV vendors have developed templates containing predefined DV server views that access the most commonly queried data elements.

Setup and Configuration Wizards. Configuring a DV server to access the multiple data sources can be a very time consuming exercise; the administrator needs to define and configure every source repository, the underlying tables (or files), along with the individual data fields.  To simplify setup, a configuration wizard reviews the dictionary of an available data source and generates the necessary DV Server configuration details. It further analyzes the table and column names to simplify naming conventions, joins, and data value conversion and standardization details.
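
The idea behind such a wizard can be sketched in a few lines of Python.  This example reads the dictionary of a SQLite database (standing in for any source repository) and generates a configuration structure with simplified, lower-case names.  The shape of the generated configuration is hypothetical; each DV product has its own format.

```python
import sqlite3


def generate_source_config(conn, source_name):
    """Build a (hypothetical) DV source definition from the database's own dictionary."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    config = {"source": source_name, "tables": {}}
    for (table,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        config["tables"][table.lower()] = {
            "physical_name": table,
            "columns": {col[1].lower(): col[2] for col in columns},
        }
    return config


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUST_MASTER (CUST_ID INTEGER, CUST_NM TEXT)")
config = generate_source_config(conn, "crm")
```

A real wizard would go further and propose joins and value conversions, but the starting point is the same: let the source describe itself rather than typing every table and field by hand.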

Don’t be misled into thinking that Data Virtualization is a highly mature product space where all of the products are nearly identical.  They aren’t.  Most product vendors spend more time discussing their unique features than offering metrics about their core features.  It’s important to remember that every Data Virtualization product requires a server that retrieves and processes data to fulfill query requests. This technology is not a commodity, which means that details like setup/configuration time, query performance, and advanced features can vary dramatically across products.  Benchmark and test drive the technology before buying.

The Power of Data Virtualization


I was participating in a discussion about Data Virtualization (DV) the other day and was intrigued with the different views that everyone had about a technology that’s been around for more than 10 years. For those of you that don’t participate in IT-centric, geekfest discussions on a regular basis, Data Virtualization software is middleware that allows various disparate data sources to look like a single relational database.  Some folks characterize Data Virtualization as a software abstraction layer that removes the storage location and format complexities associated with manipulating data. The bottom line is that Data Virtualization software can make a BI (or any SQL) tool see data as though it’s contained within a single database even though it may be spread across multiple databases, XML files, and even Hadoop systems.

What intrigued me about the conversation is that most of the folks had been introduced to Data Virtualization not as an infrastructure tool that simplifies specific disparate data problems, but as the secret sauce or silver bullet for a specific application. They had all inherited an application that had been built outside of IT to address a business problem that required data to be integrated from a multitude of sources.  And in each instance, the applications were able to capitalize on Data Virtualization as a more cost effective solution for integrating detailed data. Instead of building a new platform to store and process another copy of the data, they used Data Virtualization software to query and integrate data from the individual source systems. And each “solution” utilized a different combination of functions and capabilities.

As with any technology discussion, there’s always someone that believes that their favorite technology is the best thing since sliced bread – and they want to apply their solution to every problem.  Data Virtualization is an incredibly powerful technology with a broad array of functions that enable multi-source query processing. Given the relative obscurity of this data management technology, I thought I’d review some of the more basic capabilities supported by this technology.

Multi-Source Query Processing.  This is often referred to as Query Federation: the ability to have a single query process data across multiple data stores.

Simplify Data Access and Navigation.  Exposes data as a single (virtual) data source from numerous component sources. The DV system handles the various network, SQL dialect, and/or data conversion issues.

Integrate Data “On the Fly”.  This is referred to as Data Federation. The DV server retrieves and integrates source data to support each individual query. 
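
For readers who prefer to see federation rather than read about it, here is a toy Python sketch.  Two in-memory SQLite databases stand in for separate source systems, and a small function plays the role of the DV server by retrieving from each and integrating the results per request.  The table names and data are made up for the example.

```python
import sqlite3

# Two independent "source systems", each with its own database.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany(
    "INSERT INTO invoices VALUES (?, ?)", [(1, 120.0), (1, 80.0), (2, 45.5)]
)


def federated_totals(crm_conn, billing_conn):
    """Answer one logical query by retrieving from both sources and joining on the fly."""
    names = dict(crm_conn.execute("SELECT id, name FROM customers"))
    # The aggregation is pushed down to the billing source so fewer rows have to move.
    totals = billing_conn.execute(
        "SELECT customer_id, SUM(amount) FROM invoices "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    return [(names[cid], total) for cid, total in totals]


result = federated_totals(crm, billing)
```

Nothing is copied into a new platform; the integrated answer exists only for the life of the query.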

Access to Non-Relational Data. The DV server is able to portray non-relational data (e.g. XML data, flat files, Hadoop, etc.) as structured, relational tables.  
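
A small Python sketch shows the general idea for XML: pick a repeating element to act as the row, pick child elements to act as columns, and the result can be queried with ordinary SQL.  The document layout, the function name, and the use of SQLite as the query engine are assumptions for illustration only.

```python
import sqlite3
import xml.etree.ElementTree as ET

ORDERS_XML = """
<orders>
  <order><customer>Acme</customer><total>120.50</total></order>
  <order><customer>Globex</customer><total>45.00</total></order>
</orders>
"""


def xml_to_rows(xml_text, row_tag, columns):
    """Treat each `row_tag` element as a row and the named child elements as columns."""
    root = ET.fromstring(xml_text)
    return [tuple(node.findtext(col) for col in columns) for node in root.iter(row_tag)]


rows = xml_to_rows(ORDERS_XML, "order", ["customer", "total"])

# Once the data looks like rows, any SQL tool can work with it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
big_orders = conn.execute("SELECT customer FROM orders WHERE total > 100").fetchall()
```

Note that this only works because the XML has a regular, repeating structure to map onto rows and columns.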

Standardize and Transform Data. Once the data is retrieved from the origin, the DV server will convert the data (if necessary) into a format to support matching and integration.

Integrate Relational and Non-Relational Data. Because DV can make any data source (well, almost any) look like a relational table, this capability is implicit. Keep in mind that the data (or a subset of it) must have some sort of implicit structure.  

Expose a Data Services Interface. The DV server can expose a web service that is attached to a predefined query, which the server processes whenever the service is called.

Govern Ad Hoc Queries. The DV Server can monitor query submissions, run time, and even complexity – and terminate or prevent processing under specific rule-based situations.
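
As a minimal sketch of rule-based query governance, the Python example below uses SQLite's progress handler to terminate a query once it exceeds a work budget.  SQLite is only standing in for a DV server here, and the function name and budget values are my own assumptions.

```python
import sqlite3


def run_governed(conn, sql, max_checks, check_every=1000):
    """Run `sql`, but terminate it once it exceeds the allowed amount of work.

    The watchdog is called every `check_every` SQLite VM instructions; returning
    a non-zero value aborts the running query.
    """
    checks = 0

    def watchdog():
        nonlocal checks
        checks += 1
        return 1 if checks > max_checks else 0

    conn.set_progress_handler(watchdog, check_every)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        # Raised when the watchdog interrupts the query (a sketch: other
        # operational errors would land here too).
        return None
    finally:
        conn.set_progress_handler(None, 1)  # remove the watchdog


conn = sqlite3.connect(":memory:")
runaway = (
    "WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 5000000) "
    "SELECT COUNT(*) FROM n"
)
blocked = run_governed(conn, runaway, max_checks=10)
allowed = run_governed(conn, "SELECT 1", max_checks=10)
```

The runaway query is stopped long before it finishes, while the small query passes through untouched.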

Improve Data Security.  As a common point of access, the DV Server can support another level of data access security to address the likely inconsistencies that exist across multiple data store environments.

As many folks have learned, Data Virtualization is not a substitute for a data warehouse or a data mart.  In order for a DV Server to process data, the data must be retrieved from the origin; consequently, running a query that joins tables spread across multiple systems containing millions of records isn’t practical.  An Ethernet network is no substitute for the high speed interconnect linking a computer’s processor and memory to online storage. However, when the data is spread across multiple systems and there’s no other query alternative, Data Virtualization is certainly worth investigating.
