The Power of Data Virtualization


I was participating in a discussion about Data Virtualization (DV) the other day and was intrigued by the different views that everyone had about a technology that’s been around for more than 10 years. For those of you who don’t participate in IT-centric, geekfest discussions on a regular basis, Data Virtualization software is middleware that allows various disparate data sources to look like a single relational database. Some folks characterize Data Virtualization as a software abstraction layer that removes the storage location and format complexities associated with manipulating data. The bottom line is that Data Virtualization software can make a BI (or any SQL) tool see data as though it’s contained within a single database even though it may be spread across multiple databases, XML files, and even Hadoop systems.

What intrigued me about the conversation is that most of the folks had been introduced to Data Virtualization not as an infrastructure tool that simplifies specific disparate data problems, but as the secret sauce or silver bullet for a specific application. They had all inherited an application that had been built outside of IT to address a business problem that required data to be integrated from a multitude of sources. And in each instance, the applications were able to capitalize on Data Virtualization as a more cost-effective solution for integrating detailed data. Instead of building a new platform to store and process another copy of the data, they used Data Virtualization software to query and integrate data from the individual source systems. And each “solution” utilized a different combination of functions and capabilities.

As with any technology discussion, there’s always someone who believes that their favorite technology is the best thing since sliced bread, and they want to apply their solution to every problem. Data Virtualization is an incredibly powerful technology with a broad array of functions that enable multi-source query processing. Given the relative obscurity of this data management technology, I thought I’d review some of the more basic capabilities it supports.

Multi-Source Query Processing.  Often referred to as Query Federation, this is the ability to have a single query process data across multiple data stores.

Simplify Data Access and Navigation.  The DV server exposes data from numerous component sources as a single (virtual) data source, and handles the various network, SQL dialect, and/or data conversion issues.

Integrate Data “On the Fly”.  This is referred to as Data Federation. The DV server retrieves and integrates source data to support each individual query. 

Access to Non-Relational Data. The DV server is able to portray non-relational data (e.g. XML data, flat files, Hadoop) as structured, relational tables.

Standardize and Transform Data. Once the data is retrieved from the origin, the DV server will convert the data (if necessary) into a format to support matching and integration.

Integrate Relational and Non-Relational Data. Because DV can make any data source (well, almost any) look like a relational table, this capability is implicit. Keep in mind that the data (or a subset of it) must have some sort of implicit structure.  

Expose a Data Services Interface. The DV server can expose a web service that is attached to a predefined query, which the server processes when the service is called.

Govern Ad Hoc Queries. The DV server can monitor query submissions, run time, and even complexity, and terminate or prevent processing under specific rule-based situations.

Improve Data Security.  As a common point of access, the DV server can add another level of data access security to address the inconsistencies that are likely to exist across multiple data store environments.
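To make a few of these capabilities concrete, here is a minimal sketch of federation in Python. It is a toy, not how any DV product is actually built: the customers table, the orders XML, and the federated_query helper are all made up for illustration. It pulls rows from a relational source and an XML source, presents both as relational tables, and answers a single SQL query across them.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Source 1 (hypothetical): a relational database holding customers.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2 (hypothetical): an XML document holding orders.
ORDERS_XML = """
<orders>
  <order customer_id="1" amount="250.00"/>
  <order customer_id="2" amount="99.50"/>
  <order customer_id="1" amount="40.25"/>
</orders>
"""


def federated_query(sql):
    """Fetch rows from each source, stage them as tables, and run one query."""
    # The in-memory database plays the role of the single virtual database.
    virtual = sqlite3.connect(":memory:")
    virtual.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    virtual.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

    # Retrieve from the relational source.
    virtual.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        crm.execute("SELECT id, name FROM customers"),
    )

    # Retrieve from the XML source and convert attributes into typed columns.
    root = ET.fromstring(ORDERS_XML.strip())
    orders = [
        (int(o.get("customer_id")), float(o.get("amount")))
        for o in root.iter("order")
    ]
    virtual.executemany("INSERT INTO orders VALUES (?, ?)", orders)

    rows = virtual.execute(sql).fetchall()
    virtual.close()
    return rows


totals = federated_query(
    "SELECT c.name, SUM(o.amount) "
    "FROM customers c JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
)
print(totals)
```

Notice that every call copies the source rows before the join can run. A real DV server is far smarter about which rows it retrieves, but the data still has to come from the origin for each query.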

As many folks have learned, Data Virtualization is not a substitute for a data warehouse or a data mart.  In order for a DV server to process data, the data must be retrieved from the origin; consequently, running a query that joins tables spread across multiple systems containing millions of records isn’t practical.  An Ethernet network is no substitute for the high speed interconnect linking a computer’s processor and memory to online storage. However, when the data is spread across multiple systems and there’s no other query alternative, Data Virtualization is certainly worth investigating.
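A rough back-of-the-envelope calculation shows why the network matters. The figures below (50 million rows, 200 bytes per row, a 1 Gbit/s link) are assumptions picked purely for illustration, and the estimate ignores protocol overhead, so treat it as a best case.

```python
def transfer_seconds(rows, bytes_per_row, link_bits_per_second):
    """Best-case time to move a table over a link, ignoring protocol overhead."""
    return rows * bytes_per_row * 8 / link_bits_per_second


# Illustrative assumptions, not measurements.
rows = 50_000_000
bytes_per_row = 200
gigabit_ethernet = 1_000_000_000  # 1 Gbit/s

seconds = transfer_seconds(rows, bytes_per_row, gigabit_ethernet)
print(f"{seconds:.0f} seconds just to move one table")
```

That is the cost of moving a single table once, before any join work starts, and it is paid again for every query that needs the full table.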


About Evan Levy

Evan Levy is a management consultant and partner at IntegralData. In addition to his day-to-day job responsibilities, Evan speaks, writes, and blogs about the challenges of managing and using data to support business decision making.

3 responses to “The Power of Data Virtualization”

  1. Suhaas says :

    Evan – Great article! Provides a clear understanding of the technology’s key capabilities for those who are new to it.

    To add to your ‘data services’ point, Data Virtualization enables real-time (or right time) data delivery to consumers due to its ‘on-the-fly’ processing as well as scheduling & caching features built into it.

  2. igpres says :

    There is an implication that data can never be permanently “lost”, and can always be retrieved in some way, such as by multi-channel virtualization. Doesn’t that make the uproar over government “storage” irrelevant? Whether it’s the Feds or the Terrorists, someone smart enough can always gain access to the metadata again.

    • Evan Levy says :

      I actually believe data can be lost. There’s no workaround when you delete database records or erase files if there’s no backup. Many believe that once data has been placed out on the web, it will last forever. While that may be technically accurate, I’m not sure I’d rely on that technicality.
