In one of my previous blogs, I wrote about Data Virtualization technology — one of the more interesting pieces of middleware technology that can simplify data management. While most of the commercial products in this space share a common set of features and functions, I thought I’d devote this blog to discussing the more advanced features. There are quite a few competing products; the real challenge in differentiating the products is to understand their more advanced features.
The attraction of data virtualization is that it simplifies data access. Most IT shops have one of everything – and this includes several different brands of commercial DBMSs, a few open source databases, a slew of BI/reporting tools, and the inevitable list of emerging and specialized tools and technologies (Hadoop, Dremel, Casandra, etc.) Supporting all of the client-to-server-to-repository interfaces (and the associated configurations) is both complex and time consuming. This is why the advanced capabilities of Data Virtualization have become so valuable to the IT world.
The following details aren’t arranged in any particular order. I’ve identified the ones that I’ve found to be the most valuable (and interesting). Let me also acknowledge not every DV product supports all of these features.
Intelligent data caching. Repository-to-DV Server data movement is the biggest obstacle in query response time. Most DV products are able to support static caching to reduce repetitive data movement (data is copied and persisted in the DV Server). Unfortunately, this approach has limited success when there are ad hoc users accessing dozens of sources and thousands of tables. The more effective solution is for the DV Server to monitor all queries and dynamically cache data based on user access, query load, and table (and data) access frequency.
Query optimization (w/multi-platform execution). While all DV products claim some amount of query optimization, it’s important to know the details. There are lots of tricks and techniques; however, look for optimization that understands source data volumes, data distribution, data movement latency, and is able to process data on any source platform.
Support for multiple client Interfaces. Since most companies have multiple database products, it can be cumbersome to support and maintain multiple client access configurations. The DV server can act as a single access point for multiple vendor products (a single ODBC interface can replace drivers for each DBMS brand). Additionally, most DV Server drivers support multiple different access methods (ODBC, JDBC, XML, and web services).
Attribute level or value specific data security. This feature supports data security at a much lower granularity than is typically available with most DBMS products. Data can be protected (or restricted) at individual column values for entire table or selective rows.
Metadata tracking and management. Since Data Virtualization is a query-centric middleware environment, it only makes sense to position this server to retrieve, reconcile, and store metadata content from multiple, disparate data repositories.
Data lineage. This item works in tandem with the metadata capability and augments the information by retaining the source details for all data that is retrieved. This not only includes source id information for individual records but also the origin, creation date, and native attribute details.
Query tracking for usage audit. Because the DV Server can act as a centralized access point for user tool access, there are several DV products that support the capture and tracking of all submitted queries. This can be used to track, measure, and analyze end user (or repository) access.
Workflow linkage and processing. This is the ability to execute predefined logic against specific data that is retrieved. While this concept is similar to a macro or stored procedure, it’s much more sophisticated. It could include the ability to direct job control or specialized processing against an answer set prior to delivery (e.g. data hygiene, external access control, stewardship approval, etc.)
Packaged Application Templates. Most packaged applications (CRM, ERP, etc.) contain thousands of tables and columns that can be very difficult to understand and query. Several DV vendors have developed templates containing predefined DV server views that access the most commonly queried data elements.
Setup and Configuration Wizards. Configuring a DV server to access the multiple data sources can be a very time consuming exercise; the administrator needs to define and configure every source repository, the underlying tables (or files), along with the individual data fields. To simplify setup, a configuration wizard reviews the dictionary of an available data source and generates the necessary DV Server configuration details. It further analyzes the table and column names to simplify naming conventions, joins, and data value conversion and standardization details.
Don’t be misled into thinking that Data Virtualization is a highly mature product space where all of the products are nearly identical. They aren’t. Most product vendors spend more time discussing their unique features instead of offering metrics about their their core features. It’s important to remember that every Data Virtualization product requires a server that retrieves and processes data to fulfill query requests. This technology is not a commodity, which means that details like setup/configuration time, query performance, and advanced features can vary dramatically across products. Benchmark and test drive the technology before buying.
I was participating in a discussion about Data Virtualization (DV) the other day and was intrigued with the different views that everyone had about a technology that’s been around for more than 10 years. For those of you that don’t participate in IT-centric, geekfest discussions on a regular basis, Data Virtualization software is middleware that allows various disparate data sources to look like a single relational database. Some folks characterize Data Virtualization as a software abstraction layer that removes the storage location and format complexities associated with manipulating data. The bottom line is that Data Virtualization software can make a BI (or any SQL) tool see data as though it’s contained within a single database even though it may be spread across multiple databases, XML files, and even Hadoop systems.
What intrigued me about the conversation is that most of the folks had been introduced to Data Virtualization not as an infrastructure tool that simplifies specific disparate data problems, but as the secret sauce or silver bullet for a specific application. They had all inherited an application that had been built outside of IT to address a business problem that required data to be integrated from a multitude of sources. And in each instance, the applications were able to capitalize on Data Virtualization as a more cost effective solution for integrating detailed data. Instead of building a new platform to store and process another copy of the data, they used Data Virtualization software to query and integrate data from the individual sources systems. And each “solution” utilized a different combination of functions and capabilities.
As with any technology discussion, there’s always someone that believes that their favorite technology is the best thing since sliced bread – and they want to apply their solution to every problem. Data Virtualization is an incredibly powerful technology with a broad array of functions that enable multi-source query processing. Given the relative obscurity of this data management technology, I thought I’d review some of the more basic capabilities supported by this technology.
Multi-Source Query Processing. This is often referred to as Query Federation. The ability to have a single query process data across multiple data stores.
Simplify Data Access and Navigation. Exposes data as single (virtual) data source from numerous component sources. The DV system handles the various network, SQL dialect, and/or data conversion issues.
Integrate Data “On the Fly”. This is referred to as Data Federation. The DV server retrieves and integrates source data to support each individual query.
Access to Non-Relational Data. The DV server is able to portray non-relational data (e.g. XML data, flat files, Hadoop, etc.) as structured, relational tables.
Standardize and Transform Data. Once the data is retrieved from the origin, the DV server will convert the data (if necessary) into a format to support matching and integration.
Integrate Relational and Non-Relational Data. Because DV can make any data source (well, almost any) look like a relational table, this capability is implicit. Keep in mind that the data (or a subset of it) must have some sort of implicit structure.
Expose a Data Services Interface. Exposing a web service that is attached to a predefined query that can be processed by the DV server.
Govern Ad Hoc Queries. The DV Server can monitor query submissions, run time, and even complexity – and terminate or prevent processing under specific rule-based situations.
Improve Data Security. As a common point of access, the DV Server can support another level of data access security to address the likely inconsistencies that exist across multiple data store environments.
As many folks have learned, Data Virtualization is not a substitute for a data warehouse or a data mart. In order for a DV Server to process data, the data must be retrieved from the origin; consequently, running a query that joins tables spread across multiple systems containing millions of records isn’t practical. An Ethernet network is no substitute for the high speed interconnect linking a computer’s processor and memory to online storage. However, when the data is spread across multiple systems and there’s no other query alternative, Data Virtualization is certainly worth investigating.
I presented a webinar a few weeks back that challenged the popular opinion that the only way to be successful with data science was to hire an individual that has a swiss army knife of data skills and business acumen. (The archived webinar link is http://goo.gl/Ka1H2I )
While I can’t argue on the value of such abilities, my belief is that these types of individuals are very rare, and the benefits of data science is something that can be valued by every company. Consequently, my belief is that you can approach data science successfully through building a team of focused staff members, providing they cover 5 role areas: Data Services, Data Engineer, Data Manager, Production Development, and the Data Scientist.
I received quite a few questions during and after the August 12th webinar, so I thought I would devote this week’s blog to those questions (and answers). As is always the case with a blog, feel free to comment, respond, or disagree – I’ll gladly post the feedback below.
Q: In terms of benefits and costs, do you have any words of wisdom in building a business case that can be taken to business leadership for funding
A: Business case constructs vary by company. What I encourage folks to focus on is the opportunity value in supporting a new initiative. Justifying an initial data science initiative shouldn’t be difficult if your company already supports individuals analyzing data on their desktops. We often find collecting the existing investment numbers and the results of your advanced analytics team (SAS, R, SPSS, etc.) often justifies delving into the world of Data Science
Q: One problem is that many business leaders do not have a concept of what goes into a scientific discovery process. They are not schooled as scientists.
A: You’re absolutely correct. Most managers are focused on establishing business process, measuring progress, and delivering results. Discovery and exploration isn’t always a predictable process. We often find that initial Data Science initiatives are more likely to be successful if the environment has already adopted the value of reporting and advanced analytics (numerical analysis, data mining, prediction, etc.) If your organization hasn’t fully adopted business intelligence and desktop analysis, you may not be ready for Data Science. If your organization already understands the value of detailed data and analysis – you might want to begin with a more focused analytic effort (e.g. identifying trend indicators, predictive details, or other modeling activities.) We’ve seen data science deliver significant business value, but it also requires a manager that understands the complexities and issues of data exploration and discovery.
Q: One of the challenges that we’ve seen in my company is the desire to force fit Data Science into a traditional IT waterfall development method instead of realizing the benefits of taking a more iterative or agile approach. Is there danger in this approach?
A: We find that the when organizations already have an existing (and robust) business intelligence and analytics environments, there’s a tendency to follow the tried and true practices of defined requirements, documented project plans, managed development, and scheduled delivery. One thing to keep in mind is that the whole premise of Data Science is analyzing data to uncover new patterns or knowledge. When you first undertake a Data Science initiative, it’s about exploration and discovery, not structured deliverables. It’s reasonable to spin up a project team (preferably using an iterative or agile methodology) once the discovery has been identified and there’s tangible business value to build and deploy a solution using the discovery. However, it’s important to allow the discovery to happen first.
You might consider reading an article from DJ Patil (“Building Data Science Teams“) that discusses the importance of having a Production Development role that I mentioned. This is the role that takes on the creation of a production deliverable from the raw artifacts and discoveries made by the Data Science team
Q: It seems like your Data Engineer has a similar role and responsibility set as a Data Warehouse architect or ETL developer
A: The Data Engineers are a hybrid of sorts. They handle all of the data transformation and integration activities and they are also deeply knowledgeable of the underlying data sources and the content. We often find that the Data Warehouse Architect and ETL Developer are very knowledgeable about the data structures of source and target systems, but they aren’t typically knowledgeable on social media content, external sources, unstructured data, and the lower details of the specific data attributes. Obviously, these skills vary from organization to organization. If the individuals in your organization are intimate with this level of knowledge, they may be able to cover the activities associated with a Data Engineer.
Q : What is the difference between the Data Engineers and Data Management team members?
A: Data Engineers focus on retrieving and manipulation data from the various data stores (external and internal). They deal with data transformation, correction, and integration. The Data Management folks support the Data Engineers (thus the skill overlap) but focus more on managing and tracking the actual data assets that are going to be used by data scientists and other analysts within the company (understanding the content, the values, the formats, and the idiosyncrasies).
Q: Isn’t there a risk in building a team of folks with specialized skills (instead of having individuals with a broad set of knowledge). With specialists, don’t we risk freezing the current state of the art, making the organization inflexible to change? Doesn’t it also reduce everyone’s overall understanding of the goal (e.g. the technicians focus on their tools’ functions, not the actual results they’re being expected to deliver)
A: While I see your perspective, I’d suggest a slightly different view. The premise of defining the various roles is to identify the work activities (and skills) necessary to complete a body of work. Each role should still evolve with skill growth — to ensure individuals can handle more and more complex activities. There will continue to be enormous growth and evolution in the world of Data Science in the variety of external data sources, number of data interfaces, and the variety of data integration tools. Establishing different roles ensures there’s an awareness of the breadth of skills required to complete the body of work. It’s entirely reasonable for an individual to cover multiple roles; however, as the workload increases, it’s very likely that specialization will be necessary to support the added work effort. Henry Ford used the assembly line to revolutionize manufacturing. He was able to utilize less skilled workers to handle the less sophisticated tasks so he could ensure his craftsmen continued to focus on more and more specialized (and complex) activities. Data integration and management activities support (and enable) Data Science. Specialization should be focused on the less complex (and more easily staffed) roles that will free up the Data Scientist’s time to allow them to focus on their core strengths.
Q: : Is this intended to be an Enterprise wide team?
A: We’ve seen Data Science teams be positioned as an organizational resource (e.g. specific to support marketing analytics); we’ve also seen teams set up as an enterprise resource. The decision is typically driven by the culture and needs of your company.
Q: Where is the business orientation in the data team? Do you need someone that knows what questions to ask and then take all of the data and distill it down to insights that a CEO can implement.
A: The “business orientation” usually resides with the Data Scientist role. The Data Science team isn’t typically setup to respond to business user requests (like a traditional BI team); they are usually driven by the Data Scientist that understands and is tasked with addressing the priority needs of the company. The Data Scientist doesn’t work in a vacuum; they have to interact with key business stakeholders on a regular basis. However, Data Science shouldn’t be structured like a traditional applications development team either. The teams is focused on discovery and exploration – not core IT development. Take a look at one of the more popular articles on the topic, “Data Scientist: the sexiest job of the 21st century” by Tom Davenport and DJ Patil http://goo.gl/CmCtv9
Photo courtesy of National Archive via Flickr (Creative Commons license).