In order to prepare for the cooking gauntlet that often occurs with the end of year holiday season, I decided to purchase a new rotisserie oven. The folks at Acme Rotisserie include a large amount of documentation with their rotisserie. I reviewed the entire pile and was a bit surprised by the warranty registration card. The initial few questions made sense: serial number, place of purchase, date of purchase, my home address. The other questions struck me as a bit too inquisitive: number of household occupants, household income, own/rent my residence, marital status, and education level. Obviously, this card was a Trojan horse of sorts; provide registration details –and all kinds of other personal information. They wanted me to give away my personal information so they could analyze it, sell it, and make money off of it.
Companies collecting and analyzing consumer data isn’t anything new –it’s been going on for decades. In fact, there are laws in place to protect consumer’s data in quite a few industries (healthcare, telecommunications, and financial services). Most of the laws focus on protecting the information that companies collect based on their relationship with you. It’s not the just details that you provide to them directly; it’s the information that they gather about how you behave and what you purchase. Most folks believe behavioral information is more valuable than the personal descriptive information you provide. The reason is simple: you can offer creative (and highly inaccurate) details about your income, your education level, and the car you drive. You can’t really lie about your behavior.
I’m a big fan of sharing my information if it can save me time, save me money, or generate some sort of benefit. I’m willing to share my waist size, shirt size, and color preferences with my personal shopper because I know they’ll contact me when suits or other clothing that I like is available at a good price. I’m fine with a grocer tracking my purchases because they’ll offer me personalized coupons for those products. I’m not okay with the grocer selling that information to my health insurer. Providing my information to a company to enhance our relationship is fine; providing my information to a company so they can share, sell, or otherwise unilaterally benefit from it is not fine. My data is proprietary and my intellectual property.
Clearly companies view consumer data to be a highly valuable asset. Unfortunately, we’ve created a situation where there’s little or no cost to retain, use, or abuse that information. As abuse and problems have occurred within certain industries (financial services, healthcare, and others), we’ve created legislation to force companies to responsibly invest in the management and protection of that information. They have to contact you to let you know they have your information and allow you to update communications and marketing options. It’s too bad that every company with your personal information isn’t required to behave in the same way. If data is so valuable that a company retains it, requiring some level of maintenance (and responsibility) shouldn’t be a big deal.
It’s really too bad that companies with copies of my personal information aren’t required to contact me to update and confirm the accuracy of all of my personal details. That would ensure that all of the specialized big data analytics that are being used to improve my purchase experiences were accurate. If I knew who had my data, I could make sure that my preferences were up to date and that the data was actually accurate.
It’s unfortunate that Acme Rotisserie isn’t required to contact me to confirm that I have 14 children, an advanced degree in swimming pool construction, and that I have Red Ferrari in my garage. It will certainly be interesting to see the personalized offers I receive for the upcoming Christmas shopping season.
I was recently asked about my opinion for the potential of Hadoop replacing a company’s data warehouse (DW). While there’s lots to be excited about when it comes to Hadoop, I’m not currently in the camp of folks that believe it’s practical to use Hadoop to replace a company’s DW. Most corporate DW systems are based on commercial relational database products and can store and manage multiple terabytes of data and support hundreds (if not thousands) of concurrent users. It’s fairly common for these systems to handle complex, mixed workloads –queries processing billions of rows across numerous tables along with simple primary key retrieval requests all while continually loading data. The challenge today is that Hadoop simply isn’t ready for this level of complexity.
All that being said, I do believe there’s a huge opportunity to use Hadoop to replace a significant amount of processing that is currently being handled by most DWs. Oh, and data warehouse user won’t be affected at all.
Let’s review a few fundamental details about the DW. There’s two basic data processing activities that occur on a DW: query processing and transformation processing. Query processing is servicing the SQL that’s submitted from all of the tools and applications on the users’ desktops, tablets, and phones. Transformation processing is the workload involved with converting data from their source application formats to the format required by the data warehouse. While the most visible activity to business users is query processing, it is typically the smaller of the two. Extracting and transforming the dozens (or hundreds) of source data files for the DW is a huge processing activity. In fact, most DWs are not sized for query processing; they are sized for the daily transformation processing effort.
It’s important to realize that one of the most critical service level agreements (SLAs) of a DW is data delivery. Business users want their data first thing each morning. That means the DW has to be sized to deliver data reliably each and every business morning. Since most platforms are anticipated to have a 3+ year life expectancy, IT has to size the DW system based on the worst case data volume scenario for that entire period (end of quarter, end of year, holidays, etc.) This means the DW is sized to address a maximum load that may only occur a few times during that entire period.
This is where the opportunity for Hadoop seems pretty obvious. Hadoop is a parallel, scalable framework that handles distributed batch processing and large data volumes. It’s really a set of tools and technologies for developers, not end users. This is probably why so many ETL (extract, transformation, and load) product vendors have ported their products to execute within a Hadoop environment. It only makes sense to migrate processing from a specialized platform to commodity hardware. Why bog down and over invest in your DW platform if you can handle the heavy lifting of transformation processing on a less expensive platform?
Introducing a new system to your DW environment will inevitably create new work for your DW architects and developers. However, the benefits are likely to be significant. While some might view such an endeavor as a creative way to justify purchasing new hardware and installing Hadoop, the real reason is to extend the life of the data warehouse (and save your company a bunch of money by deferring a DW upgrade)
I read an interesting tidbit about data the other day: the United States Postal Service processed more than 47 million changes of addresses in the last year. That’s nearly 1 in 6 people. In the world of data, that factoid is a simple example of the challenge of addressing stale data and data quality. The idea of stale data is that as data ages, its accuracy and associated business rules can change.
There’s lots of examples of how data in your data warehouse can age and degrade in accuracy and quality: people move, area codes change, postal/zip codes change, product descriptions change, and even product SKUs can change. Data isn’t clean and accurate forever; it requires constant review and maintenance. This shouldn’t be much of a surprise for folks that view data as a corporate asset; any asset requires ongoing maintenance in order to retain and ensure its value. The challenge with maintaining any asset is establishing a reasonable maintenance plan.
Unfortunately, while IT teams are exceptionally strong in planning and carrying out application maintenance, it’s quite rare that data maintenance gets any attention. In the data warehousing world, data maintenance is typically handled in a reactive, project-centric manner. Nearly every data warehouse (or reporting) team has to deal with data maintenance issues whenever a company changes major business processes or modifies customer or product groupings (e.g. new sales territories, new product categories, etc.) This happens so often, most data warehouse folks have even given it a name: Recasting History. Regardless of what you call it, it’s a common occurrence and there are steps that can be taken to simplify the ongoing effort of data maintenance.
- Establish a regularly scheduled data maintenance window. Just like the application maintenance world, identify a window of time when data maintenance can be applied without impacting application processing or end user access
- Collect and publish data quality details. Profile and track the content of the major subject area tables within your data warehouse environment. Any significant shift in domain values, relationship details, or data demographics can be discovered prior to a user calling to report an undetected data problem
- Keep the original data. Most data quality processing overwrites original content with new details. Instead, keep the cleansed data and place the original values at the end of your table records. While this may require a bit more storage, it will dramatically simplify maintenance when rule changes occur in the future
- Add source system identification and creation date/time details to every record. While this may seem tedious and unnecessary, these two fields can dramatically simplify maintenance and trouble shooting in the future
- Schedule a regular data change control meeting. This too is similar in concept to the change control meeting associated with IT operations teams. This is a forum for discussing data content issues and changes
Unfortunately, I often find that data maintenance is completely ignored. The problem is that fixing broken or inaccurate data isn’t sexy; developing a data maintenance plan isn’t always fun. Most data warehouse development teams are buried with building new reports, loading new data, or supporting the ongoing ETL jobs; they haven’t given any attention to the quality or accuracy of the actual content they’re moving and reporting. They simply don’t have the resources or time to address data maintenance as a proactive activity.
Business users clamor for new data and new reports; new funding is always tied to new business capabilities. Support costs are budgeted, but they’re focused on software and hardware maintenance activities. No one ever considers data maintenance; it’s simply ignored and forgotten.
Interesting that we view data as a corporate asset – a strategic corporate asset – and there’s universal agreement that hardware and software are simply tools to support enablement. And where are we investing in maintenance? The commodity tools, not the strategic corporate asset.
Photo courtesy of DesignzillasFlickr via Flickr (Creative Commons license).
In one of my previous blogs, I wrote about Data Virtualization technology — one of the more interesting pieces of middleware technology that can simplify data management. While most of the commercial products in this space share a common set of features and functions, I thought I’d devote this blog to discussing the more advanced features. There are quite a few competing products; the real challenge in differentiating the products is to understand their more advanced features.
The attraction of data virtualization is that it simplifies data access. Most IT shops have one of everything – and this includes several different brands of commercial DBMSs, a few open source databases, a slew of BI/reporting tools, and the inevitable list of emerging and specialized tools and technologies (Hadoop, Dremel, Casandra, etc.) Supporting all of the client-to-server-to-repository interfaces (and the associated configurations) is both complex and time consuming. This is why the advanced capabilities of Data Virtualization have become so valuable to the IT world.
The following details aren’t arranged in any particular order. I’ve identified the ones that I’ve found to be the most valuable (and interesting). Let me also acknowledge not every DV product supports all of these features.
Intelligent data caching. Repository-to-DV Server data movement is the biggest obstacle in query response time. Most DV products are able to support static caching to reduce repetitive data movement (data is copied and persisted in the DV Server). Unfortunately, this approach has limited success when there are ad hoc users accessing dozens of sources and thousands of tables. The more effective solution is for the DV Server to monitor all queries and dynamically cache data based on user access, query load, and table (and data) access frequency.
Query optimization (w/multi-platform execution). While all DV products claim some amount of query optimization, it’s important to know the details. There are lots of tricks and techniques; however, look for optimization that understands source data volumes, data distribution, data movement latency, and is able to process data on any source platform.
Support for multiple client Interfaces. Since most companies have multiple database products, it can be cumbersome to support and maintain multiple client access configurations. The DV server can act as a single access point for multiple vendor products (a single ODBC interface can replace drivers for each DBMS brand). Additionally, most DV Server drivers support multiple different access methods (ODBC, JDBC, XML, and web services).
Attribute level or value specific data security. This feature supports data security at a much lower granularity than is typically available with most DBMS products. Data can be protected (or restricted) at individual column values for entire table or selective rows.
Metadata tracking and management. Since Data Virtualization is a query-centric middleware environment, it only makes sense to position this server to retrieve, reconcile, and store metadata content from multiple, disparate data repositories.
Data lineage. This item works in tandem with the metadata capability and augments the information by retaining the source details for all data that is retrieved. This not only includes source id information for individual records but also the origin, creation date, and native attribute details.
Query tracking for usage audit. Because the DV Server can act as a centralized access point for user tool access, there are several DV products that support the capture and tracking of all submitted queries. This can be used to track, measure, and analyze end user (or repository) access.
Workflow linkage and processing. This is the ability to execute predefined logic against specific data that is retrieved. While this concept is similar to a macro or stored procedure, it’s much more sophisticated. It could include the ability to direct job control or specialized processing against an answer set prior to delivery (e.g. data hygiene, external access control, stewardship approval, etc.)
Packaged Application Templates. Most packaged applications (CRM, ERP, etc.) contain thousands of tables and columns that can be very difficult to understand and query. Several DV vendors have developed templates containing predefined DV server views that access the most commonly queried data elements.
Setup and Configuration Wizards. Configuring a DV server to access the multiple data sources can be a very time consuming exercise; the administrator needs to define and configure every source repository, the underlying tables (or files), along with the individual data fields. To simplify setup, a configuration wizard reviews the dictionary of an available data source and generates the necessary DV Server configuration details. It further analyzes the table and column names to simplify naming conventions, joins, and data value conversion and standardization details.
Don’t be misled into thinking that Data Virtualization is a highly mature product space where all of the products are nearly identical. They aren’t. Most product vendors spend more time discussing their unique features instead of offering metrics about their their core features. It’s important to remember that every Data Virtualization product requires a server that retrieves and processes data to fulfill query requests. This technology is not a commodity, which means that details like setup/configuration time, query performance, and advanced features can vary dramatically across products. Benchmark and test drive the technology before buying.
I was participating in a discussion about Data Virtualization (DV) the other day and was intrigued with the different views that everyone had about a technology that’s been around for more than 10 years. For those of you that don’t participate in IT-centric, geekfest discussions on a regular basis, Data Virtualization software is middleware that allows various disparate data sources to look like a single relational database. Some folks characterize Data Virtualization as a software abstraction layer that removes the storage location and format complexities associated with manipulating data. The bottom line is that Data Virtualization software can make a BI (or any SQL) tool see data as though it’s contained within a single database even though it may be spread across multiple databases, XML files, and even Hadoop systems.
What intrigued me about the conversation is that most of the folks had been introduced to Data Virtualization not as an infrastructure tool that simplifies specific disparate data problems, but as the secret sauce or silver bullet for a specific application. They had all inherited an application that had been built outside of IT to address a business problem that required data to be integrated from a multitude of sources. And in each instance, the applications were able to capitalize on Data Virtualization as a more cost effective solution for integrating detailed data. Instead of building a new platform to store and process another copy of the data, they used Data Virtualization software to query and integrate data from the individual sources systems. And each “solution” utilized a different combination of functions and capabilities.
As with any technology discussion, there’s always someone that believes that their favorite technology is the best thing since sliced bread – and they want to apply their solution to every problem. Data Virtualization is an incredibly powerful technology with a broad array of functions that enable multi-source query processing. Given the relative obscurity of this data management technology, I thought I’d review some of the more basic capabilities supported by this technology.
Multi-Source Query Processing. This is often referred to as Query Federation. The ability to have a single query process data across multiple data stores.
Simplify Data Access and Navigation. Exposes data as single (virtual) data source from numerous component sources. The DV system handles the various network, SQL dialect, and/or data conversion issues.
Integrate Data “On the Fly”. This is referred to as Data Federation. The DV server retrieves and integrates source data to support each individual query.
Access to Non-Relational Data. The DV server is able to portray non-relational data (e.g. XML data, flat files, Hadoop, etc.) as structured, relational tables.
Standardize and Transform Data. Once the data is retrieved from the origin, the DV server will convert the data (if necessary) into a format to support matching and integration.
Integrate Relational and Non-Relational Data. Because DV can make any data source (well, almost any) look like a relational table, this capability is implicit. Keep in mind that the data (or a subset of it) must have some sort of implicit structure.
Expose a Data Services Interface. Exposing a web service that is attached to a predefined query that can be processed by the DV server.
Govern Ad Hoc Queries. The DV Server can monitor query submissions, run time, and even complexity – and terminate or prevent processing under specific rule-based situations.
Improve Data Security. As a common point of access, the DV Server can support another level of data access security to address the likely inconsistencies that exist across multiple data store environments.
As many folks have learned, Data Virtualization is not a substitute for a data warehouse or a data mart. In order for a DV Server to process data, the data must be retrieved from the origin; consequently, running a query that joins tables spread across multiple systems containing millions of records isn’t practical. An Ethernet network is no substitute for the high speed interconnect linking a computer’s processor and memory to online storage. However, when the data is spread across multiple systems and there’s no other query alternative, Data Virtualization is certainly worth investigating.
I’ve been intrigued with all of the attention that the world of Data Science has received. It seems that every popular business magazine has published several articles and it’s become a mainstream topic at most industry conferences. One of the things that struck me as odd is that there’s a group of folks that actually believe that all of the activities necessary to deliver new business discoveries with data science can be reasonably addressed by finding individuals that have a cornucopia of technical and business skills. One popular belief is that a Data Scientist should be able to address all of the business and technical activities necessary to identify, qualify, prove, and explain a business idea with detailed data.
If you can find individuals that comprehend the peculiarities of source data extraction, have mastered data integration techniques, understand parallel algorithms to process tens of billions of records, have worked with specialized data preparation tools, and can debate your company’s business strategy and priorities – Cool! Hire these folks and chain their leg to the desk as soon as possible.
If you can’t, you might consider building a team that can cover the various roles that are necessary to support a Data Science initiative. There’s a lot more to Data Science than simply processing a pile of data with the latest open source framework. The roles that you should consider include:
Manages the various data repositories that feed data to the analytics effort. This includes understanding the schemas, tracking the data content, and making sure the platforms are maintained. Companies with existing data warehouses, data marts, or reporting systems typically have a group of folks focused on these activities (DBAs, administrators, etc.).
Responsible for developing and implementing tools to gather, move, process, and manage data. In most analytics environments, these activities are handled by the data integration team. In the world of Big Data or Data Science, this isn’t just ETL development for batch files; it also includes processing data streams and handling the cleansing and standardization of numerous structured and unstructured data sources.
Handles the traditional data management or source data stewardship role; the focus is supporting development access and manipulation of data content. This includes tracking the available data sources (internal and external), understanding the location and underlying details of specific attributes, and supporting developers’ code construction efforts.
Responsible for packaging the Data Scientist discoveries into a production ready deliverable. This may include (one or) many components: new data attributes, new algorithms, a new data processing method, or an entirely new end-user tool. The goal is to ensure that the discoveries deliver business value.
The team leader and the individual that excels at analyzing data to help a business gain a competitive edge. They are adept at technical activities and equally qualified to lead a business discussion as to the benefits of a new business strategy or approach. They can tackle all aspects of a problem and often lead the interdisciplinary team to construct an analytics solution.
There’s no shortage of success stories about the amazing data discoveries uncovered by Data Scientists. In many of those companies, the Data Scientist didn’t have an incumbent data warehousing or analytics environment; they couldn’t pick up the phone to call a data architect, there wasn’t any metadata documentation, and their company didn’t have a standard set of data management tools. They were on their own. So, the Data Scientist became “chief cook and bottle washer” for everything that is big data and analytics.
Most companies today have institutionalized data analysis; there are multiple data warehouses, lots of dashboards, and even a query support desk. And while there’s a big difference between desktop reporting and processing social media feedback, much of the “behind the scenes” data management and data integration work is the same. If your company already has an incumbent data and analytics environment, it makes sense to leverage existing methods, practices, and staff skills. Let the Data Scientists focus on identifying the next big idea and the heavy analytics; let the rest of the team deal with all of the other work.
Not long ago, I was asked to review a client’s program initiative that was focused on constructing a new customer repository that would establish a single version of truth. The client was very excited about using Master Data Management (MDM) to deliver their new customer view. The problem statement was well thought out: their customer data is spread across 11 different systems; users and developers retrieve data from different sources; reports reflect conflicting details; and an enormous amount of manual effort is required to manage the data. The project’s benefits were also well thought out: increased data quality, improved reporting accuracy, and improved end user data access. And, (as you can probably imagine), the crowning objective of the project was going to be creating a Single View of the Customer. The program’s stakeholders had done a good job of communicating the details: they reviewed the existing business challenges, identified the goals and objectives, and even provided a summary of high-level requirements. They were going to house all of their customer data on an MDM hub. There was only one problem: they needed a customer data mart, not an MDM hub.
I hate the idea of discussing technical terms and details with either business or IT staff. It gets particularly uncomfortable when someone was misinformed about a new technology (and this happens all the time when vendors roll out new products to their sales force). I won’t count the number of times that I’ve seen projects implemented with the wrong technology, because the organization wanted to get a copy of the latest and greatest technical toy. A few of my colleagues and I used to call this the “bright shiny project syndrome”. While it’s perfectly acceptable to acquire a new technology to solve a problem, it can be a very expensive to purchase a technology and force fit a solution that it doesn’t easily address.
It’s frequent that folks confuse the function and purpose of Master Data Management with Data Warehousing. I suspect the core of the problem is that when folks hear about the idea of “reference data” or a “golden record”, they have this mental picture of a single platform containing all of the data. While I can’t argue with the benefit of having all the data in one place (data warehousing has been around for more than 20 years), that’s not what MDM is about. Data Warehousing became popular because of its success in storing a company’s historical data to support cross-functional (multi-subject area) analysis. MDM is different; it’s focused on reconciling and tracking a single subject area’s reference data across the multitude of systems that create that data. Some examples of a subject area include customer, product, and location.
If you look at the single biggest obstacle in data integration, it’s dealing with all of the complexity of merging data from different systems. It’s fairly common for different application systems to use different reference data (The CRM system, the Sales system, and the Billing system each use different values to identify a single customer). The only way to link data from these different systems is to compare the reference data (names, addresses, phone numbers, etc.) from each system with the hope that there are enough identical values in each to support the match. The problem with this approach is that it simply doesn’t work when a single individual may have multiple name variations, multiple addresses, and multiple phone numbers. The only reasonable solution is the use of advanced algorithms that are specially designed to support the processing and matching of specific subject area details. That’s the secret sauce of MDM – and that’s what’s contained within a commercial MDM product.
The MDM hub not only contains master records (the details identifying each individual subject area entry), it also contains a cross reference list of each individual subject area entry along with the linkage details to every other application system. And, it’s continually updated as the values change within each individual system. The idea is that an MDM hub is a high performance, transactional system focused on matching and reconciling subject area reference data. While we’ve illustrated how this capability simplifies data warehouse development, this transactional capability also enables individual application systems to move and integrate data between transactional systems more efficiently too.
The enormous breadth and depth of corporate data makes it impractical to store all of our data within a single system. It’s become common practice to prune and trim the contents of our data warehouses to limit the breadth and history of data. If you consider recent advances with big data, cloud computing, and SaaS, it becomes even more apparent that storing all of a company’s subject area data in a single place isn’t practical. That’s one of the reasons that most companies have numerous data marts and operational applications integrating and loading their own data to support their highly diverse and unique business needs. An MDM hub is focused on tracking specific subject area details across multiple systems to allow anyone to find, gather, and integrate the data they need from any system.
I recently crossed paths with the above mentioned client. Their project was wildly successful – they ended up deploying both an MDM hub and a customer data mart to address their needs. They mentioned that one of the “aha” moments that occurred during our conversation was when they realized that they needed to refocus everyone’s attention towards the business value and benefits of the project instead of the details and functions of MDM. While I was thrilled with their program’s success, I was even more excited to learn that someone was finally able to compete against the “bright shiny project syndrome” and win.