I was recently asked for my opinion on the potential of Hadoop replacing a company’s data warehouse (DW). While there’s lots to be excited about when it comes to Hadoop, I’m not currently in the camp of folks that believe it’s practical to use Hadoop to replace a company’s DW. Most corporate DW systems are based on commercial relational database products and can store and manage multiple terabytes of data and support hundreds (if not thousands) of concurrent users. It’s fairly common for these systems to handle complex, mixed workloads: queries processing billions of rows across numerous tables, along with simple primary key retrieval requests, all while continually loading data. The challenge today is that Hadoop simply isn’t ready for this level of complexity.
All that being said, I do believe there’s a huge opportunity to use Hadoop to replace a significant amount of processing that is currently being handled by most DWs. Oh, and data warehouse users won’t be affected at all.
Let’s review a few fundamental details about the DW. There are two basic data processing activities that occur on a DW: query processing and transformation processing. Query processing is servicing the SQL that’s submitted from all of the tools and applications on the users’ desktops, tablets, and phones. Transformation processing is the workload involved with converting data from its source application formats to the format required by the data warehouse. While the most visible activity to business users is query processing, it is typically the smaller of the two. Extracting and transforming the dozens (or hundreds) of source data files for the DW is a huge processing activity. In fact, most DWs are not sized for query processing; they are sized for the daily transformation processing effort.
It’s important to realize that one of the most critical service level agreements (SLAs) of a DW is data delivery. Business users want their data first thing each morning. That means the DW has to be sized to deliver data reliably each and every business morning. Since most platforms are anticipated to have a 3+ year life expectancy, IT has to size the DW system based on the worst case data volume scenario for that entire period (end of quarter, end of year, holidays, etc.) This means the DW is sized to address a maximum load that may only occur a few times during that entire period.
This is where the opportunity for Hadoop seems pretty obvious. Hadoop is a parallel, scalable framework that handles distributed batch processing and large data volumes. It’s really a set of tools and technologies for developers, not end users. This is probably why so many ETL (extract, transform, and load) product vendors have ported their products to execute within a Hadoop environment. It only makes sense to migrate processing from a specialized platform to commodity hardware. Why bog down and overinvest in your DW platform if you can handle the heavy lifting of transformation processing on a less expensive platform?
Introducing a new system to your DW environment will inevitably create new work for your DW architects and developers. However, the benefits are likely to be significant. While some might view such an endeavor as a creative way to justify purchasing new hardware and installing Hadoop, the real reason is to extend the life of the data warehouse (and save your company a bunch of money by deferring a DW upgrade).
I read an interesting tidbit about data the other day: the United States Postal Service processed more than 47 million changes of addresses in the last year. That’s nearly 1 in 6 people. In the world of data, that factoid is a simple example of the challenge of addressing stale data and data quality. The idea of stale data is that as data ages, its accuracy and associated business rules can change.
There are lots of examples of how data in your data warehouse can age and degrade in accuracy and quality: people move, area codes change, postal/zip codes change, product descriptions change, and even product SKUs can change. Data isn’t clean and accurate forever; it requires constant review and maintenance. This shouldn’t be much of a surprise for folks that view data as a corporate asset; any asset requires ongoing maintenance in order to retain and ensure its value. The challenge with maintaining any asset is establishing a reasonable maintenance plan.
Unfortunately, while IT teams are exceptionally strong in planning and carrying out application maintenance, it’s quite rare that data maintenance gets any attention. In the data warehousing world, data maintenance is typically handled in a reactive, project-centric manner. Nearly every data warehouse (or reporting) team has to deal with data maintenance issues whenever a company changes major business processes or modifies customer or product groupings (e.g. new sales territories, new product categories, etc.) This happens so often, most data warehouse folks have even given it a name: Recasting History. Regardless of what you call it, it’s a common occurrence and there are steps that can be taken to simplify the ongoing effort of data maintenance.
- Establish a regularly scheduled data maintenance window. Just like the application maintenance world, identify a window of time when data maintenance can be applied without impacting application processing or end user access
- Collect and publish data quality details. Profile and track the content of the major subject area tables within your data warehouse environment. Any significant shift in domain values, relationship details, or data demographics can be discovered prior to a user calling to report an undetected data problem
- Keep the original data. Most data quality processing overwrites original content with new details. Instead, keep the cleansed data and place the original values at the end of your table records. While this may require a bit more storage, it will dramatically simplify maintenance when rule changes occur in the future
- Add source system identification and creation date/time details to every record. While this may seem tedious and unnecessary, these two fields can dramatically simplify maintenance and trouble shooting in the future
- Schedule a regular data change control meeting. This too is similar in concept to the change control meeting associated with IT operations teams. This is a forum for discussing data content issues and changes
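Three of the suggestions above (keeping the original values, stamping every record with its source system and creation time, and profiling content for shifts) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed design: the field names (`zip_code_original`, `source_system`, `created_at`), the `cleanse_zip` rule, and the 10% threshold are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def cleanse_zip(raw):
    """Hypothetical cleansing rule: keep digits only, use the first five."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[:5] if len(digits) >= 5 else None


def build_record(customer_id, raw_zip, source_system):
    """Store the cleansed value plus audit fields, with the original value last."""
    return {
        "customer_id": customer_id,
        "zip_code": cleanse_zip(raw_zip),
        "source_system": source_system,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "zip_code_original": raw_zip,  # untouched source value, kept at the end
    }


def profile(records, column):
    """Return the share of rows held by each value in a column."""
    counts = Counter(r[column] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def shifted_values(previous, current, threshold=0.10):
    """List values whose share moved by more than `threshold` between two profiles."""
    values = set(previous) | set(current)
    return sorted(
        (v for v in values
         if abs(current.get(v, 0.0) - previous.get(v, 0.0)) > threshold),
        key=str,
    )
```

Comparing last week's profile of a column against this week's with `shifted_values` is one simple way to spot a change in domain values before a user calls to report it. Because the original value travels with the record, a later rule change can be reapplied without going back to the source system.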
Unfortunately, I often find that data maintenance is completely ignored. The problem is that fixing broken or inaccurate data isn’t sexy; developing a data maintenance plan isn’t always fun. Most data warehouse development teams are buried with building new reports, loading new data, or supporting the ongoing ETL jobs; they haven’t given any attention to the quality or accuracy of the actual content they’re moving and reporting. They simply don’t have the resources or time to address data maintenance as a proactive activity.
Business users clamor for new data and new reports; new funding is always tied to new business capabilities. Support costs are budgeted, but they’re focused on software and hardware maintenance activities. No one ever considers data maintenance; it’s simply ignored and forgotten.
Interesting that we view data as a corporate asset – a strategic corporate asset – and there’s universal agreement that hardware and software are simply tools to support enablement. And where are we investing in maintenance? The commodity tools, not the strategic corporate asset.
Photo courtesy of DesignzillasFlickr via Flickr (Creative Commons license).
In one of my previous blogs, I wrote about Data Virtualization technology — one of the more interesting pieces of middleware technology that can simplify data management. While most of the commercial products in this space share a common set of features and functions, I thought I’d devote this blog to discussing the more advanced features. There are quite a few competing products; the real challenge in differentiating the products is to understand their more advanced features.
The attraction of data virtualization is that it simplifies data access. Most IT shops have one of everything – and this includes several different brands of commercial DBMSs, a few open source databases, a slew of BI/reporting tools, and the inevitable list of emerging and specialized tools and technologies (Hadoop, Dremel, Cassandra, etc.). Supporting all of the client-to-server-to-repository interfaces (and the associated configurations) is both complex and time consuming. This is why the advanced capabilities of Data Virtualization have become so valuable to the IT world.
The following details aren’t arranged in any particular order. I’ve identified the ones that I’ve found to be the most valuable (and interesting). Let me also acknowledge not every DV product supports all of these features.
Intelligent data caching. Repository-to-DV Server data movement is the biggest obstacle in query response time. Most DV products are able to support static caching to reduce repetitive data movement (data is copied and persisted in the DV Server). Unfortunately, this approach has limited success when there are ad hoc users accessing dozens of sources and thousands of tables. The more effective solution is for the DV Server to monitor all queries and dynamically cache data based on user access, query load, and table (and data) access frequency.
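To make the difference between static and dynamic caching concrete, here is a minimal Python sketch of a cache that admits a table only once it has been requested often enough, and evicts the least-requested table when full. It is a toy model, not how any particular DV product works: the `FrequencyCache` class, the `fetch` callable, and the `min_hits` and `capacity` settings are hypothetical, and it ignores cache invalidation when source data changes.

```python
from collections import Counter


class FrequencyCache:
    """Caches a table only after it has been requested `min_hits` times."""

    def __init__(self, fetch, min_hits=3, capacity=2):
        self.fetch = fetch          # callable that reads a table from its source
        self.min_hits = min_hits    # requests needed before a table is cached
        self.capacity = capacity    # maximum number of cached tables
        self.hits = Counter()       # request count per table
        self.cache = {}

    def get(self, table):
        self.hits[table] += 1
        if table in self.cache:
            return self.cache[table]
        data = self.fetch(table)    # the expensive repository-to-server move
        if self.hits[table] >= self.min_hits:
            self._admit(table, data)
        return data

    def _admit(self, table, data):
        if len(self.cache) >= self.capacity:
            coldest = min(self.cache, key=lambda t: self.hits[t])
            if self.hits[coldest] >= self.hits[table]:
                return              # cached tables are at least as hot; skip
            del self.cache[coldest]
        self.cache[table] = data
```

The point of the sketch is that nobody has to decide in advance which tables to persist; the query traffic decides.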
Query optimization (w/multi-platform execution). While all DV products claim some amount of query optimization, it’s important to know the details. There are lots of tricks and techniques; however, look for optimization that understands source data volumes, data distribution, data movement latency, and is able to process data on any source platform.
Support for multiple client Interfaces. Since most companies have multiple database products, it can be cumbersome to support and maintain multiple client access configurations. The DV server can act as a single access point for multiple vendor products (a single ODBC interface can replace drivers for each DBMS brand). Additionally, most DV Server drivers support multiple different access methods (ODBC, JDBC, XML, and web services).
Attribute level or value specific data security. This feature supports data security at a much lower granularity than is typically available with most DBMS products. Data can be protected (or restricted) at the level of individual column values, either for an entire table or for selected rows.
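As a rough illustration of what value-level security means, the Python sketch below filters rows and masks individual column values based on the roles a user holds. The `apply_policy` function, the policy layout, and the role names are hypothetical; real DV products define security through their own configuration, not through code like this.

```python
def apply_policy(rows, policy, user_roles):
    """Filter rows and mask column values according to a simple policy."""
    visible = []
    row_filter = policy.get("row_filter")
    for row in rows:
        # Row-level rule: drop rows the user is not allowed to see.
        if row_filter and not row_filter(row, user_roles):
            continue
        masked = dict(row)
        # Column-level rule: mask a value unless the user holds the required role.
        for column, required_role in policy.get("masked_columns", {}).items():
            if column in masked and required_role not in user_roles:
                masked[column] = "***"
        visible.append(masked)
    return visible


# Hypothetical policy: only "global" users see non-US rows,
# and only "pii_reader" users see the ssn column.
policy = {
    "row_filter": lambda row, roles: row["region"] == "US" or "global" in roles,
    "masked_columns": {"ssn": "pii_reader"},
}
```

An analyst without either role would get back only the US rows, with `ssn` replaced by `***`.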
Metadata tracking and management. Since Data Virtualization is a query-centric middleware environment, it only makes sense to position this server to retrieve, reconcile, and store metadata content from multiple, disparate data repositories.
Data lineage. This item works in tandem with the metadata capability and augments the information by retaining the source details for all data that is retrieved. This not only includes source id information for individual records but also the origin, creation date, and native attribute details.
Query tracking for usage audit. Because the DV Server can act as a centralized access point for user tool access, there are several DV products that support the capture and tracking of all submitted queries. This can be used to track, measure, and analyze end user (or repository) access.
Workflow linkage and processing. This is the ability to execute predefined logic against specific data that is retrieved. While this concept is similar to a macro or stored procedure, it’s much more sophisticated. It could include the ability to direct job control or specialized processing against an answer set prior to delivery (e.g. data hygiene, external access control, stewardship approval, etc.)
Packaged Application Templates. Most packaged applications (CRM, ERP, etc.) contain thousands of tables and columns that can be very difficult to understand and query. Several DV vendors have developed templates containing predefined DV server views that access the most commonly queried data elements.
Setup and Configuration Wizards. Configuring a DV server to access the multiple data sources can be a very time consuming exercise; the administrator needs to define and configure every source repository, the underlying tables (or files), along with the individual data fields. To simplify setup, a configuration wizard reviews the dictionary of an available data source and generates the necessary DV Server configuration details. It further analyzes the table and column names to simplify naming conventions, joins, and data value conversion and standardization details.
Don’t be misled into thinking that Data Virtualization is a highly mature product space where all of the products are nearly identical. They aren’t. Most product vendors spend more time discussing their unique features than offering metrics about their core features. It’s important to remember that every Data Virtualization product requires a server that retrieves and processes data to fulfill query requests. This technology is not a commodity, which means that details like setup/configuration time, query performance, and advanced features can vary dramatically across products. Benchmark and test drive the technology before buying.
IT organizations have spent enormous sums of money over the past 10-15 years attacking productivity. They’ve acquired data integration tools, implemented improved development methodologies, and even reengineered requirements gathering methods to ensure business priority alignment. And the result of all of this investment? Today’s data integration developers are easily 10x to 20x more productive than the COBOL programmers of the past. This shouldn’t be a surprise to anyone – writing, compiling, linking, and testing 3rd generation code is much slower than today’s GUI-based, drag-and-drop development tools. The tools work; developers are faster, quicker, and better.
So, why does it still seem to take an eternity and cost a fortune to acquire and integrate new data into an existing report? The bottleneck has moved upstream: finding and extracting source data is complicated and time consuming. We’ve invested in our Integration Competency Centers to create an assembly line to streamline the process of transforming and converting data that is loaded into databases or applications. Unfortunately, we’ve not devoted any effort in simplifying access or understanding the actual raw source data that feeds the assembly line.
Henry Ford didn’t invent the assembly line, he revolutionized it. One of the changes that he introduced to the assembly line was simplifying and standardizing parts and the actual assembly process. Prior to Ford’s assembly line, car assembly was a custom effort that required highly trained craftsmen to shape, tool, and fit parts by hand (in a very time consuming process). The parts weren’t always uniform, so the craftsmen had to spend a significant amount of time fitting the parts together.
In most IT environments, source system access and data content varies across the different application systems dramatically. This forces developers to become data craftsmen in order to deal with the data idiosyncrasies associated with the numerous source systems common to most companies. Every system stores data in a custom and unique manner; it takes a lot of time to search and analyze source system data in order to identify the necessary content. (A popular ERP package stores its details in more than 10,000 tables) So, each new request often requires developers to create “from scratch” code to access and manipulate new data from a source system. If you dig a bit, you’ll probably find that many of your application systems generate dozens or hundreds (yes, hundreds) of custom extracts to deliver data to support the various production business needs within your company.
While most folks might think that custom extracts are a reasonably decent solution, they’re not. In fact, they’re a problem that will only get worse with time. (Remember, every extract requires development time and ongoing support.) You’ll be better off consolidating all of those extracts into a single set that includes all of the data. This will reduce processing time, reduce storage, reduce maintenance, and ultimately save a lot of money. You’ll have to spend some time designing and building these new extracts and getting folks to migrate to using them, but the benefits will be significant. (One of my clients was able to defer a platform upgrade due to the CPU and storage reduction caused by the consolidation and removal of all of the custom extracts).
Standardizing source data to reduce the data craftsmen problem isn’t rocket science, but it’s more than simply creating a data dump or generating a backup file. You need to deliver data in a manner that can be quickly and easily consumed by other systems. This means that the content needs to be reformatted from the unique (sometimes indecipherable) format of the host application into a format that everyone else can use. This can be easily addressed by delivering data into database tables or flat files (I know one client that delivers data in tab delimited spreadsheet format). The data should reflect the values generated by the source system in a format that everyone can understand – the content shouldn’t be modified or cleansed (this is source data, not content ready for business consumption). Delivery should occur on a frequent and regular basis along with a plan for archiving a decent amount of history.
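A standard extract of this kind doesn’t need much machinery. The Python sketch below writes source rows to a tab delimited file with a header row, leaving the values exactly as the source system produced them. The `write_standard_extract` function and the column names in the usage note are hypothetical; the only library used is Python’s standard `csv` module.

```python
import csv
import io


def write_standard_extract(rows, columns, out):
    """Write source rows as tab-delimited text with a header row.

    Values are written as delivered by the source system: no cleansing and
    no standardization. Missing values become empty fields.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow(["" if row.get(c) is None else row[c] for c in columns])


# Usage: an in-memory buffer stands in for the extract file.
buf = io.StringIO()
write_standard_extract(
    [{"id": 1, "name": " Acme  ", "city": None}],
    ["id", "name", "city"],
    buf,
)
```

Note that the stray spaces in the name survive untouched. Cleansing belongs downstream, where each consumer applies its own rules to the same shared extract.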
This isn’t a new concept; this was a common approach in the days when custom coded IBM mainframe applications were all the rage. Back then, data sharing was a priority and every application generated standard extracts to reduce I/O and storage costs. There was also an extreme sensitivity to developer time. Requesting a custom extract was frowned upon and rarely approved. Finding and accessing the data was as simple as referencing the extract files that were made available from every application system.
When it comes to improving the delivery speed of new data to business users, maybe we can learn something from Henry Ford and the world of mainframe development.
Unless you’ve been hiding in a cave in the past year, you’ve probably heard of CEP (Complex Event Processing) or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology allows users to query and manipulate discrete pieces of information, like events and messages, in real-time—without being encumbered by a traditional database management system.
The analogy here is that if you can’t bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record? Furthermore, this lets me analyze the data right after its time of creation! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?
I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.
When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better alternative to traditional ETL. CEP provides complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own process. This prevents the unnecessary or reckless copying of source application content.
Many industries need a real-time view of customer activities. For instance, in the gaming industry, when a customer inserts her card into a slot machine, the casino wants to provide a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, to transform and standardize it, to load it into a table, only to make it available to a query to identify the best offer. In the world of CEP we’d simply query the initial message and make the best offer.
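The slot machine scenario can be sketched as a few lines of Python that evaluate each message as it arrives, with no table load in between. This is a plain generator standing in for a CEP engine, which would normally offer its own query language, time windows, and pattern matching; the event fields (`type`, `player_id`, `tier`, `visits_this_month`) and the offer rules are hypothetical.

```python
def choose_offer(event, offers):
    """Pick the first offer whose rule matches a single card-insert event."""
    for offer in offers:
        if offer["rule"](event):
            return offer["name"]
    return None


def process_stream(events, offers):
    """Evaluate each event as it arrives; nothing is loaded into a database."""
    for event in events:
        if event.get("type") != "card_inserted":
            continue  # ignore messages that aren't card inserts
        yield event["player_id"], choose_offer(event, offers)


# Hypothetical offer rules, checked in priority order.
offers = [
    {"name": "free_dinner", "rule": lambda e: e["tier"] == "gold"},
    {"name": "free_play_10", "rule": lambda e: e["visits_this_month"] >= 3},
]
```

A second application that needs the same card-insert messages would simply run its own rules over the same stream rather than taking its own copy of the data.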
Many ETL tools already use query language constructs and operators to manipulate data. They typically require the data to be loaded into a database. The major vendors have evolved to an “ELT” architecture: to leverage the underlying database engine to address performance. Why not simply tackle the performance problem directly and bypass the database altogether?
The promise of CEP is a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher performance and easier-to-use alternative. The interesting part will be seeing how long before companies emerge from their caves and adopt it.
photo by Orin Zebest via Flickr (Creative Commons license)
At the recent Gartner MDM Summit in Las Vegas I was approached at least a half a dozen times by people wondering what MDM vendor to choose. I gave my usual response, which was, “What are you trying to accomplish?”
Normally a (short) conversation ensues about functions, feeds, and speeds, which then leads to my next question, “So, what are your priorities and decision criteria?” The responses were all the same, and I have to admit that they surprised me.
“We know we need MDM, but our company hasn’t really decided what MDM is. Since we’re already a [Microsoft / IBM / SAP / Oracle / SAS] shop, we just thought we’d buy their product…so what do you think of their product?”
I find this type of question interesting and puzzling. Why would anyone blindly purchase a product because of the vendor, rather than focusing on needs, priorities, and cost metrics? Unless a decision has absolutely no risk or cost, I’m not clear how identifying a vendor before identifying the requirements could possibly have a successful outcome.
If I look in my refrigerator, not all my products have the same brand label. My taste, interests, and price tolerance vary based upon the product. My catsup comes from one company, my salad dressing comes from another, and I have about seven different types of mustard (long story). Likewise, my TV, DVD player, surround sound system, DVR, and even my remote control are all different brands. Despite the advertisers’ claims, no single company has the best feature set across all products. For those of you who are loyal to a single brand, you can stop reading now. I’m sure you think I’m nuts.
The fact is that different vendors have different strengths, and this causes their products to differ. Buyers of these products should focus on their requirements and needs, not the product’s functions and features. Somehow this type of logic seems to escape otherwise smart business people. A good decision can deliver enormous benefits to a company; a bad decision can deliver enormous benefits to a company’s competitors.
What other reason would there be for someone saying, “We’re a [vendor name here] shop?” Examples abound of vendors abandoning products. IBM’s Intelligent Miner data mining tool, OS/2, the Apple Newton, Microsoft Money are but a few of the many examples.
Working with a reputable vendor is smart. Gathering requirements, reviewing product features, and determining the best match creates the opportunity for developing a client/vendor partnership. So why would anyone throw all of that out and just decide to pick a vendor? I guess lots of folks thought that Bernie Madoff was their partner. Need I say more?
photo by xJasonRogersx via Flickr (Creative Commons license)
I’ve been making the point over the past several years that master data management (MDM) development projects are different, and are accompanied by unique challenges. Because of the “newness” of MDM and its unique value proposition, MDM development can challenge traditional IT development assumptions.
MDM is very much a transactional processing system; it receives application requests, processes them, and returns a result. The complexities of transaction management, near real-time processing, and the details associated with security, logging, and application interfaces are a handful. Most OLTP applications assume that the provided data is usable; if the data is unacceptable, the application simply returns an error. Most OLTP developers are accustomed to addressing these types of functional requirements. Dealing with imperfect data has traditionally been unacceptable because it slowed down processing; ignoring it or returning an error was a best practice.
The difference with MDM development is the focus on data content (and value-based) processing. The whole purpose of MDM is to deal with all data, including the unacceptable stuff. It assumes that the data is good enough. MDM code assumes the data is complex and “unacceptable” and focuses on figuring out the values. The development methods associated with deciphering, interpreting, or decoding unacceptable data to make it usable are very different. They require a deep understanding of a different type of business rule – those associated with data content. Because most business processes have data inputs and data outputs, there can be dozens of data content rules associated with each business process. Traditionally, OLTP developers didn’t focus on the business content rules; they were focused on automating business processes.
MDM developers need to be comfortable with addressing the various data content processing issues (identification, matching, survivorship, etc.) along with the well understood issues of OLTP development (transaction management, high performance, etc.) We’ve learned that the best MDM development environments invest heavily in data analysis and data management during the initial design and development stages. They invest in profiling and analyzing each system of creation. They also differentiate hub development from source on-boarding and hub administration. The team that focuses on application interfaces, CRUD processing, and transaction & bulk processing requires different skills from those developers focused on match processing rules, application on-boarding, and hub administration. The developers focused on hub construction are different than those team members focused on the data changes and value questions coming from data stewards and application developers. This isn’t about differentiating development from maintenance; this is about differentiating the skills associated with the various development activities.
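For readers who haven’t worked with data content rules, here is a deliberately tiny Python sketch of two of them: matching (deciding which records describe the same party) and survivorship (deciding which value wins). Both rules are hypothetical stand-ins, and real hubs use far richer logic: here records match on a normalized name plus postal code, and the most recently updated non-empty value survives for each field.

```python
def match_key(record):
    """Hypothetical matching rule: normalized name plus postal code."""
    name = " ".join(record["name"].lower().split())
    return (name, record["postal_code"].strip())


def survive(records):
    """Hypothetical survivorship rule: the latest non-empty value wins per field."""
    golden = {}
    # ISO-format date strings sort chronologically, so later records overwrite.
    for record in sorted(records, key=lambda r: r["updated"]):
        for field, value in record.items():
            if value not in (None, ""):
                golden[field] = value
    return golden


def build_golden_records(records):
    """Group records by match key, then apply survivorship to each group."""
    groups = {}
    for record in records:
        groups.setdefault(match_key(record), []).append(record)
    return [survive(group) for group in groups.values()]
```

Notice that neither function rejects a record for being messy. Unlike the OLTP habit of returning an error, the code takes “Jane Doe” and “JANE  DOE” as they come and works out what the values should be.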
If the MDM team does its job right it can dramatically reduce the data errors that cause application processing and reporting problems. They can identify and quantify data problems so that other development teams can recognize them, too. This is why MDM development is critical to creating the single version of truth.
Image via cafepress.com.