One of the challenges in delivering successful data-centric projects (e.g. analytics, BI, or reporting) is realizing that the definition of project success differs from traditional IT application projects. Success for a traditional application (or operational) project is often described in terms of transaction volumes, functional capabilities, processing conformance, and response time; data project success is often described in terms of business process analysis, decision enablement, or business situation measurement. To a business user, the success of a data-centric project is simple: data usability.
It seems that most folks respond to data usability issues by gravitating towards a discussion about data accuracy or data quality; I actually think the more appropriate discussion is data knowledge. I don’t think anyone would argue that to make data-enabled decisions, you need to have knowledge about the underlying data. The challenge is understanding what level of knowledge is necessary. If you ask a BI or Data Warehouse person, their answer almost always includes metadata, data lineage, and a data dictionary. If you ask a data mining person, they often just want specific attributes and their descriptions — they don’t care about anything else. All of these folks have different views of data usability and varying levels (and needs) for data knowledge.
One way to improve data usability is to target and differentiate the user audience based on their data knowledge needs. There are certainly lots of different approaches to categorizing users; in fact, every analyst firm and vendor has their own model to describe different audience segments. One of the problems with these types of models is that they tend to focus heavily on the tools or analytical methods (canned reports, drill down, etc.) and ignore the details of data content and complexity. The knowledge required to manipulate a single subject area (revenue or customer or usage) is significantly less than the skills required to manipulate data across three subject areas (revenue, customer, and usage). And what drives the need for even more data knowledge is the inevitable plethora of value gaps, inaccuracies, and inconsistencies associated with the data. Data knowledge isn't just limited to understanding the data; it includes understanding how to work around all of its imperfections.
Here's a model that categorizes and describes business users based on their views of data usability and their data knowledge needs.
Level 1: “Can you explain these numbers to me?”
This person is the casual data user. They have access to a zillion reports that have been identified by their predecessors and they focus their effort on acting on the numbers they get. They’re not a data analyst – their focus is to understand the meaning of the details so they can do their job. They assume that the data has been checked, rechecked, and vetted by lots of folks in advance of their receiving the content. They believe the numbers and they act on what they see.
Level 2: “Give me the details”
This person has been using canned reports, understands all the basic details, and has graduated to using data to answer new questions that weren’t identified by their predecessors. They need detailed data and they want to reorganize the details to suit their specific needs (“I don’t want weekly revenue breakdowns – I want to compare weekday revenue to weekend revenue”). They realize the data is imperfect (and in most instances, they’ll live with it). They want the detail.
Level 3: “I don’t believe the data — please fix it”
These folks know their area of the business inside and out, and they know the data. They scour and review the details to diagnose the business problems they're analyzing. And when they find a data mistake or inaccuracy, they aren't shy about raising their hand. Whether they're a data analyst who uses SQL or a statistician with a favorite advanced analytics algorithm, they focus on identifying business anomalies. These folks are the power users who are incredibly valuable and often the most difficult for IT to please.
Level 4: “Give me more data”
This is subject area graduation. At this point, the user has become self-sufficient with their data and needs more content to address a new or more complex set of business analysis needs. Asking for more data – whether a new source or more detail – indicates that the person has exhausted their options in using the data they have available. When someone has the capacity to learn a new subject area or take on more detailed content, they’re illustrating a higher level of data knowledge.
One thing to consider about the above model is that a user will have varying data knowledge based on the individual subject area. A marketing person may be completely self-sufficient on revenue data but be a newbie with usage details. A customer support person may be an expert on customer data but only have limited knowledge of product data. You wouldn’t expect many folks (outside of IT) to be experts on all of the existing data subject areas. Their knowledge is going to reflect the breadth of their job responsibilities.
As someone grows and evolves in business expertise and influence, it’s only natural that their business information needs would grow and evolve too. In order to address data usability (and project success), maybe it makes sense to reconsider the various user audience categories and how they are defined. Growing data knowledge isn’t about making everyone data gurus; it’s about enabling staff members to become self-sufficient in their use of corporate data to do their jobs.
Photo “Ladder of Knowledge” courtesy of degreezero2000 via Flickr (Creative Commons license).
The issue of Total Cost of Ownership (TCO) seems to come and go every few years. The need for it tends to ebb and flow with corporate budget cycles. TCO is perfectly fine for well-understood commodity functions or defined business processes. If I have to replace a server or a printer, or change a business process, TCO is a perfectly rational metric for comparing different alternatives.
When TCO calculations work, they tend to roll up within a single organization or manager. The hardware, the software, the installation, and the maintenance are under the domain of a single organization that covers the direct cost.
The problem with TCO arises when it’s used as a metric for justifying cross-functional or analytical systems. With these systems, the value isn’t delivering commodity processing but rather supporting decision making. TCO focuses on construction and maintenance costs. But for analytical systems, usage occurs across different organizations and varies with business value and need. TCO can in fact be misapplied.
At a simple level, TCO is often limited to the processing hardware, storage, software, and IT resources necessary to configure and manage the platform on an ongoing basis, and the staffing it counts is usually just the IT staff focused on system development and maintenance. Unfortunately, the most expensive cost, one not normally included in TCO calculations, is the business user's time. While TCO quantifies costs for a data warehouse developer, there is no clear way to calculate costs for the dozens or hundreds of business users who are actually analyzing data and creating reports every day. The reality of analytical systems is that development continues every day on the business side.
Nevertheless it’s common for TCO calculations to be reduced to the cost of processing or storage, rather than reflecting the exponential costs of users circumventing slow-running queries and inaccurate data. At the end of the day, TCO shouldn’t only be about the cost of hardware and software installation and maintenance. It should be about the cost of continued business usage.
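To make that gap concrete, here's a minimal sketch with entirely invented numbers; the cost categories, headcounts, and rates are illustrative assumptions, not benchmarks:

```python
# Illustrative only: invented figures showing how business-user time can
# dwarf the platform line items a typical TCO spreadsheet counts.

it_costs = {
    "hardware": 250_000,          # servers and storage (annualized)
    "software_licenses": 150_000,
    "it_staff": 300_000,          # DBAs, ETL developers, administrators
}

# What the spreadsheet usually omits: business users working around the system.
analysts = 80                     # users analyzing data and building reports
hours_lost_per_week = 3           # time spent circumventing slow queries and bad data
loaded_hourly_rate = 75           # fully loaded cost per analyst hour
weeks_per_year = 48

user_cost = analysts * hours_lost_per_week * loaded_hourly_rate * weeks_per_year

platform_tco = sum(it_costs.values())
print(f"Platform-only TCO:         ${platform_tco:,}")     # $700,000
print(f"Hidden business-user cost: ${user_cost:,}")        # $864,000
print(f"TCO including usage:       ${platform_tco + user_cost:,}")
```

Even with conservative assumptions, the user-side cost exceeds everything the platform-only calculation captures.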
photo by -Luz- via Flickr (Creative Commons license)
Unless you've been hiding in a cave for the past year, you've probably heard of CEP (Complex Event Processing) or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology allows users to query and manipulate discrete pieces of information, like events and messages, in real time, without being encumbered by a traditional database management system.
The analogy here is that if you can't bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record? Furthermore, this lets me analyze the data the moment it's created! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?
I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.
When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called "trickle-feed" environment, CEP may actually provide a better approach than traditional ETL. CEP provides complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn't get its own copy; each application simply applies its own processing to the same stream. This prevents the unnecessary or reckless copying of source application content.
Many industries need a real-time view of customer activities. For instance, in the gaming industry, when a customer inserts her card into a slot machine, the casino wants to present a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, transform and standardize it, and load it into a table, only to make it available to a query that identifies the best offer. In the world of CEP, we'd simply query the initial message and make the best offer.
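Here's a minimal sketch of that idea in plain Python, not any particular CEP product; the event fields and offer rules are invented for illustration:

```python
# Each event is handled the moment it arrives: no staging table, no batch
# load, no intermediary database. (Invented fields and rules throughout.)

from dataclasses import dataclass

@dataclass
class CardInsertEvent:
    customer_id: str
    machine_id: str
    loyalty_tier: str   # e.g., "gold" or "silver"

def best_offer(event: CardInsertEvent) -> str:
    # A real rule set would be far richer, and likely expressed in the
    # CEP engine's own query language rather than hand-coded like this.
    if event.loyalty_tier == "gold":
        return "free show tickets"
    return "free buffet coupon"

def on_event(event: CardInsertEvent) -> None:
    # Act on the single record in-flight.
    print(f"Customer {event.customer_id}: {best_offer(event)}")

# A simulated stream of slot-machine card insertions.
for evt in [CardInsertEvent("C100", "M17", "gold"),
            CardInsertEvent("C205", "M03", "silver")]:
    on_event(evt)
```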
Many ETL tools already use query language constructs and operators to manipulate data, but they typically require the data to be loaded into a database first. The major vendors have evolved toward an "ELT" architecture, leveraging the underlying database engine to address performance. Why not tackle the performance problem directly and bypass the database altogether?
The promise of CEP is a new set of business applications and capabilities. I'm also starting to believe that CEP could actually replace traditional ETL tools as a higher-performance and easier-to-use alternative. The interesting part will be seeing how long it takes companies to emerge from their caves and adopt it.
photo by Orin Zebest via Flickr (Creative Commons license)
Back when I was applying to college, I’d read over college catalogs. Inevitably, each university would mention the number of books it had in its library. When I finally went to college, I realized that this metric was fairly meaningless. A dozen volumes on Grecian pottery did me no good when I was in search of a book on polymers for my mechanical engineering class.
Clients will often ask us to scope a “data inventory” project, inevitably focused on identifying and describing all the data elements contained across their different application systems. Recently a new CIO asked us to head up a “tiger team” to inventory his company’s data. He was surprised at the quantity of information needs that had been sent his way. As expected, he inquired about systems of record and data dictionaries. As you can imagine, he received multiple and conflicting answers which only exacerbated his confusion.
As a point of reference, well-known ERP systems can have in excess of 50,000 discrete data elements in their databases (never mind that some aren’t in English). As I’ve written in the past, many of these data elements have no use outside of the application itself.
Having terabyte upon terabyte of information is equally irrelevant if that data is unrelated to current business issues. The problem with a data inventory activity is that identifying and counting data elements in different systems and applications won't necessarily solve any problems. Why? Because data across applications and packages is inconsistent: there are different names, definitions, and values, and there is no practical means of determining which data they actually have in common. This is like going to the hardware store and looking for a specific screw, but all the different screws are in one big barrel, so you end up having to pick through each screw, one at a time. When you find the screw you need, you just throw all the other screws back into the barrel.
The point of a data inventory isn’t to pick through data because it exists, but to inventory the data people actually need. If you’re going to undertake a data inventory, your output should be structured so that the next person doesn’t have to repeat your work. Identify the data that is moving across various systems, as this indicates key information that’s being shared. Categorize this data by subject area. You’ll inevitably find that there are inconsistent versions of the data, enabling you to identify data disparities. You can then begin to develop a catalog of key corporate data that will form the basis of your data dictionary.
Inventorying the data that moves between systems accomplishes two things: it identifies the most valuable data elements in use, and it will also help identify data that’s not high-value, as it’s not being shared or used. This approach also provides a way to tackle initial data quality efforts by identifying the most “active” data used by the business. It ultimately helps the data management team understand where to focus its efforts, and prioritize accordingly.
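Here's a minimal sketch of that approach; the interface feeds and element names below are invented, and in practice they'd come from interface specs, ETL job metadata, or file layouts:

```python
# Inventory only the data that moves between systems, group it by subject
# area, and surface naming disparities worth reconciling.

from collections import defaultdict

# (source system, target system, element name, subject area) -- invented
interface_feeds = [
    ("billing",  "warehouse", "cust_id",      "customer"),
    ("crm",      "warehouse", "customer_nbr", "customer"),
    ("billing",  "finance",   "rev_amt",      "revenue"),
    ("ordering", "warehouse", "revenue_amt",  "revenue"),
]

catalog = defaultdict(set)
for source, target, element, subject in interface_feeds:
    catalog[subject].add(element)

for subject, elements in sorted(catalog.items()):
    print(f"{subject}: {sorted(elements)}")
    if len(elements) > 1:
        # Multiple names for shared content is exactly the disparity a
        # data inventory should flag.
        print(f"  -> {len(elements)} variants to reconcile")
```

The output, grouped by subject area, becomes the seed of the data dictionary described above.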
So next time someone suggests a data inventory without context or objectives, consider sending them to college to study Grecian urns.
As I wrote in last week’s blog post, a data warehouse appliance simplifies platform and system resource administration. It doesn’t simplify the traditional time-intensive efforts of managing and integrating disparate data and addressing performance and tuning of various applications that contend for the same resources.
Many data warehouse appliance vendors offer sophisticated parallel processing environments, query optimization, and specialized storage structures to improve query processing (e.g., columnar-based engines). It's naïve to think that taking data from an SMP (Symmetric Multi-Processing) relational database and moving it into a parallel processing environment will scale effectively without any adjustments or changes. Moving onto an appliance can be likened to moving into a new house. When you move into a new, larger house, you quickly learn that it's not as simple as dumping all of your stuff into the new house. The different dimensions of the new rooms cause you to realize that some of your old furniture or rugs simply don't fit. You inevitably have to make adjustments if you want to truly enjoy your new home. The same goes for a data warehouse appliance: it likely has numerous features to support growth and scalability, but you have to make adjustments to leverage their benefits.
Companies that expect to simply dump their data from a few legacy data marts over to a new appliance should expect to confront some adjustments, or they're likely to experience some unpleasant surprises. Here are some that we've already seen.
Everyone agrees that the biggest cost issue behind building a data warehouse is ETL design and development. Hoping to migrate existing ETL jobs into a new hardware and processing environment without expecting rework is short-sighted. While you can probably force fit your existing job streams, you’ll inevitably misuse the new system, waste system resources, and dramatically reduce the lifespan of the appliance. Each appliance has its own way of handling the intensive resource requirements of data loading – in much the same way that each incumbent database product addresses these same situations. If you’ve justified an appliance through the benefits of consolidating multiple data marts (that contain duplicate data), it only makes sense to consolidate and integrate the ETL processes to prevent processing duplication and waste.
It's also misguided to assume that because you've built your ETL architecture on the latest and greatest ETL software technology, you won't have to review the underlying ETL architecture. While there's no question that migrating tool-based ETL jobs to a new platform can be much easier than migrating lower-level code, the issue at hand isn't the source and destination; it's the underlying table structures. Not every table will change in definition on a new platform, but the largest (and most used) tables are the most likely candidates for review and redesign. Each appliance handles data distribution and database design differently. Consequently, since the underlying table structures are likely to require adjustment, plan on a redesign of the actual ETL processes too.
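To see why the biggest tables deserve that scrutiny, here's a minimal sketch of distribution-key skew; the hash function and node count are stand-ins, not any vendor's actual algorithm:

```python
# MPP appliances typically spread a table's rows across nodes by hashing a
# distribution key. A low-cardinality key concentrates rows (and work) on
# a few nodes; a near-unique key spreads them evenly.

from collections import Counter
from zlib import crc32

NODES = 8

def node_for(key: str) -> int:
    return crc32(key.encode()) % NODES

# A toy fact table: near-unique transaction ids, low-cardinality regions.
rows = [(f"T{i:06d}", "EAST" if i % 10 else "WEST") for i in range(10_000)]

for label, idx in (("txn_id", 0), ("region", 1)):
    counts = Counter(node_for(row[idx]) for row in rows)
    print(f"distribute on {label}: rows per node = {sorted(counts.values())}")

# Distributing on region lands all 10,000 rows on at most two nodes, which
# is why the largest tables (and the ETL that loads them) need redesign.
```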
I'm also surprised by the casual attitude regarding technical training. After all, it's just a SQL database, right? But application developers and data warehouse development staff need to understand the differences of the appliance product (after all, it's a different database version or product). While most of this knowledge can be gained by reading the manuals, when was the last time the DBAs or database developers actually had a full set of manuals, much less the time required to read them? The investment in training isn't significant, usually just a few days of classes. If you're going to provide your developers with a product that claims to be bigger, better, and faster than its competitors, doesn't it make sense to prepare them adequately to use it?
There's also an assumption that, since most data warehouse appliance vendors are software-only, there are no hardware implications. On the contrary, you should expect to change your existing hardware. The way memory and storage are configured on a data warehouse appliance can differ from a general-purpose server, but it's still rare that the hardware costs are factored into the development plan. And believing that older servers can be repurposed has turned out to be a myth. If you're attempting to support more storage, more processing, and more users, how can using older equipment (with the related higher maintenance costs) make financial sense?
You could certainly fork-lift your data, leave all the ETL jobs alone, and not change any processing. Then again, you could save a fortune on a new data warehouse appliance and simply do nothing. After all, no one argues with the savings associated with doing nothing—except, of course, the users that need the data to run your business.
photo by Bien Stephenson via Flickr (Creative Commons License)
Many of our clients have asked us whether it's time to consider replacing their aging data warehouses with data warehouse appliance technologies. I chalk up this emerging interest to the reality that data warehouse life spans are 3 to 4 years and platforms need to be refreshed, a reality underscored by the recent crop of announcements from vendors like Oracle and Teradata, along with the high visibility of newer players like Netezza, Paraccel, and Vertica.
The benefit of a data warehouse appliance is that it includes all of the hardware and software in a preconfigured solution that dramatically simplifies running and managing a data warehouse. (Some of the vendors have taken that one step further and actually sell software that is set up to work with specially defined commodity hardware configurations.) Given the price/performance differences between the established data warehouse products and the newer data warehouse appliances, it only makes sense that these products be considered as alternatives to simply upgrading the hardware.
The data warehouse appliance market is arguably not new. In the 1980s, companies like Britton-Lee and Teradata argued that database processing was different and would perform better with purpose-designed hardware and software. Many have also forgotten that these pioneers argued the price/performance of commodity microprocessors vastly exceeded that of their mainframe competitors.
The current generation of appliance vendors has been invited to the table because of the enormous costs of managing the data volumes and operational access associated with today's data warehouses. Most IT shops have learned that database scalability doesn't just mean throwing more hardware and storage at the problem. The challenge in managing these larger environments is understanding the dynamics of the data content and the associated processing. That's why partitioning the data across multiple servers or simply removing history doesn't work: for every shortcut taken to reduce the data quantity, there's an equal impact to user access and the single version of truth. This approach also makes data manipulation and even system support dramatically more complicated.
It's no surprise that these venture-capital-backed firms would focus on delivering a solution that is simpler to configure and manage. The glossy sales message of data warehouse appliance vendors comes down to something like: "We've reduced the complexity of running a data warehouse. Just install our appliance like a toaster, and watch it go!" There's no question that many of these appliance vendors have delivered when it comes to simplifying platform management and configuration; the real challenge is addressing the management and configuration issues that impact a growing data warehouse: scalable load processing, a flexible data architecture, and manageable query processing.
We've already run into several early adopters who think all that's necessary is to simply fork-lift their existing data warehouse structures onto their new appliance. While this approach may work initially, the actual longevity of the appliance, or at least its price/performance rationale, will soon evaporate. These new products can't work around bad data, poor design habits, and the limitations of duplicate data; their power is in providing scalability across enormous data and processing volumes. An appliance removes the complexities of platform administration. But no matter what appliance you purchase, and no matter how much horsepower it has, data architecture and data administration are still required.
In order to leverage the true power of an appliance, you have to expect to focus effort on integrating data in a structure that leverages the scalability strengths of the product. While the appliances are SQL-based, the way they process loads, organize data, and handle queries can be dramatically different from the incumbent data marts and data warehouses they replace. It's naïve to think that a new appliance can provide processing scalability without any adjustments. If it were that simple, the incumbent vendors would have already packaged that capability into their existing products.
In Part 2 of this post, I’ll elaborate on the faulty assumptions of many companies that acquire data warehouse appliances, and warn you against making these same mistakes.
photo by meddygarnet via Flickr (Creative Commons License)
There are far too many data warehouse development teams solely focused on loading data. They’ve completely lost sight of their success metrics.
Why have they fallen into this rut? Because they're doing what they've always done. One of the challenges in data warehousing is that, as time progresses, the people on the data warehouse development team are often not the same people who launched it. This turnover has eroded the original vision and degraded the team's effectiveness.
One client of mine actually bonused their data warehouse development team based on system usage and capacity. Was there a lot of data in the data warehouse? Yep. Were there multiple sandboxes, each with its own copy of data? Yep. Was this useful three years ago? Yep. Does any of this matter now? Nope. The original purpose of the data warehouse—indeed, the entire BI program—has been forgotten.
In the beginning, the team understood how to help the business. They were measured on business impact. Success was based on new revenues and lower costs outside of IT. The team understood that evolving the applications and data that support BI was critical to continually delivering business value. There was an awareness of what was next. Success was based on responding to the new business need. Sometimes this meant reusing data with new reports, sometimes it meant new data, and sometimes it was just adjusting a report. The focus was on aggressively responding to business change and the resulting business need.
How does your BI team support decision making? Does it still deliver value to business users? Maybe your company is like some of the companies that I’ve seen: the success of the data warehouse and the growth of its budget propelled it into being managed like an operational system. People have refocused their priorities to managing loads, monitoring jobs, and throwing the least-expensive, commodity skills at the program. So a few years after BI is introduced, the entire program has become populated with IT order-takers, watching and managing extracts, load jobs, and utilization levels.
Then an executive asks: “Why is this data warehouse costing us so much?”
You’ve built applications, you’ve delivered business value, and you’ve managed your budget. Good for you. But now you have to do more. IT’s definition of data warehouse success is you cutting your budget. Why? Because IT’s definition of success isn’t business value creation, it’s budget conformance.
Because BI isn't focused on automating business operations, as many operational systems are, it can't thrive in a maintenance-driven mode. In order to continue to support the business, BI must continually deliver new information and new functionality. Beware the IT organization that wants to migrate the data warehouse to an operational support model measured on budgets, not business value. This can jeopardize more than just your next platform upgrade; it can imperil the BI program itself. The tunnel vision of service-level agreements, manpower estimates, and project-plan maintenance isn't doing you any favors; none of it can be done devoid of business drivers.
When there are new business needs, business users may try to enlist IT resources to support them. But they no longer see partners who will help realize their visions and deliver valuable analytics. They see a few low-cost, less experienced technicians monitoring system uptime and staring at the blinking lights.
photo by jurvetson via Flickr (Creative Commons License)