Archive | July 2010

The Problem with Total Cost of Ownership


The issue of Total Cost of Ownership (TCO) seems to come and go every few years. The need for it tends to ebb and flow with corporate budget cycles. TCO is perfectly fine for well-understood commodity functions or defined business processes. If I have to replace a server or a printer, or change a business process, TCO is a perfectly rational metric for comparing different alternatives.

When TCO calculations work, they tend to roll up within a single organization or manager. The hardware, the software, the installation, and the maintenance are under the domain of a single organization that covers the direct cost.

The problem with TCO arises when it’s used as a metric for justifying cross-functional or analytical systems. With these systems, the value isn’t delivering commodity processing but rather supporting decision making. TCO focuses on construction and maintenance costs. But for analytical systems, usage occurs across different organizations and varies with business value and need. TCO can in fact be misapplied.

At a simple level, TCO is often limited to processing hardware, storage, software, and IT resources necessary to configure and manage the platform on an ongoing basis.  But this is usually limited to IT staff focused on system development and maintenance. Unfortunately the most expensive cost—not normally included in TCO calculations—is the business user’s time. While TCO quantifies costs for a data warehouse developer, there is no clear way to calculate costs for the dozens or hundreds of business users who are actually analyzing data and creating reports every day. The reality of analytical systems is that development continues every day on the business side.

Nevertheless it’s common for TCO calculations to be reduced to the cost of processing or storage, rather than reflecting the exponential costs of users circumventing slow-running queries and inaccurate data.  At the end of the day, TCO shouldn’t only be about the cost of hardware and software installation and maintenance. It should be about the cost of continued business usage.

photo by -Luz- via Flickr (Creative Commons license)

Complex Event Processing: Challenging Real-Time ETL

Cave Swallow by Orin Zebest via Flickr (Creative Commons)

Unless you’ve been hiding in a cave in the past year, you’ve probably heard of CEP (Complex Event Processing) or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology allows users to query and manipulate discrete pieces of information, like events and messages, in real-time—without being encumbered by a traditional database management system.

The analogy here is that if you can’t bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record?  Furthermore, this lets me analyze the data right after its time of creation! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?

I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.

When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better approach to traditional ETL. CEP provides complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own process. This prevents the unnecessary or reckless copying of source application content.

There are many industries need a real-time view of customer activities. For instance in the gaming industry when a customer inserts her card into a slot machine, the casino wants to provide a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, to transform and standardize it, to load it into a table, only to make it available to a query to identify the best offer.  In the world of CEP we’d simply query the initial message and make the best offer.

Many ETL tools already use query language constructs and operators to manipulate data. They typically require the data to be loaded into a database. The major vendors have evolved to an “ELT” architecture: to leverage the underlying database engine to address performance. Why not simply tackle the performance problem directly and bypass the database altogether?

The promise of CEP a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher performance and easier-to-use alternative. The interesting part will be seeing how long before companies emerge from their caves and adopt it.

photo by Orin Zebest via Flickr (Creative Commons license)

The Flaw of the Data Inventory

Grecian Urn 2

Back when I was applying to college, I’d read over college catalogs. Inevitably, each university would mention the number of books it had in its library. When I finally went to college, I realized that this metric was fairly meaningless. A dozen volumes on Grecian pottery did me no good when I was in search of a book on polymers for my mechanical engineering class.

Clients will often ask us to scope a “data inventory” project, inevitably focused on identifying and describing all the data elements contained across their different application systems. Recently a new CIO asked us to head up a “tiger team” to inventory his company’s data. He was surprised at the quantity of information needs that had been sent his way. As expected, he inquired about systems of record and data dictionaries. As you can imagine, he received multiple and conflicting answers which only exacerbated his confusion.

As a point of reference, well-known ERP systems can have in excess of 50,000 discrete data elements in their databases (never mind that some aren’t in English). As I’ve written in the past, many of these data elements have no use outside of the application itself.

Having terabyte upon terabyte of information is equally irrelevant if that data is unrelated to current business issues. The problem with a data inventory activity is that identifying and counting data elements in different systems and applications won’t necessarily solve any problems. Why? Because data across applications and packages is inconsistent: there are different names, definitions, and values, and there is no practical means of determining which data they actually have in common. This is like going to the hardware store and looking for a specific screw, but all the different screws are in one big barrel—you end up having to pick through each screw, one at time. When you find the screw, you just throw all the other screws back into the barrel.

The point of a data inventory isn’t to pick through data because it exists, but to inventory the data people actually need. If you’re going to undertake a data inventory, your output should be structured so that the next person doesn’t have to repeat your work.  Identify the data that is moving across various systems, as this indicates key information that’s being shared. Categorize this data by subject area. You’ll inevitably find that there are inconsistent versions of the data, enabling you to identify data disparities. You can then begin to develop a catalog of key corporate data that will form the basis of your data dictionary.

Inventorying the data that moves between systems accomplishes two things: it identifies the most valuable data elements in use, and it will also help identify data that’s not high-value, as it’s not being shared or used. This approach also provides a way to tackle initial data quality efforts by identifying the most “active” data used by the business. It ultimately helps the data management team understand where to focus its efforts, and prioritize accordingly.

So next time someone suggests a data inventory without context or objectives, consider sending them to college to study Grecian urns.

%d bloggers like this: