Tag Archive | data warehouse

Hadoop Replacing Data Warehouse Processing


I was recently asked for my opinion on the potential of Hadoop replacing a company’s data warehouse (DW).  While there’s lots to be excited about when it comes to Hadoop, I’m not currently in the camp of folks who believe it’s practical to use Hadoop to replace a company’s DW.  Most corporate DW systems are based on commercial relational database products and can store and manage multiple terabytes of data and support hundreds (if not thousands) of concurrent users.  It’s fairly common for these systems to handle complex, mixed workloads – queries processing billions of rows across numerous tables alongside simple primary key retrieval requests, all while continually loading data.  The challenge today is that Hadoop simply isn’t ready for this level of complexity.

All that being said, I do believe there’s a huge opportunity to use Hadoop to replace a significant amount of processing that is currently being handled by most DWs.  Oh, and data warehouse users won’t be affected at all.

Let’s review a few fundamental details about the DW. There are two basic data processing activities that occur on a DW: query processing and transformation processing. Query processing is servicing the SQL that’s submitted from all of the tools and applications on the users’ desktops, tablets, and phones.  Transformation processing is the workload involved with converting data from its source application formats to the format required by the data warehouse. While query processing is the activity most visible to business users, it is typically the smaller of the two.  Extracting and transforming the dozens (or hundreds) of source data files for the DW is a huge processing activity.  In fact, most DWs are not sized for query processing; they are sized for the daily transformation processing effort.

It’s important to realize that one of the most critical service level agreements (SLAs) of a DW is data delivery.  Business users want their data first thing each morning.  That means the DW has to be sized to deliver data reliably each and every business morning.  Since most platforms are expected to have a 3+ year life span, IT has to size the DW system based on the worst-case data volume scenario for that entire period (end of quarter, end of year, holidays, etc.).  This means the DW is sized to handle a maximum load that may only occur a few times during that entire period.

This is where the opportunity for Hadoop seems pretty obvious. Hadoop is a parallel, scalable framework that handles distributed batch processing and large data volumes. It’s really a set of tools and technologies for developers, not end users.  This is probably why so many ETL (extract, transform, and load) product vendors have ported their products to execute within a Hadoop environment.  It only makes sense to migrate processing from a specialized platform to commodity hardware. Why bog down and overinvest in your DW platform if you can handle the heavy lifting of transformation processing on a less expensive platform?
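To make the idea concrete, here’s a minimal sketch of what an offloaded transformation step might look like as a Hadoop Streaming mapper written in Python. The source file layout, field positions, and formats below are hypothetical; the point is that the reshaping work runs on commodity Hadoop nodes, and only the finished load file is handed to the DW.

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming mapper: reshape a raw order extract into a
    # DW-ready load file. Field positions and formats are illustrative only.
    import sys
    from datetime import datetime

    for line in sys.stdin:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4:
            continue  # skip malformed source records
        order_id, cust_code, order_dt, amount = fields[:4]
        try:
            # Standardize the source date (MM/DD/YYYY) to the warehouse format (YYYY-MM-DD)
            load_dt = datetime.strptime(order_dt, "%m/%d/%Y").strftime("%Y-%m-%d")
            load_amt = "%.2f" % float(amount)
        except ValueError:
            continue  # a real job would route bad rows to a reject file
        # Emit a delimited record that's ready for a bulk load into the DW
        print("\t".join([order_id, cust_code.strip().upper(), load_dt, load_amt]))

Because a streaming mapper just reads stdin and writes stdout, the same script (saved here as a hypothetical transform_orders.py) can be tested locally by piping a sample extract through it, which is part of the appeal of pushing this class of work onto a developer-oriented batch platform.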

Introducing a new system to your DW environment will inevitably create new work for your DW architects and developers. However, the benefits are likely to be significant.  While some might view such an endeavor as a creative way to justify purchasing new hardware and installing Hadoop, the real reason is to extend the life of the data warehouse (and save your company a bunch of money by deferring a DW upgrade).

My Dog Ate the Requirements, Part 2


There’s nothing more frustrating than not being able to rely upon a business partner.  There are lots of business books about information technology that espouse the importance of business/IT alignment and of establishing business users as IT stakeholders. The whole idea of delivering business value with data and analytics is to provide business users with tools and data that can support business decision making.  It’s incredibly hard to deliver business value when half of the partnership isn’t stepping up to its responsibilities.

There’s never a shortage of rationale as to why requirements haven’t been collected or recorded.  In order for a relationship to be successful, both parties have to participate and cooperate.  Gathering and recording requirements isn’t possible if the technologist doesn’t meet with the users to discuss their needs, pains, and priorities.  Conversely, the requirements process won’t succeed if the users won’t participate. My last blog reviewed the excuses that technologists offered for explaining the lack of documented requirements; this week’s blog focuses on remarks I’ve heard from business stakeholders.

  • “I’m too busy.  I don’t have time to talk to developers”
  • “I meet with IT every month, they should know my requirements”
  • “IT isn’t asking me for requirements, they want me to approve SQL”
  • “We sent an email with a list of questions. What else do they need?”
  • “They have copies of reports we create. That should be enough.”
  • “The IT staff has worked here longer than I have.  There’s nothing I can tell them that they don’t already know”
  • “I’ve discussed my reporting needs in 3 separate meetings; I seem to be educating someone else with each successive discussion”
  • “I seem to answer a lot of questions.  I don’t ever see anyone writing anything down”
  • “I’ll meet with them again when they deliver the requirements I identified in our last discussion.”
  • “I’m not going to sign off on the requirements because my business priorities might change – and I’ll need to change the requirements.”

Requirements gathering is really a beginning stage for negotiating a contract for the creation and delivery of new software.  The contract is closed (or agreed to) when the business stakeholders agree to (or sign-off on) the requirements document.  While many believe that requirements are an IT-only artifact, they’re really a tool to establish responsibilities of both parties in the relationship.

A requirements document defines the data, functions, and capabilities that the technologist needs to build to deliver business value.  The requirements document also establishes the “product” that will be deployed and used by the business stakeholders to support their business decision making activities. The requirements process holds both parties accountable: technologists to build and business stakeholders to use. When two organizations can’t work together to develop requirements, it’s often a reflection of a bigger problem.

It’s not fair for business stakeholders to expect development teams to build commercial grade software if there’s no participation in the requirements process.  By the same token, it’s not right for technologists to build software without business stakeholder participation. If one stakeholder doesn’t want to participate in the requirements process, they shouldn’t be allowed to offer an opinion about the resulting deliverable.  If multiple stakeholders don’t want to participate in a requirements activity, the development process should be cancelled.  Lack of business stakeholder participation means they have other priorities; the technologists should take a hint and work on their other priorities.

Data Quality, Data Maintenance


I read an interesting tidbit about data the other day: the United States Postal Service processed more than 47 million changes of address in the last year.  That’s nearly 1 in 6 people. In the world of data, that factoid is a simple example of the challenge of addressing stale data and data quality.  The idea of stale data is that as data ages, its accuracy and associated business rules can change.

There are lots of examples of how data in your data warehouse can age and degrade in accuracy and quality:  people move, area codes change, postal/zip codes change, product descriptions change, and even product SKUs can change.  Data isn’t clean and accurate forever; it requires constant review and maintenance. This shouldn’t be much of a surprise for folks who view data as a corporate asset; any asset requires ongoing maintenance in order to retain and ensure its value.  The challenge with maintaining any asset is establishing a reasonable maintenance plan.

Unfortunately, while IT teams are exceptionally strong in planning and carrying out application maintenance, it’s quite rare that data maintenance gets any attention.  In the data warehousing world, data maintenance is typically handled in a reactive, project-centric manner.  Nearly every data warehouse (or reporting) team has to deal with data maintenance issues whenever a company changes major business processes or modifies customer or product groupings (e.g. new sales territories, new product categories, etc.)  This happens so often, most data warehouse folks have even given it a name:  Recasting History.   Regardless of what you call it, it’s a common occurrence and there are steps that can be taken to simplify the ongoing effort of data maintenance.

  • Establish a regularly scheduled data maintenance window.  Just like the application maintenance world, identify a window of time when data maintenance can be applied without impacting application processing or end user access
  • Collect and publish data quality details.  Profile and track the content of the major subject area tables within your data warehouse environment (see the profiling sketch after this list). Any significant shift in domain values, relationship details, or data demographics can be discovered before a user calls to report an undetected data problem
  • Keep the original data.  Most data quality processing overwrites original content with new details.  Instead, keep the cleansed data and place the original values at the end of your table records. While this may require a bit more storage, it will dramatically simplify maintenance when rule changes occur in the future
  • Add source system identification and creation date/time details to every record.  While this may seem tedious and unnecessary, these two fields can dramatically simplify maintenance and troubleshooting in the future
  • Schedule a regular data change control meeting.  This too is similar in concept to the change control meeting associated with IT operations teams.  This is a forum for discussing data content issues and changes
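As a rough illustration of the second bullet, here’s a minimal profiling sketch in Python. The file names, table, and columns are hypothetical; the idea is simply to snapshot row counts, null rates, and distinct-value counts for a subject area table on a schedule and compare them against the previous run, so a shift gets noticed before a business user reports it.

    # Minimal, hypothetical data-profiling sketch: snapshot basic demographics
    # for a table and flag large swings versus the previous snapshot.
    import csv
    import json
    from collections import Counter

    def profile(rows, columns):
        """Return row count, null rate, and distinct-value count per column."""
        stats = {c: {"nulls": 0, "values": Counter()} for c in columns}
        total = 0
        for row in rows:
            total += 1
            for c in columns:
                value = row.get(c)
                if value in (None, "", "NULL"):
                    stats[c]["nulls"] += 1
                else:
                    stats[c]["values"][value] += 1
        return {"row_count": total,
                "columns": {c: {"null_pct": (stats[c]["nulls"] / total) if total else 0.0,
                                "distinct_count": len(stats[c]["values"])}
                            for c in columns}}

    def compare(current, previous, threshold=0.10):
        """List the metrics that moved more than the threshold since the last run."""
        alerts = []
        for col, cur in current["columns"].items():
            prev = previous.get("columns", {}).get(col)
            if not prev:
                continue
            for metric in ("null_pct", "distinct_count"):
                old, new = prev[metric], cur[metric]
                if old and abs(new - old) / old > threshold:
                    alerts.append("%s.%s: %s -> %s" % (col, metric, old, new))
        return alerts

    # Example usage against a nightly extract of a (hypothetical) customer table
    with open("customer_extract.csv") as f:
        current = profile(csv.DictReader(f), ["customer_id", "postal_code", "segment"])
    with open("customer_profile_prev.json") as f:
        previous = json.load(f)
    print("\n".join(compare(current, previous)) or "No significant shifts")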

Unfortunately, I often find that data maintenance is completely ignored. The problem is that fixing broken or inaccurate data isn’t sexy; developing a data maintenance plan isn’t always fun.   Most data warehouse development teams are buried with building new reports, loading new data, or supporting the ongoing ETL jobs; they haven’t given any attention to the quality or accuracy of the actual content they’re moving and reporting.   They simply don’t have the resources or time to address data maintenance as a proactive activity.

Business users clamor for new data and new reports; new funding is always tied to new business capabilities.  Support costs are budgeted, but they’re focused on software and hardware maintenance activities.  No one ever considers data maintenance; it’s simply ignored and forgotten.

It’s interesting that we view data as a corporate asset – a strategic corporate asset – while there’s universal agreement that hardware and software are simply tools that enable its use.  And where are we investing in maintenance?  In the commodity tools, not the strategic corporate asset.

Photo courtesy of DesignzillasFlickr via Flickr (Creative Commons license).

The Problem with Total Cost of Ownership


The issue of Total Cost of Ownership (TCO) seems to come and go every few years. The need for it tends to ebb and flow with corporate budget cycles. TCO is perfectly fine for well-understood commodity functions or defined business processes. If I have to replace a server or a printer, or change a business process, TCO is a perfectly rational metric for comparing different alternatives.

When TCO calculations work, they tend to roll up within a single organization or manager. The hardware, the software, the installation, and the maintenance are under the domain of a single organization that covers the direct cost.

The problem with TCO arises when it’s used as a metric for justifying cross-functional or analytical systems. With these systems, the value isn’t delivering commodity processing but rather supporting decision making. TCO focuses on construction and maintenance costs. But for analytical systems, usage occurs across different organizations and varies with business value and need. TCO can in fact be misapplied.

At a simple level, TCO is often limited to the processing hardware, storage, software, and IT resources necessary to configure and manage the platform on an ongoing basis.  But those resource costs usually cover only the IT staff focused on system development and maintenance. Unfortunately, the most expensive cost—not normally included in TCO calculations—is the business user’s time. While TCO quantifies costs for a data warehouse developer, there is no clear way to calculate costs for the dozens or hundreds of business users who are actually analyzing data and creating reports every day. The reality of analytical systems is that development continues every day on the business side.
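A back-of-the-envelope comparison (with entirely hypothetical numbers) illustrates what that omission can hide: the annual cost of business users’ time spent working around the platform can exceed every line item a conventional TCO calculation captures.

    # Back-of-the-envelope TCO illustration; every figure below is hypothetical.
    platform_tco = {
        "hardware_amortized": 250_000,   # per year
        "software_licenses":  300_000,
        "it_staff":           400_000,   # DBAs, ETL developers, administrators
    }

    # What TCO usually leaves out: business users' time lost to slow-running
    # queries, reruns, and reconciling questionable data.
    users = 200
    hours_lost_per_week = 2      # per user
    loaded_hourly_rate = 75      # fully loaded cost of an analyst hour

    business_usage_cost = users * hours_lost_per_week * 52 * loaded_hourly_rate

    print("Platform TCO:        $%s" % format(sum(platform_tco.values()), ","))
    print("Business usage cost: $%s" % format(business_usage_cost, ","))
    # Platform TCO:        $950,000
    # Business usage cost: $1,560,000

The point isn’t the particular numbers; it’s that the largest cost driver sits outside the organization that owns the TCO spreadsheet.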

Nevertheless it’s common for TCO calculations to be reduced to the cost of processing or storage, rather than reflecting the exponential costs of users circumventing slow-running queries and inaccurate data.  At the end of the day, TCO shouldn’t only be about the cost of hardware and software installation and maintenance. It should be about the cost of continued business usage.

photo by -Luz- via Flickr (Creative Commons license)

Complex Event Processing: Challenging Real-Time ETL


Unless you’ve been hiding in a cave for the past year, you’ve probably heard of CEP (Complex Event Processing) or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology allows users to query and manipulate discrete pieces of information, like events and messages, in real time—without being encumbered by a traditional database management system.

The analogy here is that if you can’t bring Mohammed to the mountain, bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record?  Furthermore, this lets me analyze the data right after its time of creation! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?

I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.

When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better approach than traditional ETL. CEP provides complex data manipulation directly against the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own process. This prevents the unnecessary or reckless copying of source application content.

Many industries need a real-time view of customer activities. For instance, in the gaming industry, when a customer inserts her card into a slot machine, the casino wants to present a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, transform and standardize it, and load it into a table, only to make it available to a query that identifies the best offer.  In the world of CEP, we’d simply query the initial message and make the best offer.
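A stripped-down sketch of that event-at-a-time idea might look like the Python below. The message fields and the offer rule are hypothetical, and a real CEP engine would express this as a continuous query rather than hand-written code; the point is that each message is evaluated as it arrives, with nothing staged in a database first.

    # Hypothetical event-at-a-time processing sketch; message fields and the
    # offer rule are illustrative, not any particular CEP product's syntax.
    import json

    def best_offer(event):
        """Evaluate a single card-insert event the moment it arrives."""
        if event.get("tier") == "platinum":
            return "comp_dinner"
        if event.get("coin_in_today", 0) > 500:
            return "free_play_25"
        return "drink_voucher"

    def on_message(raw_message):
        event = json.loads(raw_message)   # one discrete message, no staging table
        print("player=%s offer=%s" % (event["player_id"], best_offer(event)))

    # Example: a card-insert event streaming off the casino floor
    on_message('{"player_id": "P-1047", "tier": "gold", "coin_in_today": 640}')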

Many ETL tools already use query language constructs and operators to manipulate data, but they typically require the data to be loaded into a database first. The major vendors have evolved toward an “ELT” architecture that leverages the underlying database engine to address performance. Why not simply tackle the performance problem directly and bypass the database altogether?

The promise of CEP is a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher-performance and easier-to-use alternative. The interesting part will be seeing how long before companies emerge from their caves and adopt it.

photo by Orin Zebest via Flickr (Creative Commons license)

The Flaw of the Data Inventory


Back when I was applying to college, I’d read over college catalogs. Inevitably, each university would mention the number of books it had in its library. When I finally went to college, I realized that this metric was fairly meaningless. A dozen volumes on Grecian pottery did me no good when I was in search of a book on polymers for my mechanical engineering class.

Clients will often ask us to scope a “data inventory” project, inevitably focused on identifying and describing all the data elements contained across their different application systems. Recently a new CIO asked us to head up a “tiger team” to inventory his company’s data. He was surprised at the quantity of information needs that had been sent his way. As expected, he inquired about systems of record and data dictionaries. As you can imagine, he received multiple and conflicting answers, which only exacerbated his confusion.

As a point of reference, well-known ERP systems can have in excess of 50,000 discrete data elements in their databases (never mind that some aren’t in English). As I’ve written in the past, many of these data elements have no use outside of the application itself.

Having terabyte upon terabyte of information is equally irrelevant if that data is unrelated to current business issues. The problem with a data inventory activity is that identifying and counting data elements in different systems and applications won’t necessarily solve any problems. Why? Because data across applications and packages is inconsistent: there are different names, definitions, and values, and there is no practical means of determining which data they actually have in common. This is like going to the hardware store and looking for a specific screw, but all the different screws are in one big barrel—you end up having to pick through each screw, one at a time. When you find the screw, you just throw all the other screws back into the barrel.

The point of a data inventory isn’t to pick through data because it exists, but to inventory the data people actually need. If you’re going to undertake a data inventory, your output should be structured so that the next person doesn’t have to repeat your work.  Identify the data that is moving across various systems, as this indicates key information that’s being shared. Categorize this data by subject area. You’ll inevitably find that there are inconsistent versions of the data, enabling you to identify data disparities. You can then begin to develop a catalog of key corporate data that will form the basis of your data dictionary.
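Here’s a minimal sketch of that approach in Python. The interface names, systems, and field lists are hypothetical; the point is to catalog the data that actually moves between systems and to surface the elements that show up in multiple feeds (and under inconsistent names).

    # Hypothetical sketch: catalog the fields that move between systems and
    # flag the elements that are shared (and inconsistently named) across feeds.
    from collections import defaultdict

    # Each entry describes one interface feed and the fields it carries.
    interfaces = {
        "orders_to_dw":  {"source": "OrderEntry", "target": "DW",
                          "fields": ["cust_id", "order_dt", "order_amt"]},
        "crm_to_dw":     {"source": "CRM", "target": "DW",
                          "fields": ["customer_number", "cust_name", "segment"]},
        "dw_to_finance": {"source": "DW", "target": "Finance",
                          "fields": ["cust_id", "order_amt", "gl_account"]},
    }

    # Synonym map built while reviewing the feeds: different names, same element.
    synonyms = {"customer_number": "cust_id"}

    usage = defaultdict(list)
    for feed, spec in interfaces.items():
        for field in spec["fields"]:
            usage[synonyms.get(field, field)].append("%s (%s)" % (feed, field))

    # Elements appearing in more than one feed are the shared, high-value data;
    # elements listed under more than one raw name are the naming disparities.
    for element, feeds in sorted(usage.items(), key=lambda kv: -len(kv[1])):
        marker = "*" if len(feeds) > 1 else " "
        print("%s %s: %s" % (marker, element, ", ".join(feeds)))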

Inventorying the data that moves between systems accomplishes two things: it identifies the most valuable data elements in use, and it will also help identify data that’s not high-value, as it’s not being shared or used. This approach also provides a way to tackle initial data quality efforts by identifying the most “active” data used by the business. It ultimately helps the data management team understand where to focus its efforts, and prioritize accordingly.

So next time someone suggests a data inventory without context or objectives, consider sending them to college to study Grecian urns.

So You Think You’re Ready for a Data Warehouse Appliance, Part 2


As I wrote in last week’s blog post, a data warehouse appliance simplifies platform and system resource administration. It doesn’t simplify the traditional time-intensive efforts of managing and integrating disparate data and addressing performance and tuning of various applications that contend for the same resources.

Many data warehouse appliance vendors offer sophisticated parallel processing environments, query optimization, and specialized storage structures to improve query processing (e.g., columnar-based engines). It’s naïve to think that taking data from an SMP (Symmetric Multi-Processing) relational database and moving it into a parallel processing environment will scale effectively without any adjustments or changes. Moving onto an appliance can be likened to moving into a new house.  When you move into a new, larger house, you quickly learn that it’s not as simple as dumping all of your stuff into the new house.  The different dimensions of the new rooms make you realize that some of your old furniture and rugs simply don’t fit.  You inevitably have to make adjustments if you want to truly enjoy your new home.  The same goes for a data warehouse appliance: it likely has numerous features to support growth and scalability, but you have to make adjustments to leverage their benefits.

Companies that expect to simply dump their data from a few legacy data marts over to a new appliance should expect to confront some adjustments, or they’re likely to experience some unpleasant surprises. Here are some that we’ve already seen.

Everyone agrees that the biggest cost issue behind building a data warehouse is ETL design and development. Hoping to migrate existing ETL jobs into a new hardware and processing environment without expecting rework is short-sighted.  While you can probably force fit your existing job streams, you’ll inevitably misuse the new system, waste system resources, and dramatically reduce the lifespan of the appliance. Each appliance has its own way of handling the intensive resource requirements of data loading – in much the same way that each incumbent database product addresses these same situations. If you’ve justified an appliance through the benefits of consolidating multiple data marts (that contain duplicate data), it only makes sense to consolidate and integrate the ETL processes to prevent processing duplication and waste.

To assume that because you’ve built your ETL architecture with the latest and greatest ETL software technology you won’t have to review the underlying ETL architecture is also misguided.  While there’s no question that migrating tool-based ETL jobs to a new platform can be much easier than migrating lower-level code, the issue at hand isn’t the source and destination – it’s the underlying table structures.  Not every table will change in definition on a new platform, but the largest (and most used) tables are the most likely candidates for review and redesign.  Each appliance handles data distribution and database design differently. Consequently, since the underlying table structures are likely to require adjustment, plan on a redesign of the actual ETL process too.
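To see why those table structures deserve a second look, here’s a toy Python sketch of hash distribution, the mechanism most MPP appliances use to spread a table’s rows across nodes. The column choices and node count are hypothetical; the point is that a poorly chosen distribution key concentrates rows, and therefore load and query work, on a handful of nodes.

    # Toy illustration of MPP hash distribution; column values and node count
    # are hypothetical. The appliance hashes the distribution key to pick a node.
    from collections import Counter
    import zlib

    NODES = 8

    def node_for(key):
        return zlib.crc32(str(key).encode()) % NODES

    # Distributing a sales table on order_id (high cardinality) spreads the
    # rows evenly across all eight nodes...
    even = Counter(node_for(order_id) for order_id in range(100_000))

    # ...while distributing on region_code (a handful of values) piles the same
    # rows onto a few nodes, which then become the bottleneck.
    skewed = Counter(node_for(region)
                     for region in ["EAST", "WEST", "NORTH", "SOUTH"]
                     for _ in range(25_000))

    print("order_id key:", sorted(even.values()))
    print("region key:  ", sorted(skewed.values()))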

I’m also surprised by the casual attitude regarding technical training.  After all, it’s just a SQL database, right? But application developers and data warehouse development staff need to understand the differences of the appliance product (after all, it’s a different database version or product).  Most of this knowledge can be gained by reading the manuals – but when was the last time your DBAs or database developers actually had a full set of manuals, much less the time required to read them?  The investment in training isn’t significant—usually just a few days of classes. If you’re going to provide your developers with a product that claims to be bigger, better, and faster than its competitors, doesn’t it make sense to prepare them adequately to use it?

There’s also an assumption that—since most data warehouse appliance vendors are software-only—there are no hardware implications. On the contrary, you should expect to change your existing hardware. The way memory and storage are configured on a data warehouse appliance can differ from a general-purpose server, yet it’s still rare that hardware costs are factored into the development plan. And believing that older servers can be repurposed has turned out to be a myth.  If you’re attempting to support more storage, more processing, and more users, how can using older equipment (with its higher maintenance costs) make financial sense?

You could certainly fork-lift your data, leave all the ETL jobs alone, and not change any processing.  Then again, you could save a fortune on a new data warehouse appliance and simply do nothing. After all, no one argues with the savings associated with doing nothing—except, of course, the users that need the data to run your business.

photo by Bien Stephenson via Flickr (Creative Commons License)

So You Think You’re Ready for a Data Warehouse Appliance


Many of our clients have asked us whether it’s time to consider replacing their aging data warehouses with data warehouse appliance technologies. I chalk up this emerging interest to the reality that data warehouse life spans are 3 to 4 years and platforms need to be refreshed – especially given the recent crop of announcements by vendors like Oracle and Teradata, along with the high visibility of newer players like Netezza, Paraccel, and Vertica.

The benefit of a data warehouse appliance is that it includes all of the hardware and software in a preconfigured solution that dramatically simplifies running and managing a data warehouse. (Some vendors have taken that one step further and actually sell software that is set up to work with specially defined commodity hardware configurations.) Given the price/performance differences between the established data warehouse products and the newer data warehouse appliances, it only makes sense that these products be considered as alternatives to simply upgrading the hardware.

The data warehouse appliance market is arguably not new. In the 1980s companies like Britton-Lee and Teradata argued that database processing was different and would perform better with purpose-designed hardware and software. Many have also forgotten these pioneers argued that the power of commodity microprocessors vastly exceeded the price/performance of their mainframe processor competitors.

The current-generation appliance vendors have been invited to the table because of the enormous costs of managing the data volumes and operational access associated with today’s data warehouses. Most IT shops have learned that database scalability doesn’t just mean throwing more hardware and storage at the problem.  The challenge in managing these larger environments is understanding the dynamics of the data content and the associated processing.  That’s why partitioning the data across multiple servers or simply removing history doesn’t work – for every shortcut taken to reduce the data quantity, there’s an equal impact to user access and the single version of truth.  This approach also makes data manipulation and even system support dramatically more complicated.

It’s no surprise that these venture-capital-backed firms would focus on delivering a solution that was simpler to configure and manage.  The glossy sales message of data warehouse appliance vendors comes down to something like: “We’ve reduced the complexity of running a data warehouse.  Just install our appliance like a toaster, and watch it go!”   There’s no question that many of these appliance vendors have delivered when it comes to simplifying platform management and configuration; the real challenge is addressing the management and configuration issues that impact a growing data warehouse: scalable load processing, a flexible data architecture, and manageable query processing.

We’ve already run into several early adopters who think all that’s necessary is to simply fork-lift their existing data warehouse structures onto their new appliance.  While this approach may work initially, the actual longevity of the appliance – or its price/performance rationale – will soon evaporate.  These new products can’t work around bad data, poor design habits, and the limitations of duplicate data; their power is providing scalability across enormous data and processing volumes.  An appliance removes the complexities of platform administration. But no matter what appliance you purchase, and no matter how much horsepower it has, data architecture and data administration are still required.

In order to leverage the true power of an appliance, you have to expect to focus effort on integrating data in a structure that leverages the scalability strengths of the product. While the appliances are SQL-based, the way they process loads, organize data, and handle queries can be dramatically different from that of the incumbent data marts and data warehouses. It’s naïve to think that a new appliance can provide processing scalability without any adjustments.  If it were that simple, the incumbent vendors would have already packaged that into their existing products.


In Part 2 of this post, I’ll elaborate on the faulty assumptions of many companies that acquire data warehouse appliances, and warn you against making these same mistakes.


photo by meddygarnet via Flickr (Creative Commons License)

Good Data Warehouse DBAs are Hard to Find


As a consultant, I’m often asked how roles and responsibilities should be delegated or identified within the IT organization to support data warehousing.  One role that seems to incite discussion is the role of the database administrator (DBA). Many of our clients pack a series of roles and responsibilities into a laundry list of heterogeneous tasks. However, I usually recommend delineating DBA resources and assigning individuals to either transactional systems or data warehousing, an approach that tends to surprise my clients.  Let me explain why I see the roles across these different systems requiring entirely different skills.

DBAs are typically focused on the care and feeding of the DBMS to ensure that processing is consistent and performance is maintained regardless of the circumstances.  The DBA is typically responsible for establishing table structures, configuring database systems, and designing queries and execution jobs to efficiently utilize the system.  Users don’t like surprises; they want daily activities to complete at a specific time every day. Their jobs depend on timely information.  Many IT managers assume that if a DBA understands how a particular DBMS works, that DBA can address both transactional and analytical responsibilities.  It’s a risky assumption.

Assuming that someone who can design and manage a transactional database environment is therefore qualified to design and manage an analytical system is flawed. The details associated with designing or troubleshooting a sophisticated transactional system have little in common with an enterprise, cross-functional data warehouse.

Just because someone is a mechanic doesn’t mean they can fix or repair any type of car.  The mechanic who can fix  a diesel pickup is unlikely to be able to repair the engine of an 18-wheeler. While the basic skills are the same, the actual situations and experience required to solve specific problems are dramatically different.  Just because someone is a DBA doesn’t mean they can design or support any type of application/database system.

The design of a transactional application is typically preceded by detailed transaction and data specifications.  Because a transactional system supports specific business processes, the actual transactions, data details, and processing volumes are well understood prior to development.  Most queries are single statement queries accessing individual records within a single table.  It’s critical that the workload be well defined because of the enormous costs associated with these applications.  Consistent response time is critical.  Managing a system requires attention to transaction quantities, query plans, and data volumes to ensure that data and processing is distributed across the system’s resources.  Users are often grouped by specific application (or privileges) and while processing can vary across different applications, users are usually homogeneous.  System growth occurs with more users and the additional transaction volumes.   While ad-hoc processing or table joins are technically feasible, they are rarely supported. 

Contrast that with analytical systems. These DBAs have an entirely different set of challenges when undertaking development.  Database design is often undertaken with the knowledge that the content of the database will change.  It’s not uncommon for BI systems to start by supporting a single subject area only to grow exponentially in size due to the growth of additional subject areas and data volumes. The DBA designs data structures based on current and future data content needs and must also address the divergent processing needs of data loading and complex query processing.

Managing an analytical system also differs because of the variety of user processing.  It’s not uncommon for a data warehouse to support numerous canned reports or queries along with a category of power users generating ad-hoc queries.  The challenge is preventing a single ad-hoc query from crippling the processing of the entire system.  Multi-statement queries, numerous table joins, and large volumes of historical content are commonplace in analytical environments.
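A simple way to see the contrast is to put a typical query from each world side by side. The table and column names below are hypothetical, but the shape of the work is representative: a transactional request touches one row by its key, while an analytical request joins and aggregates years of history.

    # Illustrative only: hypothetical table and column names, but representative
    # of the two workload shapes a DBA has to design for.

    transactional_query = """
        -- OLTP: single-row lookup by primary key; sub-second, thousands per minute
        SELECT order_status
        FROM   orders
        WHERE  order_id = 991487;
    """

    analytical_query = """
        -- Analytical: multi-table join and aggregation over years of history;
        -- one runaway version of this can starve every other user on the system
        SELECT c.segment, d.fiscal_quarter, SUM(f.sales_amt) AS total_sales
        FROM   sales_fact f
        JOIN   customer_dim c ON c.customer_key = f.customer_key
        JOIN   date_dim d     ON d.date_key     = f.date_key
        WHERE  d.fiscal_year BETWEEN 2007 AND 2009
        GROUP  BY c.segment, d.fiscal_quarter;
    """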

It becomes fairly clear that the role of a DBA is very different when comparing the work activities of analytical and operational systems.  I’m not suggesting that working in one environment is more complex or difficult than the other—they’re just different.  Thus the activities and their associated skills are very different.  That’s why we often point out that a single individual may be hard-pressed to support both operational and analytical environments.

Can one person address both responsibilities? Maybe. But first, try contacting your diesel mechanic and see if he’s interested in becoming your operational system DBA.

photo by Kerry 2009 via Flickr.

The Rise of the Columnar Database

photo by eflon

I’m continually surprised that more vendors haven’t hurled themselves onto the columnar database bandwagon. The more this space matures, the more evident it becomes that analytics is a perfect match for column-based database architectures.

One of the most frustrating phenomena in IT is adherence to a theoretical view. In the 1970s the entire relational database industry implemented what was really an academic precept. For those pragmatists who haven’t dusted off their textbooks recently, I’ll recall the writings of Codd and Date. They introduced the concept of organizing data in tuples: primary values stored along with their descriptive details (aka attributes). Vendors interpreted this to mean that data should be physically stored in this fashion, architecting their products to store data in tables populated with rows consisting of columns. If you wanted to access a value, you had to retrieve the entire row.

With all due respect, this approach has been cumbersome since Day 1. The fact is, storing data the way the business looks at it doesn’t lend itself to the way people ask questions. When I create an outbound marketing list, I need a name, a phone number, and an address. I don’t need information on household, demographic segment, or the name of a customer’s dog.

While I do need to store all the customer data, I don’t want to be bogged down by processing all that data in order to answer my question. Herein lies the quandary: do I structure the data based on all the information we have, or based on the information I might access?

Vendors have tried to bridge the gap. We’ve seen partitioning, star indexes, query pre-processing, bitmap and index joins, and even hashing in an attempt to support more specific data retrieval. Such solutions still require examining the contents of the entire row.

Although my background is in engineering, I know enough about Occam’s razor to know that it applies here: the simplest solution is the best one. Vendors like Kickfire, Vertica, Paraccel, and Sybase—whose pioneering IQ product launched over a dozen years ago—went back to the drawing board and fixed the problem, architecting their products to structure and store data the way people ask questions—in columns.

For you SQL jockeys, most of the heavy lifting in database processing is in the WHERE clause. Columnar databases are faster because their processing isn’t inhibited by unnecessary row content. Because many database tables can have upwards of 100 columns, and because most business questions only request a handful of them, this just makes business sense. And in these days of multi-billion-row tables and petabyte-sized systems, columnar databases make more sense than ever.
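A toy sketch makes the difference concrete (the customer table below is hypothetical). In a row layout, answering a three-column question still means materializing every field of every row; in a column layout, each column lives in its own contiguous structure, so the same question touches only the columns it names.

    # Toy row-store vs. column-store comparison; the customer table is hypothetical.
    rows = [
        {"name": "Ann Lee", "phone": "555-0101", "address": "12 Elm St",
         "household_id": 77, "segment": "B2", "dog_name": "Rex"},
        {"name": "Bo Chen", "phone": "555-0188", "address": "9 Oak Ave",
         "household_id": 31, "segment": "A1", "dog_name": "Moxie"},
    ]

    # Row store: every field of every row is read to answer the question.
    mailing_list_row_store = [(r["name"], r["phone"], r["address"]) for r in rows]

    # Column store: each column is stored contiguously, so the query reads only
    # the three columns it names and skips household, segment, and dog entirely.
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    mailing_list_col_store = list(zip(columns["name"], columns["phone"], columns["address"]))

    assert mailing_list_row_store == mailing_list_col_store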

As the data warehouse market continues to consolidate through acquisitions, look for column-based startups—including several open-source solutions—to fill the void. If you ask me, there’s plenty of room.
