Improving Data Integration the Old Fashioned Way

IT organizations have spent enormous sums of money over the past 10-15 years attacking the productivity problem.  They’ve acquired data integration tools, implemented improved development methodologies, and even reengineered requirements gathering to ensure alignment with business priorities. And the result of all of this investment?  Today’s data integration developers are easily 10x to 20x more productive than the COBOL programmers of the past. This shouldn’t surprise anyone – writing, compiling, linking, and testing 3rd-generation code is much slower than working with today’s GUI-based, drag-and-drop development tools.   The tools work; developers are faster and better.

So, why does it still seem to take an eternity and cost a fortune to acquire and integrate new data into an existing report?   The bottleneck has moved upstream: finding and extracting source data is complicated and time consuming.  We’ve invested in our Integration Competency Centers to create an assembly line that streamlines the process of transforming and converting data that is loaded into databases or applications.  Unfortunately, we’ve devoted almost no effort to simplifying access to, or understanding of, the raw source data that feeds the assembly line.

Henry Ford didn’t invent the assembly line; he revolutionized it. One of the changes he introduced was simplifying and standardizing both the parts and the assembly process itself. Before Ford’s assembly line, building a car was a custom effort that required highly trained craftsmen to shape, tool, and fit parts by hand – a very time-consuming process. The parts weren’t always uniform, so the craftsmen had to spend a significant amount of time fitting them together.

In most IT environments, source system access and data content vary dramatically across the different application systems.  This forces developers to become data craftsmen in order to deal with the idiosyncrasies of the numerous source systems common to most companies. Every system stores data in its own unique manner, and it takes a lot of time to search and analyze source system data to identify the necessary content.  (A popular ERP package stores its details in more than 10,000 tables.) So each new request often requires developers to write code from scratch to access and manipulate data from a source system. If you dig a bit, you’ll probably find that many of your application systems generate dozens or even hundreds (yes, hundreds) of custom extracts to deliver data to the various production business needs within your company.

While most folks might think that custom extracts are a reasonably decent solution, they’re not.  In fact, they’re a problem that will only get worse with time.  (Remember, every extract requires development time and ongoing support.)  You’ll be better off consolidating all of those extracts into a single set that includes all of the data.  This will reduce processing time, reduce storage, reduce maintenance, and ultimately save a lot of money. You’ll have to spend some time designing and building these new extracts and getting folks to migrate to them, but the benefits will be significant. (One of my clients was able to defer a platform upgrade because of the CPU and storage savings from consolidating and removing all of the custom extracts.)

Standardizing source data to reduce the data craftsmen problem isn’t rocket science, but it’s more than simply creating a data dump or generating a backup file.  You need to deliver data in a manner that can be quickly and easily consumed by other systems.  This means the content needs to be reformatted from the unique (sometimes indecipherable) format of the host application into a format that everyone else can use. This can easily be addressed by delivering data into database tables or flat files (I know one client that delivers data in tab-delimited files).  The data should reflect the values generated by the source system in a format that everyone can understand – the content shouldn’t be modified or cleansed (this is source data, not content ready for business consumption). Delivery should occur on a frequent and regular basis, along with a plan for archiving a decent amount of history.
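
To make this concrete, here’s a minimal sketch of what a standardized extract job might look like. The table names, the landing directory, and the sqlite3 connection are all hypothetical stand-ins, not anyone’s actual system; the point is simply dated, tab-delimited files with a header row, delivered on a regular schedule.

```python
import csv
import sqlite3  # stand-in for any DB-API 2.0 connection to the source system
from datetime import date
from pathlib import Path

# Hypothetical example: which source tables to publish as standard extracts.
TABLES_TO_PUBLISH = ["orders", "customers"]
EXTRACT_DIR = Path("/data/extracts")  # shared landing area other teams can read


def publish_extracts(conn, run_date=None):
    """Write each source table to a dated, tab-delimited file.

    The data is delivered as-is (no cleansing), on a regular schedule,
    and older files are kept so consumers have history to work with.
    """
    run_date = run_date or date.today()
    EXTRACT_DIR.mkdir(parents=True, exist_ok=True)
    for table in TABLES_TO_PUBLISH:
        cursor = conn.execute(f"SELECT * FROM {table}")
        columns = [col[0] for col in cursor.description]
        out_file = EXTRACT_DIR / f"{table}_{run_date:%Y%m%d}.tsv"
        with out_file.open("w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            writer.writerow(columns)   # header row so consumers see column names
            writer.writerows(cursor)   # raw source values, unmodified


if __name__ == "__main__":
    # In practice this would run on a schedule (cron, batch scheduler, etc.).
    connection = sqlite3.connect("source_system.db")  # placeholder source
    publish_extracts(connection)
```

Scheduled daily, a job along these lines gives downstream teams one predictable place to find raw source data instead of dozens of bespoke extracts.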

This isn’t a new concept; it was a common approach in the days when custom-coded IBM mainframe applications were all the rage. Back then, data sharing was a priority and every application generated standard extracts to reduce I/O and storage costs.  There was also an extreme sensitivity to developer time: requesting a custom extract was frowned upon and rarely approved.  Finding and accessing data was as simple as referencing the extract files made available by every application system.

When it comes to improving the delivery speed of new data to business users, maybe we can learn something from Henry Ford and the world of mainframe development.


Data Governance: Managing Data as an Asset

 

I always find it interesting when people pile onto the company’s latest and most popular project or initiative. People love to gravitate to whatever is new and sexy within the company, regardless of what they’re working on or their current responsibilities. There never seems to be a shortage of the “bright shiny object” syndrome – you know, organizational ADHD.  This desire to jump on the bandwagon often positions individuals with limited experience to own and drive activities they don’t fully understand. The world of data governance is rife with supporters and promoters who are thrilled to be involved but a bit unprepared to participate and execute.  It’s like loading a gun and pulling the trigger before aiming – you’ll make a lot of noise and likely miss the target.  If only folks spent a bit of time educating others about the meaning and purpose of data governance before they got started.

Let me first offer up some definitions from a few reputable sources…

“Data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise” (Wikipedia)

“The process by which an organization formalizes the ‘fiduciary duty’ for the management of data assets”  (Forrester Research)

 “…the overall management of the availability, usability, integrity, and security of the data employed in an enterprise” (TechTarget)

For those of you who have experience with data governance, the above definitions are unlikely to be much of a surprise.  For the other 99%, there’s likely to be some head scratching. I actually think most folks who haven’t been indoctrinated into the religion of data have simply assumed that data governance is a new incarnation of yesterday’s data quality or metadata discussion.  That probably shouldn’t be much of a surprise; the discussion of data inaccuracy and data dictionaries has gotten so much air time over the past 30 years that the typical business user probably feels brainwashed when they hear anything with “data” in the title.  I actually think data governance may win the prize for being among the most misunderstood concepts within Information Technology.

Data governance is a very simple concept.  It’s about establishing the processes for accessing and sharing data, and resolving conflict when those processes don’t work.

A Data Governance initiative is really about instilling the concept of managing data as a corporate asset. Companies have standard methods and processes for asset management: your Procurement group has a slew of rules and processes to support the purchasing of office supplies; the HR organization has rules and guidelines for hiring and managing staff; and the finance organization follows “generally accepted accounting principles” to handle managing the company’s fixed and financial assets.  Unfortunately, what we don’t have is a set of generally accepted principles for data. This is what data governance establishes.

The reason you see the term process in nearly every definition of data governance is that until you establish and standardize data-related processes, you’ll never get any of the work done. Getting started with data governance isn’t about establishing a committee – it’s about identifying the goals and the policies and processes that will direct the work activities. You can’t be successful in managing an asset if everyone has their own rules and methods for accessing, manipulating, and using it.  This isn’t rocket science – geez – the world of ERP implementations and even business reengineering projects learned this lesson more than 10 years ago.

The reason to manage data as a corporate asset is to ensure that business activities that require data are able to access and use it in a simple, uniform, consistent manner.  Unfortunately, even in the era of search engines, content indexing, data warehouses, and the Cloud, finding and acquiring data to support a new business need can be painful, time consuming, and expensive.  Everyone has their own terms, their own private data stash, and their own rules dictating who is and isn’t allowed to access data.  This isn’t corporate asset management – this is corporate asset chaos.  A data governance initiative is one of the best ways to get started in managing data as a corporate asset.

The Time Has Come for Enterprise Search

Maybe it’s time to challenge the 20-year-old paradigm of making everyone a knowledge worker. For a long time the BI community has assumed that if we give business users the right data and tools, they’ll have the necessary ammunition to do their jobs. But I’m beginning to believe that may no longer be a practical approach. At least not for everyone.

One thing that’s changed in the last dozen-or-so years is that individuals’ job responsibilities have become more complex, and their breadth has grown. I question whether the average business user can really keep track of all the subject area content – every table definition, column name, data type, column definition, and the location of every value across the 6,000+ tables in the data mart.

And that’s just the data mart. I’m not even including the applications and systems the average business user interacts with on a daily basis. Not to mention all those presentations, documents, videos, and archived e-mails from customers.

I’m not arguing the value of analytics, nor am I challenging the value of the data warehouse. But is it really practical to expect everyone to generate their own reports? Look at the U.S. tax code. It’s certainly broader than a single CPA can keep track of. Now consider most companies’ Finance departments. There’s more data coming out of Finance than most people can deal with. Otherwise all those specialized applications and dedicated data analysts wouldn’t exist in the first place!

Maybe it’s not about delivering BI tools to every end-user. Maybe it’s about delivering reports in a manner that can be consumed. We’ve gotten so wound up about detailed data that we haven’t stopped to wonder whether it’s worthwhile to push all that detail to the end-user’s desktop – and then expect him or her to learn all the rules.

One of my brokerage accounts contains 5 different equities. I don’t look at them every day. I don’t look at intra-day price changes. I really don’t need to. All I really want to know, when I do look, is whether the stock’s value has gone up or down. And how do I get that information? Not by building a custom report, and not by drilling down or drilling across. I go to the web and search on the stock price.

Maybe instead of buying a copy of a [name the BI vendor] tool, we simply build a set of standard reports for key business areas (Sales, Marketing, Finance) and publish them. You can publish these reports to a drive, to a server, to a website, to a portal – it shouldn’t matter. People should find the information with a browser. Reports can be stored, indexed, and accessed via an enterprise search engine. Of course, as with everything else, you still need to define terms and metadata so that people understand what they’re reading.
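
As a rough illustration of the idea, here’s a minimal sketch of an inverted index over a folder of published reports. The directory path and file types are purely hypothetical; a real enterprise search product adds crawling, ranking, security, and metadata, but the core mechanics look something like this.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical location where standard reports are published as text files.
REPORT_DIR = Path("/reports/published")


def build_index(report_dir):
    """Build a simple inverted index: word -> set of report files containing it."""
    index = defaultdict(set)
    for report in report_dir.glob("*.txt"):
        words = re.findall(r"[a-z0-9]+", report.read_text(errors="ignore").lower())
        for word in set(words):
            index[word].add(report.name)
    return index


def search(index, query):
    """Return the reports that contain every term in the query."""
    terms = query.lower().split()
    results = [index.get(term, set()) for term in terms]
    return set.intersection(*results) if results else set()


if __name__ == "__main__":
    idx = build_index(REPORT_DIR)
    print(search(idx, "quarterly sales revenue"))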

Whenever people talk about enterprise search functionality, they’re usually obsessing about unstructured data. But enterprise search can deliver enormous value for structured data too. IT departments could be leading the charge if the definition of success were data delivery and usage rather than large infrastructure and technology implementation projects.

The executive doesn’t ask, “What tool did you use to solve this problem?” Instead, she wants to know if the problem has in fact been solved.

The Problem with Total Cost of Ownership


The issue of Total Cost of Ownership (TCO) seems to come and go every few years. The need for it tends to ebb and flow with corporate budget cycles. TCO is perfectly fine for well-understood commodity functions or defined business processes. If I have to replace a server or a printer, or change a business process, TCO is a perfectly rational metric for comparing different alternatives.

When TCO calculations work, they tend to roll up within a single organization or manager. The hardware, the software, the installation, and the maintenance are under the domain of a single organization that covers the direct cost.

The problem with TCO arises when it’s used as a metric for justifying cross-functional or analytical systems. With these systems, the value isn’t delivering commodity processing but rather supporting decision making. TCO focuses on construction and maintenance costs. But for analytical systems, usage occurs across different organizations and varies with business value and need. TCO can in fact be misapplied.

At a simple level, TCO is often limited to the processing hardware, storage, software, and IT resources necessary to configure and manage the platform on an ongoing basis – and the people counted are usually the IT staff focused on system development and maintenance. Unfortunately, the most expensive cost – the one not normally included in TCO calculations – is the business user’s time. While TCO quantifies the cost of a data warehouse developer, there is no clear way to calculate the cost of the dozens or hundreds of business users who are actually analyzing data and creating reports every day. The reality of analytical systems is that development continues every day on the business side.
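
A back-of-the-envelope sketch makes the point. Every number below is made up purely for illustration, not taken from any real TCO study or client engagement.

```python
# Purely illustrative, made-up numbers -- not from any real TCO analysis.
hardware_and_software = 500_000        # annual platform cost
it_staff = 3 * 120_000                 # three developers/admins, fully loaded

# What TCO usually leaves out: business users doing "development" every day.
analysts = 50                          # report writers and analysts
hours_per_week_on_data = 10            # time spent finding, fixing, reworking data
loaded_hourly_rate = 75

business_user_cost = analysts * hours_per_week_on_data * loaded_hourly_rate * 52

traditional_tco = hardware_and_software + it_staff
full_cost = traditional_tco + business_user_cost

print(f"Traditional TCO:        ${traditional_tco:,}")     # $860,000
print(f"Business user time:     ${business_user_cost:,}")  # $1,950,000
print(f"Cost of ongoing usage:  ${full_cost:,}")           # $2,810,000
```

Even with modest assumptions, the user-side cost can dwarf the line items that TCO normally counts.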

Nevertheless, it’s common for TCO calculations to be reduced to the cost of processing or storage rather than reflecting the far larger cost of users working around slow-running queries and inaccurate data.  At the end of the day, TCO shouldn’t only be about the cost of hardware and software installation and maintenance. It should be about the cost of continued business usage.

photo by -Luz- via Flickr (Creative Commons license)

Complex Event Processing: Challenging Real-Time ETL


Unless you’ve been hiding in a cave for the past year, you’ve probably heard of CEP (Complex Event Processing), or data stream analysis. Because a lot of real-time analysis focuses on discrete data elements rather than data sets, this technology lets users query and manipulate discrete pieces of information, like events and messages, in real time – without being encumbered by a traditional database management system.

The analogy here is that if you can’t bring Mohammed to the mountain, you bring the mountain to Mohammed: why bother loading data into a database with a bunch of other records when I only need to manipulate a single record?  Furthermore, this lets me analyze the data the moment it’s created! Since one of the biggest obstacles to query performance is disk I/O, why not bypass the I/O problem altogether?

I’m not challenging data warehousing and historical analysis. But the time has come to apply complex analytics and data manipulation against discrete records more efficiently. Some of the more common applications of this technology include fraud/transaction approval, event pattern recognition, and brokerage trading systems.

When it comes to ETL (Extract, Transform, and Load) processing, particularly in a real-time or so-called “trickle-feed” environment, CEP may actually provide a better approach than traditional ETL. CEP applies complex data manipulation directly to the individual record. There is no intermediary database. The architecture is inherently storage-efficient: if a second, third, or fourth application needs access to a particular data element, it doesn’t get its own copy. Instead, each application applies its own processing to the same stream. This prevents the unnecessary or reckless copying of source application content.

Many industries need a real-time view of customer activities. In the gaming industry, for instance, when a customer inserts her card into a slot machine, the casino wants to present a custom offer. Using traditional data warehouse technology, a significant amount of processing is required to capture the data, transform and standardize it, and load it into a table, only to then make it available to a query that identifies the best offer.  In the world of CEP, we’d simply query the initial message and make the best offer.
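
To illustrate the contrast, here’s a minimal sketch of the slot-machine scenario handled CEP-style: act on each event as it arrives instead of staging it in a table first. The event shape and the offer rules are hypothetical; a real CEP engine would also correlate the event with other recent events over time windows before deciding.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical event shape: what a slot machine might emit on card insert.
@dataclass
class CardInsertEvent:
    customer_id: str
    loyalty_tier: str      # e.g. "gold", "silver", "bronze"
    machine_id: str


def best_offer(event: CardInsertEvent) -> Optional[str]:
    """Decide an offer from the event itself -- no staging table, no batch load.

    Illustrative rules only; real offer logic would be far richer.
    """
    if event.loyalty_tier == "gold":
        return "comped dinner for two"
    if event.loyalty_tier == "silver":
        return "free slot credits"
    return None


def process_stream(events: Iterable[CardInsertEvent]) -> None:
    """Act on each event as it arrives, rather than loading and querying later."""
    for event in events:
        offer = best_offer(event)
        if offer:
            print(f"Offer for {event.customer_id} at {event.machine_id}: {offer}")


if __name__ == "__main__":
    incoming = [CardInsertEvent("C-1001", "gold", "SLOT-17"),
                CardInsertEvent("C-2002", "bronze", "SLOT-04")]
    process_stream(incoming)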

Many ETL tools already use query language constructs and operators to manipulate data, but they typically require the data to be loaded into a database first. The major vendors have evolved toward an “ELT” architecture, leveraging the underlying database engine to address performance. Why not tackle the performance problem directly and bypass the database altogether?

The promise of CEP is a new set of business applications and capabilities. I’m also starting to believe that CEP could actually replace traditional ETL tools as a higher-performance and easier-to-use alternative. The interesting part will be seeing how long it takes companies to emerge from their caves and adopt it.

photo by Orin Zebest via Flickr (Creative Commons license)

The Flaw of the Data Inventory


Back when I was applying to college, I’d read over college catalogs. Inevitably, each university would mention the number of books it had in its library. When I finally went to college, I realized that this metric was fairly meaningless. A dozen volumes on Grecian pottery did me no good when I was in search of a book on polymers for my mechanical engineering class.

Clients often ask us to scope a “data inventory” project, inevitably focused on identifying and describing all the data elements contained across their different application systems. Recently a new CIO asked us to head up a “tiger team” to inventory his company’s data; he was surprised at the quantity of information requests that had been sent his way. As expected, he inquired about systems of record and data dictionaries. As you can imagine, he received multiple and conflicting answers, which only exacerbated his confusion.

As a point of reference, well-known ERP systems can have in excess of 50,000 discrete data elements in their databases (never mind that some aren’t in English). As I’ve written in the past, many of these data elements have no use outside of the application itself.

Having terabyte upon terabyte of information is equally irrelevant if that data is unrelated to current business issues. The problem with a data inventory activity is that identifying and counting data elements in different systems and applications won’t necessarily solve any problems. Why? Because data across applications and packages is inconsistent: there are different names, definitions, and values, and there is no practical means of determining which data they actually have in common. It’s like going to the hardware store looking for a specific screw, but all the different screws are in one big barrel – you end up picking through them one at a time, and when you finally find your screw, you throw all the others back into the barrel.

The point of a data inventory isn’t to pick through data because it exists, but to inventory the data people actually need. If you’re going to undertake a data inventory, structure your output so that the next person doesn’t have to repeat your work.  Identify the data that is moving across various systems, as this indicates key information that’s being shared. Categorize this data by subject area. You’ll inevitably find inconsistent versions of the data, enabling you to identify data disparities. You can then begin to develop a catalog of key corporate data that will form the basis of your data dictionary.

Inventorying the data that moves between systems accomplishes two things: it identifies the most valuable data elements in use, and it also flags data that isn’t high-value because it isn’t being shared or used. This approach provides a way to tackle initial data quality efforts by identifying the most “active” data used by the business. It ultimately helps the data management team understand where to focus its efforts and prioritize accordingly.
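
As one possible sketch of that approach, the snippet below catalogs which data elements appear in the extracts and interfaces moving between systems, assuming (hypothetically) that each interface publishes a simple header file listing the columns it carries. The directory layout and naming convention are illustrative only.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical convention: each interface/extract publishes a header file
# listing the columns it carries, named like "<source>_to_<target>.hdr".
INTERFACE_DIR = Path("/data/interfaces")


def catalog_shared_elements(interface_dir):
    """Map each data element name to the set of interfaces that carry it.

    Elements carried by many interfaces are the ones the business actually
    shares and relies on -- a better starting point than counting every
    column in every application database.
    """
    usage = defaultdict(set)
    for header_file in interface_dir.glob("*.hdr"):
        for column in header_file.read_text().split():
            usage[column.lower()].add(header_file.stem)
    return usage


if __name__ == "__main__":
    catalog = catalog_shared_elements(INTERFACE_DIR)
    # Most widely shared elements first: candidates for the data dictionary
    # and for initial data quality attention.
    for element, interfaces in sorted(catalog.items(),
                                      key=lambda kv: len(kv[1]), reverse=True)[:20]:
        print(f"{element}: used by {len(interfaces)} interfaces")
```

The output is a ranked list of the “active” data elements, which is exactly the prioritized starting point the data management team needs.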

So next time someone suggests a data inventory without context or objectives, consider sending them to college to study Grecian urns.
