Is Your Analytics Data Single Use or Multipurpose?
I just finished reading an article on data pipelines and how this approach to accessing and sharing data will improve and simplify data access for analytics developers and users. The key tenets of the data pipeline approach include simplifying data access by ensuring that pipelines are visible and reusable, and delivering data that is discoverable, shareable, and usable. The article covered the details of placing the data on a central platform to make it available, using open source utilities to simplify construction, transforming the data to make it usable, and cataloging the data to make it discoverable. The idea is that data should be multipurpose, not single use. Building reusable code that delivers source data sets that are easily identified and used has been around since the 1960s. It’s a great idea and even simpler now with today’s technologies and methods than it was 50+ years ago.
The idea of reusable components is a concept that has been in place in the automobile industry for many years. Why create custom nuts, bolts, radios, engines, and transmissions if the function they provide isn’t unique and doesn’t differentiate the overall product? That’s why GM, Ford and others have standard parts that are used across their numerous products. The parts, their capabilities, and specifications are documented and easily referenceable to ensure they are used as much as possible. They have lots of custom parts too; those are the ones that differentiate the individual products (exterior body panels, bumpers, windshields, seats, etc.). Designing products that maximize the use of standard parts dramatically reduces the cost and expedites delivery. Knowing which parts to standardize is based on identifying common functions (and needs) across products.
It’s fairly common for an analytics team to be self-contained and focused on an individual set of business needs. The team builds software to ingest, process, and load data into a database to suit their specific requirements. While there might be hundreds of data elements that are processed, only those elements specific to the business purpose will be checked for accuracy and fixed. There’s no attention to delivering data that can be used by other project teams, because the team isn’t measured or rewarded on sharing data; they’re measured against a specific set of business value criteria (functionality, delivery time, cost, etc.).
This creates the situation where multiple development teams ingest, process, and load data from the same sources for their individual projects. They all function independently and aren’t aware of the other teams’ activities. I worked with a client that had 14 different development teams each loading data from the same source system. None of the teams knew what the others were doing, nor were they aware that there was any overlap. While data pipelining technology may have helped this client, the real challenge wasn’t tooling; it was the lack of a methodology focused on sharing and reuse. Every data development effort was a custom endeavor; there were no economies of scale and no reuse. Each project team built single use data, not multipurpose data that could be shared and reused.
The approach to using standard and reusable parts requires a long-term view of product development costs. The initial cost for building standard components is expensive, but it’s justified in reduced delivery costs through reuse in future projects. The key is understanding which components should be built for reuse and which parts are unique and are necessary for differentiation. Any organization that takes this approach invests in staff resources that focus on identifying standard components and reviewing designs to ensure the maximum use of standard parts. Success is also dependent on communicating across the numerous teams to ensure they are aware of the latest standard parts, methods, and practices.
The building of reusable code and reusable data requires a long-term view and an understanding of the processing functions and data that can be shared across projects. This approach isn’t dependent on specific tooling; it’s about having the development methods and staff focused on ensuring that reuse is a mandatory requirement. Data Pipelining is indeed a powerful approach; however, without the necessary development methods and practices, the creation of reusable code and data won’t occur.
There’s nearly universal agreement within most companies that all development efforts should generate reusable artifacts. Unfortunately, the reality is that this concept gets more lip service than attention. While most companies have lots of tools available to support the sharing of code and data, few companies invest in their staff members to support such techniques. It’s rare that I’ve seen any organization identify staff members that are tasked with establishing data standards and require the review of development artifacts to ensure the sharing and reuse of code and data. Even fewer organizations have the data development methods that ensure collaboration and sharing occur across teams. Everyone has collaboration tools, but the methods and practices to utilize them to support reuse aren’t promoted (and often don’t even exist).
The automobile industry learned that building cars in a custom manner wasn’t cost effective; using standard parts became a necessity. While most business and technology executives agree that reusable code and shared data are a necessity, few realize that their analytics teams address each data project in a custom, build-from-scratch manner. I wonder if the executives responsible for data and analytics have ever considered measuring (or analyzing) how much data reuse actually occurs?
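One way to start answering that question is to inventory which teams load which source systems and count the overlap. The following is a minimal sketch in Python, not an established method: the `(team, source)` inventory format, the function name `source_reuse_report`, and the "redundant loads" metric are all assumptions made for illustration.

```python
from collections import defaultdict


def source_reuse_report(loads):
    """Summarize how often each source system is loaded independently.

    `loads` is an iterable of (team, source) pairs, one per pipeline that
    ingests a source system. The input shape and the metric are assumptions
    made for this sketch.
    """
    teams_by_source = defaultdict(set)
    for team, source in loads:
        teams_by_source[source].add(team)

    # A source loaded by more than one team is a candidate for a shared feed.
    duplicated = {
        source: sorted(teams)
        for source, teams in teams_by_source.items()
        if len(teams) > 1
    }
    total_loads = sum(len(teams) for teams in teams_by_source.values())
    # Loads that would disappear if every source were ingested once and shared.
    redundant_loads = total_loads - len(teams_by_source)
    return {
        "duplicated_sources": duplicated,
        "total_loads": total_loads,
        "redundant_loads": redundant_loads,
    }


# Hypothetical inventory: two of three teams load the same billing system.
inventory = [
    ("marketing", "billing"),
    ("finance", "billing"),
    ("finance", "general_ledger"),
    ("operations", "orders"),
]
report = source_reuse_report(inventory)
```

Even a crude count like this makes the conversation concrete: every entry in `duplicated_sources` is a place where a standard, shared extract could replace several custom ones.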
Data Sharing is a Production Need
The concept of “Production” in the area of Information Technology is well understood. It means something (usually an application or system) is ready to support business processing in a reliable manner. Production environments undergo thorough testing to ensure that there’s minimal likelihood of a circumstance where business activities are affected. The Production label isn’t thrown around recklessly; if a system is characterized as Production, there are lots of business people dependent on those systems to get their job done.
In order to support Production, most IT organizations have devoted resources focused solely on maintaining Production systems to ensure that any problem is addressed quickly. When user applications are characterized as Production, there are special processes (and manpower) in place to address installation, training, setup, and ongoing support. Production systems are business critical to a company.
One of the challenges in the world of data is that most IT organizations view their managed assets as storage, systems, and applications. Data is treated not as an asset, but as a byproduct of an application. Data storage is managed based on application needs (online storage, archival, backup, etc.) and data sharing is handled as a one-off activity. This might have made sense in the 1970s and 1980s when most systems were vendor specific and sharing data was rare; however, in today’s world of analytics and data-driven decision making, data sharing has become a necessity. We know that every time data is created, there are likely 10-12 business activities requiring access to that data.
Data sharing is a production business need.
Unfortunately, data sharing in most companies is handled as a one-off, custom event. Getting a copy of data often requires tribal knowledge, relationships, and a personal request. While there’s no arguing that many companies have data warehouses (or data marts, data lakes, etc.), adding new data to those systems is where I’m focused. Adding new data or integrating 3rd party content into a report takes a long time because data sharing is always an afterthought.
Think I’m exaggerating or incorrect? Ask yourself the following questions…
- Is there a documented list of data sources, their content, and a description of the content at your company?
- Do your source systems generate standard extracts, or do they generate hundreds (or thousands) of nightly files that have been custom built to support data sharing?
- How long does it take to get a copy of data (that isn’t already loaded on the data warehouse)?
- Is there anyone to contact if you want to get a new copy of data?
- Is anyone responsible for ensuring that the data feeds (or extracts) that currently exist are monitored and maintained?
While most IT organizations have focused their code development efforts on reuse, economies-of-scale, and reliability, they haven’t focused their data development efforts in that manner. And one of the most visible challenges is that many IT organizations don’t have a funding model to support data development and data sharing as a separate discipline. They’re focused on building and delivering applications, not building and delivering data. Supporting data sharing as a production business need means adjusting IT responsibilities and priorities to reflect data sharing as a responsibility. This means making sure there are standard extracts (or data interfaces) that everyone can access, data catalogs available containing source system information, and staff resources devoted to sharing and supporting data in a scalable, reliable, and cost-efficient manner. It’s about having an efficient data supply chain to share data within your company. It’s because data sharing is a production business need.
Or, you could continue building everything in a one-off custom manner.
Shadow IT: Déjà Vu All Over Again
I’m a bit surprised with all of the recent discussion and debate about Shadow IT. For those of you not familiar with the term, Shadow IT refers to software development and data processing activities that occur within business unit organizations without the blessing of the Central IT organization. The idea of individual business organizations purchasing technology, hiring staff members, and taking on software development to address specific business priorities isn’t a new concept; it’s been around for 30 years.
When it comes to the introduction of technology to address or improve business process, communications, or decision making, Central IT has traditionally not been the starting point. It’s almost always been the business organization. Central IT has never been in the position of reengineering business processes or insisting that business users adopt new technologies; that’s always been the role of business management. Central IT is in the business of automating defined business processes and reducing technology costs (through the use of standard tools, economies-of-scale methods, commodity technologies). It’s not as though Shadow IT came into existence to usurp the authority or responsibilities of the IT organization. Shadow IT came into existence to address new, specialized business needs that the Central IT organization was not responsible for addressing.
Here are a few examples of information technologies that were introduced and managed by Shadow IT organizations to address specialized departmental needs.
- Word Processing. Possibly the first “end user system” (Wang, IBM DisplayWrite, etc.). This solution was revolutionary in reducing the cost of documentation.
- The minicomputer. This technology revolution of the 1970s and 1980s delivered packaged, departmental application systems (DEC, Data General, Prime, etc.). The most popular were HR, accounting, and manufacturing applications.
- The personal computer. Many companies created PC support teams (in Finance) because they required unique skills that didn’t exist within most companies.
- Email, File Servers, and Ethernet (remember Banyan, Novell, 3Com?). These tools worked outside the mainframe OLTP environment and required specialized skills.
- Data Marts and Data Warehouses. Unless you purchased a product from IBM, the early products were often purchased and managed by marketing and finance.
- Business Intelligence tools. Many companies still manage analytics and report development outside of Central IT.
- CRM and ERP systems. While both of these packages required Central IT hardware platforms, the actual application systems are often supported by separate teams positioned within their respective business areas.
The success of Shadow IT is based on its ability to respond to specialized business needs with innovative solutions. The technologies above were all introduced to address specific departmental needs; they evolved to deliver more generalized capabilities that could be valued by the larger corporate audience. The larger audience required the technology’s ownership and support to migrate from the Shadow IT organization to Central IT. Unfortunately, most companies were ill prepared to support the transition of technology between the two different technology teams.
Most Central IT teams bristle at the idea of inheriting a Shadow IT project. There are significant costs associated with transitioning a project to a different team and a larger user audience. This is why many Central IT teams push for Shadow IT to adopt their standard tools and methods (or for the outright dissolution of Shadow IT). Unfortunately, applying low-cost, standardized methods to deploy and support a specialized, high-value solution doesn’t work (if it did, it would have been used in the first place). You can’t expect to solve specialized needs with a one-size-fits-all approach.
A Shadow IT team delivers dozens of specialized solutions to their business user audience; the likelihood that any solution will be deployed to a larger audience is very small. While it’s certainly feasible to modify the charter, responsibilities, and success metrics of a Centralized IT organization to support both specialized unique and generalized high volume needs, I think there’s a better alternative: establish a set of methods and practices to address the infrequent transition of Shadow IT projects to Central IT. Both organizations should be obligated to work with and respond to the needs and responsibilities of the other technology team.
Most companies have multiple organizations with specific roles to address a variety of different activities. And organizations are expected to cooperate and work together to support the needs of the company. Why is it unrealistic to have Central IT and Shadow IT organizations with different roles to address the variety of (common and specialized) needs across a company?
My Dog Ate the Requirements, Part 2
There’s nothing more frustrating than not being able to rely upon a business partner. There are lots of business books about information technology that espouse the importance of Business/IT alignment and of establishing business users as IT stakeholders. The whole idea of delivering business value with data and analytics is to provide business users with tools and data that can support business decision making. It’s incredibly hard to deliver business value when half of the partnership isn’t stepping up to its responsibilities.
There’s never a shortage of rationale as to why requirements haven’t been collected or recorded. In order for a relationship to be successful, both parties have to participate and cooperate. Gathering and recording requirements isn’t possible if the technologist doesn’t meet with the users to discuss their needs, pains, and priorities. Conversely, the requirements process won’t succeed if the users won’t participate. My last blog reviewed the excuses that technologists offered for explaining the lack of documented requirements; this week’s blog focuses on remarks I’ve heard from business stakeholders.
- “I’m too busy. I don’t have time to talk to developers”
- “I meet with IT every month, they should know my requirements”
- “IT isn’t asking me for requirements, they want me to approve SQL”
- “We sent an email with a list of questions. What else do they need?”
- “They have copies of reports we create. That should be enough.”
- “The IT staff has worked here longer than I have. There’s nothing I can tell them that they don’t already know”
- “I’ve discussed my reporting needs in 3 separate meetings; I seem to be educating someone else with each successive discussion”
- “I seem to answer a lot of questions. I don’t ever see anyone writing anything down”
- “I’ll meet with them again when they deliver the requirements I identified in our last discussion.”
- “I’m not going to sign off on the requirements because my business priorities might change – and I’ll need to change the requirements.”
Requirements gathering is really a beginning stage for negotiating a contract for the creation and delivery of new software. The contract is closed (or agreed to) when the business stakeholders agree to (or sign-off on) the requirements document. While many believe that requirements are an IT-only artifact, they’re really a tool to establish responsibilities of both parties in the relationship.
A requirements document defines the data, functions, and capabilities that the technologist needs to build to deliver business value. The requirements document also establishes the “product” that will be deployed and used by the business stakeholders to support their business decision making activities. The requirements process holds both parties accountable: technologists to build and business stakeholders to use. When two organizations can’t work together to develop requirements, it’s often a reflection of a bigger problem.
It’s not fair for business stakeholders to expect development teams to build commercial grade software if there’s no participation in the requirements process. By the same token, it’s not right for technologists to build software without business stakeholder participation. If one stakeholder doesn’t want to participate in the requirements process, they shouldn’t be allowed to offer an opinion about the resulting deliverable. If multiple stakeholders don’t want to participate in a requirements activity, the development process should be cancelled. Lack of business stakeholder participation means they have other priorities; the technologists should take a hint and work on their other priorities.
Advanced Data Virtualization Capabilities
In one of my previous blogs, I wrote about Data Virtualization technology — one of the more interesting pieces of middleware technology that can simplify data management. While most of the commercial products in this space share a common set of features and functions, I thought I’d devote this blog to discussing the more advanced features. There are quite a few competing products; the real challenge in differentiating the products is to understand their more advanced features.
The attraction of data virtualization is that it simplifies data access. Most IT shops have one of everything – and this includes several different brands of commercial DBMSs, a few open source databases, a slew of BI/reporting tools, and the inevitable list of emerging and specialized tools and technologies (Hadoop, Dremel, Cassandra, etc.). Supporting all of the client-to-server-to-repository interfaces (and the associated configurations) is both complex and time consuming. This is why the advanced capabilities of Data Virtualization have become so valuable to the IT world.
The following details aren’t arranged in any particular order. I’ve identified the ones that I’ve found to be the most valuable (and interesting). Let me also acknowledge not every DV product supports all of these features.
Intelligent data caching. Repository-to-DV Server data movement is the biggest obstacle in query response time. Most DV products are able to support static caching to reduce repetitive data movement (data is copied and persisted in the DV Server). Unfortunately, this approach has limited success when there are ad hoc users accessing dozens of sources and thousands of tables. The more effective solution is for the DV Server to monitor all queries and dynamically cache data based on user access, query load, and table (and data) access frequency.
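To make the difference between static and dynamic caching concrete, here is a minimal sketch in Python of a cache that decides what to keep based on observed access frequency. It is a toy illustration, not any vendor’s implementation: the class name `FrequencyCache`, the `min_hits` and `capacity` parameters, and the eviction rule are all assumptions.

```python
from collections import Counter


class FrequencyCache:
    """Toy stand-in for a DV server cache; not any vendor's implementation."""

    def __init__(self, fetch, min_hits=2, capacity=2):
        self.fetch = fetch          # callable that pulls a table from its source repository
        self.min_hits = min_hits    # accesses required before a table is cached
        self.capacity = capacity    # maximum number of cached tables
        self.hits = Counter()       # access count per table
        self.cached = {}            # table name -> cached rows

    def query(self, table):
        self.hits[table] += 1
        if table in self.cached:
            return self.cached[table]   # served locally, no data movement
        rows = self.fetch(table)        # repository-to-server transfer
        if self.hits[table] >= self.min_hits:
            self._admit(table, rows)
        return rows

    def _admit(self, table, rows):
        if len(self.cached) >= self.capacity:
            # Find the cached table with the fewest recorded accesses.
            coldest = min(self.cached, key=lambda name: self.hits[name])
            if self.hits[coldest] >= self.hits[table]:
                return                  # newcomer is not hotter; keep the cache as is
            del self.cached[coldest]
        self.cached[table] = rows


# Hypothetical usage: record every trip to the source repository.
fetch_log = []


def fetch_from_source(table):
    fetch_log.append(table)
    return [f"{table}-row"]


cache = FrequencyCache(fetch_from_source, min_hits=2, capacity=1)
for name in ["customers", "customers", "customers", "orders"]:
    cache.query(name)
```

In this run the third request for `customers` is answered from the cache, while the single request for `orders` is not worth caching. A real DV server would also weigh query load and data volumes and would need to refresh or invalidate cached data, which this sketch ignores.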
Query optimization (w/multi-platform execution). While all DV products claim some amount of query optimization, it’s important to know the details. There are lots of tricks and techniques; however, look for optimization that understands source data volumes, data distribution, and data movement latency, and that is able to process data on any source platform.
Support for multiple client Interfaces. Since most companies have multiple database products, it can be cumbersome to support and maintain multiple client access configurations. The DV server can act as a single access point for multiple vendor products (a single ODBC interface can replace drivers for each DBMS brand). Additionally, most DV Server drivers support multiple different access methods (ODBC, JDBC, XML, and web services).
Attribute level or value specific data security. This feature supports data security at a much lower granularity than is typically available with most DBMS products. Data can be protected (or restricted) at the level of individual column values, either for an entire table or for selected rows.
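As an illustration of what value-level protection means in practice, the sketch below (Python) masks a single column value only for the rows a rule applies to, and only for users who lack the required role. The policy format, the function name `apply_column_policies`, and the sample roles are assumptions made for this example; real DV products define such rules in their own configuration.

```python
def apply_column_policies(rows, policies, user_roles):
    """Return copies of `rows` with restricted values masked.

    `policies` maps a column name to (allowed_roles, row_predicate). A value is
    masked when the predicate matches the row and the user holds none of the
    allowed roles. The policy shape is an assumption made for this sketch.
    """
    protected = []
    for row in rows:
        visible = dict(row)  # copy so the source rows are never modified
        for column, (allowed_roles, applies_to) in policies.items():
            if column in visible and applies_to(row) and not (user_roles & allowed_roles):
                visible[column] = None  # mask just this value, keep the rest of the row
        protected.append(visible)
    return protected


customers = [
    {"name": "Ann", "region": "EU", "credit_limit": 5000},
    {"name": "Bob", "region": "US", "credit_limit": 9000},
]
# Hypothetical rule: credit limits of EU customers are visible only to the "finance" role.
policies = {"credit_limit": ({"finance"}, lambda row: row["region"] == "EU")}

analyst_view = apply_column_policies(customers, policies, user_roles={"analyst"})
finance_view = apply_column_policies(customers, policies, user_roles={"finance"})
```

Here the analyst still sees both customers and Bob’s credit limit, but Ann’s credit limit is masked; the finance user sees everything.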
Metadata tracking and management. Since Data Virtualization is a query-centric middleware environment, it only makes sense to position this server to retrieve, reconcile, and store metadata content from multiple, disparate data repositories.
Data lineage. This item works in tandem with the metadata capability and augments the information by retaining the source details for all data that is retrieved. This not only includes source id information for individual records but also the origin, creation date, and native attribute details.
Query tracking for usage audit. Because the DV Server can act as a centralized access point for user tool access, there are several DV products that support the capture and tracking of all submitted queries. This can be used to track, measure, and analyze end user (or repository) access.
Workflow linkage and processing. This is the ability to execute predefined logic against specific data that is retrieved. While this concept is similar to a macro or stored procedure, it’s much more sophisticated. It could include the ability to direct job control or specialized processing against an answer set prior to delivery (e.g. data hygiene, external access control, stewardship approval, etc.).
Packaged Application Templates. Most packaged applications (CRM, ERP, etc.) contain thousands of tables and columns that can be very difficult to understand and query. Several DV vendors have developed templates containing predefined DV server views that access the most commonly queried data elements.
Setup and Configuration Wizards. Configuring a DV server to access the multiple data sources can be a very time consuming exercise; the administrator needs to define and configure every source repository, the underlying tables (or files), along with the individual data fields. To simplify setup, a configuration wizard reviews the dictionary of an available data source and generates the necessary DV Server configuration details. It further analyzes the table and column names to simplify naming conventions, joins, and data value conversion and standardization details.
Don’t be misled into thinking that Data Virtualization is a highly mature product space where all of the products are nearly identical. They aren’t. Most product vendors spend more time discussing their unique features than offering metrics about their core features. It’s important to remember that every Data Virtualization product requires a server that retrieves and processes data to fulfill query requests. This technology is not a commodity, which means that details like setup/configuration time, query performance, and advanced features can vary dramatically across products. Benchmark and test drive the technology before buying.
Role of an Executive Sponsor
It’s fairly common for companies to assign Executive Sponsors to their large projects. “Large” typically reflects budget size, the inclusion of cross-functional teams, business impact, and complexity. The Executive Sponsor isn’t the person running and directing the project on a day-to-day basis; the sponsor provides oversight and direction. The sponsor monitors project progress and ensures that tactics are carried out to support the project’s goals and objectives. The sponsor has the credibility (and authority) to ensure that the appropriate level of attention and resources is available to the project throughout its entire life.
While there’s nearly universal agreement on the importance of an Executive Sponsor, there seems to be limited discussion about the specifics of the role. Most remarks seem to dwell on the importance of breaking down barriers, dealing with roadblocks, and effectively reacting to project obstacles. While these details make for good PowerPoint presentations, project success really requires the sponsor to exhibit a combination of skills beyond negotiation and problem resolution. Here’s my take on some of the key responsibilities of an Executive Sponsor.
Inspire the Stakeholder Audience
Most executives are exceptional managers that understand the importance of dates and budgets and are successful at leading their staff members towards a common goal. Because project sponsors don’t typically have direct management authority over the project team, the methods for leadership are different. The sponsor has to communicate, captivate, and engage with the team members throughout all phases of the project. And it’s important to remember that the stakeholders aren’t just the individual developers, but the users and their management. In a world where individuals have to juggle multiple priorities and projects, one sure-fire way to maintain enthusiasm (and participation) is to maintain a high-level of sponsor engagement.
Understand the Project’s Benefits
Because of the compartmentalized structure of most organizations, many executives aren’t aware of the details of their peer organizations. Enterprise-level projects enlist an Executive Sponsor to ensure that the project respects (and delivers) benefits to all stakeholders. It’s fairly common that any significantly sized project will undergo scope change (due to budget challenges, business risks, or execution problems). Any change will likely affect the project’s deliverables as well as the perceived benefits to the different stakeholders. Detailed knowledge of project benefits is crucial to ensure that any change doesn’t adversely affect the benefits required by the stakeholders.
Know the Project’s Details
Most executives focus on the high-level details of their organization’s projects and delegate the specifics to the individual project manager. When projects cross organizational boundaries, the executive’s tactics have to change because of the organizational breadth of the stakeholder community. Executive level discussions will likely cover a variety of issues (both high-level and detailed). It’s important for the Executive Sponsor to be able to discuss the brass tacks with other executives; the lack of this knowledge undermines the sponsor’s credibility and project’s ability to succeed.
Hold All Stakeholders Accountable
While most projects begin with everyone aligned towards a common goal and set of tactics, it’s not uncommon for changes to occur. Most problems occur when one or more stakeholders have to adjust their activities because of an external force (new priorities, resource contention, etc.). What’s critical is that all stakeholders participate in resolving the issue; the project team will either succeed together or fail together. The sponsor won’t solve the problem; they will facilitate the process and hold everyone accountable.
Stay Involved, Long Term
The role of the sponsor isn’t limited to supporting the early stages of a project (funding, development, and deployment); it continues throughout the life of the project. Because most applications have a lifespan of no less than 7 years, business changes will drive new business requirements that will drive new development. The sponsor’s role doesn’t diminish with time – it typically expands.
The overall responsibility set of an Executive Sponsor will likely vary across projects. The differences in project scope, company culture, business process, and staff resources across individual projects inevitably affect the role of the Executive Sponsor. What’s important is that the Executive Sponsor provides both strategic and tactical support to ensure a project is successful. An Executive Sponsor is more than the project’s spokesperson; they’re the project CEO that has equity in the project’s outcome and a legitimate responsibility for seeing the project through to success.
Photo “American Alligator Crossing the Road at Canaveral National Seashore” courtesy of Photomatt28 (Matthew Paulson) via Flickr (Creative Commons license).
Underestimating the Project Managers
By Evan Levy
One of the most misunderstood roles on a BI team is the Project Manager. All too often the role is defined as an administrative set of activities focused on writing and maintaining the project plan, tracking the budget, and monitoring task completion. Unfortunately, IT management rarely understands the importance of domain knowledge (having BI experience) and leadership skills.
Assigning a BI project manager who has no prior BI experience is an accident waiting to happen. Think about a homeowner who decides to build a new house and retains a construction company whose foreman has never built a house before. You’d want a foreman with fundamental knowledge of demolition, framing, plumbing, wiring, and so on; without it, the foreman can’t judge whether the work is being done in the right way.
Unfortunately, IT managers think they can position certified project managers on BI teams without any knowledge of BI-specific development processes, business decision-making, data content, or technology. We often find ourselves coaching these project managers on the differences in BI development, or introducing concepts like staging areas or federated queries. This is time that could be better spent transferring knowledge and formalizing development processes with a more seasoned project lead.
In order for a project team to be successful, the project manager should have strong leadership skills. The ability to communicate a common goal and ensure focus is both art and science. But BI project managers often behave more like bureaucrats, requesting task completion percentages and reviewing labor hours. They are rarely invested in whether the project is adhering to development standards, whether permanent staff is preparing to take ownership of the code, or whether the developers are collaborating.
An effective BI project manager should be a project leader. He or she should understand that the definition of success is not a completed project plan or budget spreadsheet, but rather that the project delivers usable data and fulfills requirements. The BI project manager should instill the belief that success doesn’t mean task completion, but delivery against business goals.