I just finished reading an article on data pipelines and how this approach to accessing and sharing data will improve and simplify data access for analytics developers and users. The key tenets of the data pipeline approach include simplifying data access by ensuring that pipelines are visible and reusable, and delivering data that is discoverable, shareable, and usable. The article covered the details of placing the data on a central platform to make it available, using open source utilities to simplify construction, transforming the data to make it usable, and cataloging the data to make it discoverable. The idea is that data should be multipurpose, not single use. The practice of building reusable code that delivers easily identified and easily used source data sets has been around since the 1960s. It’s a great idea, and it’s even simpler now with today’s technologies and methods than it was 50+ years ago.
Reusable components have been in place in the automobile industry for many years. Why create custom nuts, bolts, radios, engines, and transmissions if the function they provide isn’t unique and doesn’t differentiate the overall product? That’s why GM, Ford, and others have standard parts that are used across their numerous products. The parts, their capabilities, and specifications are documented and easily referenceable to ensure they are used as much as possible. They have lots of custom parts too; those are the ones that differentiate the individual products (exterior body panels, bumpers, windshields, seats, etc.). Designing products that maximize the use of standard parts dramatically reduces the cost and expedites delivery. Knowing which parts to standardize is based on identifying common functions (and needs) across products.
It’s fairly common for an analytics team to be self-contained and focused on an individual set of business needs. The team builds software to ingest, process, and load data into a database to suit their specific requirements. While there might be hundreds of data elements that are processed, only those elements specific to the business purpose will be checked for accuracy and fixed. There’s no attention to delivering data that can be used by other project teams, because the team isn’t measured or rewarded on sharing data; they’re measured against a specific set of business value criteria (functionality, delivery time, cost, etc.).
This creates the situation where multiple development teams ingest, process, and load data from the same sources for their individual projects. They all function independently and aren’t aware of the other teams’ activities. I worked with a client that had 14 different development teams each loading data from the same source system. They didn’t know what each other was doing, nor were they aware that there was any overlap. While data pipelining technology may have helped this client, the real challenge wasn’t tooling; it was the lack of a methodology focused on sharing and reuse. Every data development effort was a custom endeavor; there were no economies of scale and no reuse. Each project team built single-use data, not multipurpose data that could be shared and reused.
The approach to using standard and reusable parts requires a long-term view of product development costs. The initial cost for building standard components is expensive, but it’s justified in reduced delivery costs through reuse in future projects. The key is understanding which components should be built for reuse and which parts are unique and are necessary for differentiation. Any organization that takes this approach invests in staff resources that focus on identifying standard components and reviewing designs to ensure the maximum use of standard parts. Success is also dependent on communicating across the numerous teams to ensure they are aware of the latest standard parts, methods, and practices.
The building of reusable code and reusable data requires a long-term view and an understanding of the processing functions and data that can be shared across projects. This approach isn’t dependent on specific tooling; it’s about having the development methods and staff focused on ensuring that reuse is a mandatory requirement. Data Pipelining is indeed a powerful approach; however, without the necessary development methods and practices, the creation of reusable code and data won’t occur.
There’s nearly universal agreement within most companies that all development efforts should generate reusable artifacts. Unfortunately, the reality is that this concept gets more lip service than attention. While most companies have lots of tools available to support the sharing of code and data, few companies invest in their staff members to support such techniques. I’ve rarely seen an organization identify staff members who are tasked with establishing data standards, or require the review of development artifacts to ensure the sharing and reuse of code and data. Even fewer organizations have the data development methods that ensure collaboration and sharing occur across teams. Everyone has collaboration tools, but the methods and practices for using them to support reuse aren’t promoted (and often don’t even exist).
The automobile industry learned that building cars in a custom manner wasn’t cost effective; using standard parts became a necessity. While most business and technology executives agree that reusable code and shared data are necessities, few realize that their analytics teams address each data project in a custom, build-from-scratch manner. I wonder if the executives responsible for data and analytics have ever considered measuring (or analyzing) how much data reuse actually occurs?
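Measuring reuse doesn’t have to be complicated. Here’s a minimal sketch of the idea, assuming you can assemble a simple inventory of which team loads data from which source system. The team names, source names, and the `redundancy_ratio` measure are all hypothetical and only there for illustration; a real inventory would likely come from job schedulers or a data catalog.

```python
from collections import defaultdict

# Hypothetical inventory of (team, source system) ingestion jobs.
ingestion_jobs = [
    ("marketing", "crm"),
    ("finance", "crm"),
    ("operations", "crm"),
    ("finance", "billing"),
    ("operations", "inventory"),
]


def reuse_report(jobs):
    """Return the sources that more than one team loads independently."""
    teams_by_source = defaultdict(set)
    for team, source in jobs:
        teams_by_source[source].add(team)
    return {
        source: sorted(teams)
        for source, teams in teams_by_source.items()
        if len(teams) > 1
    }


def redundancy_ratio(jobs):
    """Share of ingestion jobs that reload a source another job already loads."""
    unique_pairs = set(jobs)
    if not unique_pairs:
        return 0.0
    sources = {source for _, source in unique_pairs}
    return (len(unique_pairs) - len(sources)) / len(unique_pairs)


duplicated = reuse_report(ingestion_jobs)
ratio = redundancy_ratio(ingestion_jobs)
print(duplicated)
print(f"{ratio:.0%} of ingestion jobs reload a source that is already being loaded")
```

Even a rough number like this gives executives something concrete to discuss: every source that shows up in the report is a candidate for a shared, standard data set.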
I wrote last time about the challenges that companies have in their transition to becoming data driven. Much has been written about the need for the business audience to embrace change. I thought I’d spend a few words discussing the other participant in a company’s data-driven transition: the Information Technology (IT) organization.
One of the issues that folks rarely discuss is that many IT organizations haven’t positioned themselves to support a data-driven culture. While most have spent a fortune on technology, the focus is always about installing hardware, building platforms, acquiring software, developing architectures, and delivering applications. IT environments focus on streamlining the construction and maintenance of systems and applications. While this is important, that’s only half the solution for a data-driven organization. A data-driven culture (or philosophy) requires that all of a company’s business data is accessible and usable. Data has to be packaged for sharing and use.
Part of the journey to becoming data-driven is ensuring that there’s a cultural adjustment within IT to support the delivery of applications and data. It’s not just about dropping data files onto servers for users to copy. It’s about investing in the necessary methods and practices to ensure that data is available and usable (without requiring lots of additional custom development).
Some of the indicators that your IT organization isn’t prepared or willing to be data-driven include:
- There’s no identified Single Version Of Truth (SVOT).
There should be one place where the data is stored. While this is obvious, the lack of a single agreed-upon data location creates the opportunity to have multiple data repositories and multiple (and conflicting) sets of numbers. Time is wasted disputing accuracy instead of being focused on business analysis and decision making.
- Data sharing is a courtesy, not an obligation.
How can a company be data driven if finding and accessing data requires multiple meetings and multiple approvals for every request? If we’re going to run the business by the numbers, we can’t waste time begging or pleading for data from the various system owners. Every application system should have two responsibilities: processing transactions and sharing data.
- There’s no investment in data reuse.
The whole idea of technology reuse has been a foundational philosophy for IT for more than 20 years: build once, use often. While most IT organizations have embraced this for application development, it’s often overlooked for data. Unfortunately, data sharing activities are often built as a one-off, custom endeavor. Most IT teams manage hundreds or thousands of file extract programs (for a single system) and have no standard for moving data packets between applications. There’s no reuse; every new request gets its own extract or service/connection.
- Data accuracy and data correction are not a responsibility.
Most IT organizations have invested in data quality tools to address data correction and accuracy, but few ever use them. It’s surprising to me that any shop allows new development to occur without requiring the inclusion of a data inspection and correction process. How can a business person become data driven if they can’t trust the data? How can you expect them to change if IT hasn’t invested in fixing the data when it’s created (or at least shared)?
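To make the idea of a data inspection and correction process a little more concrete, here’s a minimal sketch of what such a step might look like when a record is created or shared. The field names (`email`, `order_total`) and the rules are assumptions I’ve made up for illustration; a real process would be driven by the business’s own data standards and, most likely, by the data quality tools already sitting on the shelf.

```python
def inspect_and_correct(record):
    """Return (cleaned_record, issues) for one hypothetical customer record."""
    cleaned = dict(record)
    issues = []

    # Correction: trim and lowercase the email, a fix that is safe to automate.
    raw_email = cleaned.get("email") or ""
    email = raw_email.strip().lower()
    if email != raw_email:
        issues.append("email: normalized")
    cleaned["email"] = email

    # Inspection: flag values that cannot be fixed automatically.
    if "@" not in email:
        issues.append("email: invalid")

    amount = cleaned.get("order_total")
    if not isinstance(amount, (int, float)) or amount < 0:
        issues.append("order_total: invalid")

    return cleaned, issues


cleaned, issues = inspect_and_correct(
    {"email": "  Pat@Example.COM ", "order_total": -5}
)
print(cleaned)
print(issues)
```

The point isn’t the specific rules; it’s that the safe fixes get applied and everything else gets reported where the data is created or shared, not discovered later by a frustrated business user.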
It’s important to consider that enabling IT to support a data-driven transition isn’t realistic without investment. You can’t expect staff members that are busy with their existing duties to absorb additional responsibilities (after all, most IT organizations have a backlog). If a company wants to transition to a data-driven philosophy, you have to allow the team members to learn new skills to support the additional activities. And, there needs to be staff members available to do the work.
There’s only one reason to transition to being a data-driven organization: more profit, more productivity, and more business success. Consequently, there should be funds available to allow IT to support the transition.
I just read this article by Ethan Knox, “Is Your Company Too Dumb to be Data Driven” and was intrigued to read what many people have discussed for years. I’ve spent nearly half my career helping clients make the transition from running the business by tribal knowledge and gut instinct to running the business by facts and numbers. It’s a hard transition. One that takes vision, motivation, discipline, and courage to change. It also takes a willingness to learn something new.
While this article covers a lot of ground, I wanted to comment on one of the points made in the article: the mistake of “build it and they will come”. This occurs when an organization is enthusiastic about data and decides to build a data warehouse (or data lake) and load it with all the data from the company’s core application systems (sales, finance, operations, etc.). The whole business case depends on the users flocking to the system, using new business intelligence or reporting tools, and uncovering numerous high value business insights. All too often, the result is a large monolithic data platform that contains lots of content but hasn’t been designed to support analysis or decision making by the masses.
There are numerous problems with this approach – and the path to data and analytics enlightenment is littered with mistakes where companies took this approach. Don’t assume that successful companies that have embraced data and analytics didn’t make this mistake (it’s a very common mistake). Successful companies were those that were willing to learn from their mistakes – and have a culture where new project efforts are carefully scoped to allow mistakes, learning, and evolution. It’s not that they’re brilliant; successful companies understand that transitioning to being a data-driven company requires building knowledge. And, the process of learning takes time, includes mistakes, requires self-analysis, and must be managed and mentored carefully. They design their projects assuming mistakes and surprises occur, so they fail fast and demand continual measurement and corrective action. It’s not about the methodology or development approach. A fail-fast philosophy can work with any type of development methodology (agile, iterative, waterfall). The path to data enlightenment will include lots of mistakes.
Do you remember high school math? When you were presented with a new concept, you were given homework that allowed you to learn, gain experience, and understand the concept through the act of “doing”. Homework was often graded based on effort, not accuracy (if you did it, you got credit, whether or not it was correct). Where is it written that (upon graduation) learning something new wouldn’t require the act of “doing” and making mistakes to gain enlightenment? By the way, who has ever succeeded without making mistakes?
The point the article frequently references is that business engagement is critical. It’s not about the users participating a few times (requirements gathering and user acceptance testing); it’s about users being engaged to review results and participate in the measurement and corrective action. It’s about evolving from a culture where the relationship is customer/provider to a team where everyone succeeds or fails based on business measurement.
It’s not that a company is too dumb to succeed with data; it’s that they’re often too fearful of mistakes to succeed. And in the world of imperfect data, exploding data volumes, frequent technology changes, and a competitive business environment, mistakes are an indication of learning. Failure isn’t a reflection of mistakes; it’s a reflection of poor planning, lack of measurement, and an inability to take corrective action.
In order to prepare for the cooking gauntlet that often occurs with the end of year holiday season, I decided to purchase a new rotisserie oven. The folks at Acme Rotisserie include a large amount of documentation with their rotisserie. I reviewed the entire pile and was a bit surprised by the warranty registration card. The initial few questions made sense: serial number, place of purchase, date of purchase, my home address. The other questions struck me as a bit too inquisitive: number of household occupants, household income, own/rent my residence, marital status, and education level. Obviously, this card was a Trojan horse of sorts; provide registration details – and all kinds of other personal information. They wanted me to give away my personal information so they could analyze it, sell it, and make money off of it.
Companies collecting and analyzing consumer data isn’t anything new – it’s been going on for decades. In fact, there are laws in place to protect consumers’ data in quite a few industries (healthcare, telecommunications, and financial services). Most of the laws focus on protecting the information that companies collect based on their relationship with you. It’s not just the details that you provide to them directly; it’s the information that they gather about how you behave and what you purchase. Most folks believe behavioral information is more valuable than the personal descriptive information you provide. The reason is simple: you can offer creative (and highly inaccurate) details about your income, your education level, and the car you drive. You can’t really lie about your behavior.
I’m a big fan of sharing my information if it can save me time, save me money, or generate some sort of benefit. I’m willing to share my waist size, shirt size, and color preferences with my personal shopper because I know they’ll contact me when suits or other clothing that I like is available at a good price. I’m fine with a grocer tracking my purchases because they’ll offer me personalized coupons for those products. I’m not okay with the grocer selling that information to my health insurer. Providing my information to a company to enhance our relationship is fine; providing my information to a company so they can share, sell, or otherwise unilaterally benefit from it is not fine. My data is proprietary and my intellectual property.
Clearly, companies view consumer data as a highly valuable asset. Unfortunately, we’ve created a situation where there’s little or no cost to retain, use, or abuse that information. As abuse and problems have occurred within certain industries (financial services, healthcare, and others), we’ve created legislation to force companies to responsibly invest in the management and protection of that information. They have to contact you to let you know they have your information and allow you to update communications and marketing options. It’s too bad that every company with your personal information isn’t required to behave in the same way. If data is so valuable that a company retains it, requiring some level of maintenance (and responsibility) shouldn’t be a big deal.
It’s really too bad that companies with copies of my personal information aren’t required to contact me to update and confirm the accuracy of all of my personal details. That would ensure that all of the specialized big data analytics that are being used to improve my purchase experiences were accurate. If I knew who had my data, I could make sure that my preferences were up to date and that the data was actually accurate.
It’s unfortunate that Acme Rotisserie isn’t required to contact me to confirm that I have 14 children, an advanced degree in swimming pool construction, and a red Ferrari in my garage. It will certainly be interesting to see the personalized offers I receive for the upcoming Christmas shopping season.
There’s nothing more frustrating than not being able to rely upon a business partner. There are lots of business books about information technology that espouse the importance of Business/IT alignment and the importance of establishing business users as IT stakeholders. The whole idea of delivering business value with data and analytics is to provide business users with tools and data that can support business decision making. It’s incredibly hard to deliver business value when half of the partnership isn’t stepping up to their responsibilities.
There’s never a shortage of rationale as to why requirements haven’t been collected or recorded. In order for a relationship to be successful, both parties have to participate and cooperate. Gathering and recording requirements isn’t possible if the technologist doesn’t meet with the users to discuss their needs, pains, and priorities. Conversely, the requirements process won’t succeed if the users won’t participate. My last blog reviewed the excuses that technologists offered for explaining the lack of documented requirements; this week’s blog focuses on remarks I’ve heard from business stakeholders.
- “I’m too busy. I don’t have time to talk to developers”
- “I meet with IT every month, they should know my requirements”
- “IT isn’t asking me for requirements, they want me to approve SQL”
- “We sent an email with a list of questions. What else do they need?”
- “They have copies of reports we create. That should be enough.”
- “The IT staff has worked here longer than I have. There’s nothing I can tell them that they don’t already know”
- “I’ve discussed my reporting needs in 3 separate meetings; I seem to be educating someone else with each successive discussion”
- “I seem to answer a lot of questions. I don’t ever see anyone writing anything down”
- “I’ll meet with them again when they deliver the requirements I identified in our last discussion.”
- “I’m not going to sign off on the requirements because my business priorities might change – and I’ll need to change the requirements.”
Requirements gathering is really a beginning stage for negotiating a contract for the creation and delivery of new software. The contract is closed (or agreed to) when the business stakeholders agree to (or sign-off on) the requirements document. While many believe that requirements are an IT-only artifact, they’re really a tool to establish responsibilities of both parties in the relationship.
A requirements document defines the data, functions, and capabilities that the technologist needs to build to deliver business value. The requirements document also establishes the “product” that will be deployed and used by the business stakeholders to support their business decision making activities. The requirements process holds both parties accountable: technologists to build and business stakeholders to use. When two organizations can’t work together to develop requirements, it’s often a reflection of a bigger problem.
It’s not fair for business stakeholders to expect development teams to build commercial grade software if there’s no participation in the requirements process. By the same token, it’s not right for technologists to build software without business stakeholder participation. If one stakeholder doesn’t want to participate in the requirements process, they shouldn’t be allowed to offer an opinion about the resulting deliverable. If multiple stakeholders don’t want to participate in a requirements activity, the development process should be cancelled. Lack of business stakeholder participation means they have other priorities; the technologists should take a hint and work on their other priorities.
I received a funny email the other day about excuses that school children use to explain why they haven’t done their homework. The examples were pretty creative: “my mother took it to be framed”, “I got soap in my eyes and was blinded all night”, and (an oldie but a goodie) – “my dog ate my homework”. It’s a shame that such a creative approach yielded such a high rate of failure. Most of us learn at an early age that you can’t talk your way out of failure; success requires that you do the work. You’d also think that as people got older and more evolved, they’d realize that there are very few shortcuts in life.
I’m frequently asked to conduct best practice reviews of business intelligence and data warehouse (BI/DW) projects. These activities usually come about because either users or IT management is concerned with development productivity or delivery quality. The review activity is pretty straightforward; interviews are scheduled and artifacts are analyzed to review the various phases, from requirements through construction to deployment. It’s always interesting to look at how different organizations handle architecture, code design, development, and testing. One of the keys to conducting a review effort is to focus on the actual results (or artifacts) that are generated during each stage. It’s foolish to discuss someone’s development method or style prior to reviewing the completeness of the artifacts. It’s not necessary to challenge someone’s approach if their artifacts reflect the details required for the other phases.
And one of the most common problems that I’ve seen with BI/DW development is the lack of documented requirements. Zip – zero – zilch – nothing. While discussions about requirements gathering, interview styles, and even document details occur occasionally, it’s the lack of any documented requirements that’s the norm. I can’t imagine how any company allows development to begin without ensuring that requirements are documented and approved by the stakeholders. Believe it or not, it happens a lot.
So, as a tribute to the creative school children of yesterday and today, I thought I would devote this blog to some of the most creative excuses I’ve heard from development teams to justify their beginning work without having requirements documentation.
- “The project’s schedule was published. We have to deliver something with or without requirements”
- “We use the agile methodology, it doesn’t require written requirements”
- “The users don’t know what they want.”
- “The users are always too busy to meet with us”
- “My bonus is based on the number of new reports I create. We don’t measure our code against requirements”
- “We know what the users want, we just haven’t written it down”
- “We’ll document the requirements once our code is complete and testing finished”
- “We can spend our time writing requirements, or we can spend our time coding”
- “It’s not our responsibility to document requirements; the users need to handle that”
- “I’ve been told not to communicate with the business users”
Many of the above items clearly reflect a broken set of management or communication methods. Expecting a development team to adhere to a project schedule when they don’t have requirements is ridiculous. Forcing a team to commit to deliverables without requirements challenges conventional development methods and financial common sense. It also reflects leadership that focuses on schedules and utilization, not business value.
A development team that is asked to build software without a set of requirements is being set up to fail. I’m always astonished that anyone would think they can argue and justify that the lack of documented requirements is acceptable. I guess there are still some folks that believe they can talk their way out of failure.
I presented a webinar a few weeks back that challenged the popular opinion that the only way to be successful with data science was to hire an individual that has a Swiss Army knife of data skills and business acumen. (The archived webinar link is http://goo.gl/Ka1H2I )
While I can’t argue with the value of such abilities, my belief is that these types of individuals are very rare, and the benefits of data science are something that can be valued by every company. Consequently, my belief is that you can approach data science successfully through building a team of focused staff members, provided they cover 5 role areas: Data Services, Data Engineer, Data Manager, Production Development, and the Data Scientist.
I received quite a few questions during and after the August 12th webinar, so I thought I would devote this week’s blog to those questions (and answers). As is always the case with a blog, feel free to comment, respond, or disagree – I’ll gladly post the feedback below.
Q: In terms of benefits and costs, do you have any words of wisdom in building a business case that can be taken to business leadership for funding?
A: Business case constructs vary by company. What I encourage folks to focus on is the opportunity value in supporting a new initiative. Justifying an initial data science initiative shouldn’t be difficult if your company already supports individuals analyzing data on their desktops. We often find that collecting the existing investment numbers and the results of your advanced analytics team (SAS, R, SPSS, etc.) justifies delving into the world of Data Science.
Q: One problem is that many business leaders do not have a concept of what goes into a scientific discovery process. They are not schooled as scientists.
A: You’re absolutely correct. Most managers are focused on establishing business process, measuring progress, and delivering results. Discovery and exploration aren’t always a predictable process. We often find that initial Data Science initiatives are more likely to be successful if the environment has already adopted the value of reporting and advanced analytics (numerical analysis, data mining, prediction, etc.). If your organization hasn’t fully adopted business intelligence and desktop analysis, you may not be ready for Data Science. If your organization already understands the value of detailed data and analysis – you might want to begin with a more focused analytic effort (e.g. identifying trend indicators, predictive details, or other modeling activities). We’ve seen data science deliver significant business value, but it also requires a manager that understands the complexities and issues of data exploration and discovery.
Q: One of the challenges that we’ve seen in my company is the desire to force fit Data Science into a traditional IT waterfall development method instead of realizing the benefits of taking a more iterative or agile approach. Is there danger in this approach?
A: We find that when organizations already have an existing (and robust) business intelligence and analytics environment, there’s a tendency to follow the tried and true practices of defined requirements, documented project plans, managed development, and scheduled delivery. One thing to keep in mind is that the whole premise of Data Science is analyzing data to uncover new patterns or knowledge. When you first undertake a Data Science initiative, it’s about exploration and discovery, not structured deliverables. It’s reasonable to spin up a project team (preferably using an iterative or agile methodology) once the discovery has been identified and there’s tangible business value to build and deploy a solution using the discovery. However, it’s important to allow the discovery to happen first.
You might consider reading an article from DJ Patil (“Building Data Science Teams”) that discusses the importance of having a Production Development role that I mentioned. This is the role that takes on the creation of a production deliverable from the raw artifacts and discoveries made by the Data Science team.
Q: It seems like your Data Engineer has a similar role and responsibility set as a Data Warehouse architect or ETL developer.
A: The Data Engineers are a hybrid of sorts. They handle all of the data transformation and integration activities and they are also deeply knowledgeable of the underlying data sources and the content. We often find that the Data Warehouse Architect and ETL Developer are very knowledgeable about the data structures of source and target systems, but they aren’t typically knowledgeable on social media content, external sources, unstructured data, and the lower details of the specific data attributes. Obviously, these skills vary from organization to organization. If the individuals in your organization are intimate with this level of knowledge, they may be able to cover the activities associated with a Data Engineer.
Q: What is the difference between the Data Engineers and Data Management team members?
A: Data Engineers focus on retrieving and manipulating data from the various data stores (external and internal). They deal with data transformation, correction, and integration. The Data Management folks support the Data Engineers (thus the skill overlap) but focus more on managing and tracking the actual data assets that are going to be used by data scientists and other analysts within the company (understanding the content, the values, the formats, and the idiosyncrasies).
Q: Isn’t there a risk in building a team of folks with specialized skills (instead of having individuals with a broad set of knowledge)? With specialists, don’t we risk freezing the current state of the art, making the organization inflexible to change? Doesn’t it also reduce everyone’s overall understanding of the goal (e.g. the technicians focus on their tools’ functions, not the actual results they’re being expected to deliver)?
A: While I see your perspective, I’d suggest a slightly different view. The premise of defining the various roles is to identify the work activities (and skills) necessary to complete a body of work. Each role should still evolve with skill growth — to ensure individuals can handle more and more complex activities. There will continue to be enormous growth and evolution in the world of Data Science in the variety of external data sources, number of data interfaces, and the variety of data integration tools. Establishing different roles ensures there’s an awareness of the breadth of skills required to complete the body of work. It’s entirely reasonable for an individual to cover multiple roles; however, as the workload increases, it’s very likely that specialization will be necessary to support the added work effort. Henry Ford used the assembly line to revolutionize manufacturing. He was able to utilize less skilled workers to handle the less sophisticated tasks so he could ensure his craftsmen continued to focus on more and more specialized (and complex) activities. Data integration and management activities support (and enable) Data Science. Specialization should be focused on the less complex (and more easily staffed) roles that will free up the Data Scientist’s time to allow them to focus on their core strengths.
Q: Is this intended to be an enterprise-wide team?
A: We’ve seen Data Science teams positioned as an organizational resource (e.g. dedicated to supporting marketing analytics); we’ve also seen teams set up as an enterprise resource. The decision is typically driven by the culture and needs of your company.
Q: Where is the business orientation in the data team? Do you need someone who knows what questions to ask and can then take all of the data and distill it down to insights that a CEO can implement?
A: The “business orientation” usually resides with the Data Scientist role. The Data Science team isn’t typically set up to respond to business user requests (like a traditional BI team); it is usually driven by the Data Scientist, who understands and is tasked with addressing the priority needs of the company. The Data Scientist doesn’t work in a vacuum; they have to interact with key business stakeholders on a regular basis. However, Data Science shouldn’t be structured like a traditional applications development team either. The team is focused on discovery and exploration – not core IT development. Take a look at one of the more popular articles on the topic, “Data Scientist: the sexiest job of the 21st century” by Tom Davenport and DJ Patil http://goo.gl/CmCtv9
Photo courtesy of National Archive via Flickr (Creative Commons license).
I’ve been intrigued with all of the attention that the world of Data Science has received. It seems that every popular business magazine has published several articles and it’s become a mainstream topic at most industry conferences. One of the things that struck me as odd is that there’s a group of folks that actually believe that all of the activities necessary to deliver new business discoveries with data science can be reasonably addressed by finding individuals that have a cornucopia of technical and business skills. One popular belief is that a Data Scientist should be able to address all of the business and technical activities necessary to identify, qualify, prove, and explain a business idea with detailed data.
If you can find individuals that comprehend the peculiarities of source data extraction, have mastered data integration techniques, understand parallel algorithms to process tens of billions of records, have worked with specialized data preparation tools, and can debate your company’s business strategy and priorities – Cool! Hire these folks and chain their leg to the desk as soon as possible.
If you can’t, you might consider building a team that can cover the various roles that are necessary to support a Data Science initiative. There’s a lot more to Data Science than simply processing a pile of data with the latest open source framework. The roles that you should consider include:
Manages the various data repositories that feed data to the analytics effort. This includes understanding the schemas, tracking the data content, and making sure the platforms are maintained. Companies with existing data warehouses, data marts, or reporting systems typically have a group of folks focused on these activities (DBAs, administrators, etc.).
Responsible for developing and implementing tools to gather, move, process, and manage data. In most analytics environments, these activities are handled by the data integration team. In the world of Big Data or Data Science, this isn’t just ETL development for batch files; it also includes processing data streams and handling the cleansing and standardization of numerous structured and unstructured data sources.
Handles the traditional data management or source data stewardship role; the focus is supporting developers’ access to and manipulation of data content. This includes tracking the available data sources (internal and external), understanding the location and underlying details of specific attributes, and supporting developers’ code construction efforts.
Responsible for packaging the Data Scientist’s discoveries into a production-ready deliverable. This may include one or more components: new data attributes, new algorithms, a new data processing method, or an entirely new end-user tool. The goal is to ensure that the discoveries deliver business value.
The team leader and the individual that excels at analyzing data to help a business gain a competitive edge. They are adept at technical activities and equally qualified to lead a business discussion as to the benefits of a new business strategy or approach. They can tackle all aspects of a problem and often lead the interdisciplinary team to construct an analytics solution.
There’s no shortage of success stories about the amazing data discoveries uncovered by Data Scientists. In many of those companies, the Data Scientist didn’t have an incumbent data warehousing or analytics environment; they couldn’t pick up the phone to call a data architect, there wasn’t any metadata documentation, and their company didn’t have a standard set of data management tools. They were on their own. So, the Data Scientist became “chief cook and bottle washer” for everything that is big data and analytics.
Most companies today have institutionalized data analysis; there are multiple data warehouses, lots of dashboards, and even a query support desk. And while there’s a big difference between desktop reporting and processing social media feedback, much of the “behind the scenes” data management and data integration work is the same. If your company already has an incumbent data and analytics environment, it makes sense to leverage existing methods, practices, and staff skills. Let the Data Scientists focus on identifying the next big idea and the heavy analytics; let the rest of the team deal with all of the other work.