Arrayo spoke with industry executives and experts about the keys to success in data strategy. We talked to Chief Data Officers, Chief Digital Officers, CFOs, and experts in financial services, pharma, biotech, and healthcare. Our goal was to answer one simple question: What works?
The people we spoke with come from a variety of organizations — some companies are young, agile startups, while others are mature Fortune 500 corporations.
In our previous research projects, we discussed data culture and “data nirvana,” and how to find the right balance between data defense and data offense. In this third installment, we focus on the DataOps approach. This third series of Arrayo’s cross-industry research project comprises four chapters. In Chapter 1, we discussed the principles of DataOps and why the approach came about. Here, in Chapter 2, we discuss the tools involved in managing data. In the subsequent chapters, we will cover the various components of the data stack (Chapter 3) and some use cases illustrating how teams can thrive with a strong data culture (Chapter 4).
No single tool is absolutely essential, but we would recommend every tool in this article as a stepping stone toward getting ahead of your data problems before they arise. How do your business users find the right data quickly? How is that data governed so your consumers get the most value from it? These are the data management functions that will help you understand your data and take control of it.
Before looking at the various tool categories, let us see what we can expect from a data management toolkit. In a nutshell, these tools help you:
1. Find the data
2. Understand the data
3. Govern the data
4. Extract value from the data
Your first must-have is a data catalog. A good definition comes from Gartner:
A data catalog creates and maintains an inventory of data assets through the discovery, description and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists and other line of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value.
The data catalog ties everything together. Think of it as “Google” for data. Before your analysts and other stakeholders can use the data, they must first find it. A well-curated catalog of all data sources, both internal and external, is critical.
In order to create a powerful data catalog, you must discover the data, understand it, inventory it, profile it, tag it, and document its relationships. These tasks will not change, regardless of how they are done or what catalog features you choose. Catalogs can be created manually for small, well-contained data universes covering well-understood domains, but this becomes increasingly difficult as disparate sources and domains must be tied together and as volume grows. As you grow, you may want to investigate machine-learning-augmented catalogs that crawl through the data and do this work using ML algorithms. Plan for the catalog’s expansion: at some point, you will need to connect data in your Hadoop universe to other sources, such as your enterprise data, and the catalog will start to grow exponentially.
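To make the discover-inventory-profile-tag workflow concrete, here is a minimal sketch of what an automated profiler might compute for a catalog entry. All names (`profile_dataset`, the `trial_enrollment` dataset, its columns and tags) are hypothetical illustrations, not references to any particular catalog product.

```python
def profile_column(values):
    """Build a minimal profile for one column: null rate, cardinality, samples."""
    non_null = [v for v in values if v not in ("", None)]
    null_rate = 1 - len(non_null) / len(values) if values else 0.0
    return {
        "null_rate": round(null_rate, 3),
        "distinct_values": len(set(non_null)),
        "sample_values": list(dict.fromkeys(non_null))[:5],  # first few distinct
    }

def profile_dataset(rows, name, tags):
    """Assemble a catalog entry: inventory metadata plus per-column profiles."""
    columns = rows[0].keys() if rows else []
    return {
        "dataset": name,
        "tags": tags,                     # tagging supports search and governance
        "row_count": len(rows),
        "columns": {col: profile_column([r[col] for r in rows]) for col in columns},
    }

# Hypothetical example data: a small clinical enrollment extract.
rows = [
    {"patient_id": "P1", "region": "North America"},
    {"patient_id": "P2", "region": "North America"},
    {"patient_id": "P3", "region": ""},
]
entry = profile_dataset(rows, "trial_enrollment", tags=["clinical", "pii"])
```

A real ML-augmented catalog does far more (type inference, PII detection, cross-source linking), but even a profile this simple lets a consumer judge whether a dataset is worth opening.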
Historically, much development work has been done in isolation throughout the digital revolution. Almost every data professional has experienced duplicated effort that persisted simply because of a lack of communication. Today, a variety of technologies are stepping into this communication and collaboration space to ensure that all work around data is well documented and well communicated.
For knowledge capture and sharing, products such as Confluence and other wiki-like applications allow data engineers, developers, and other experts to create living documents. Paired with agile processes, this can create sustainable workflows in which teams can see what has been done, leverage it, or pick up where another team left off.
Another collaboration standard is to keep all data and analytics work in source control software such as GitHub or GitLab. This provides the resources and transparency that allow teams to see not only what was done, but also how it was done.
Once the data is catalogued and collaboration is in place, other functions are still needed. Some vendors bundle these with their data catalogs, while others offer standalone tools that integrate into a variety of environments.
Data lineage tools document data in motion. This is critical to understanding the end result as seen by your data analysts and other data consumers. Lineage is crucial for DataOps: it enables engineers to do their work, supports transparency, clarifies aggregations and transformations, and enables linkage and integration. Lineage also supports your data catalog by improving the understanding of relationships between data sets, uncovering data dependencies, and identifying relationships between business assets.
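The dependency-tracing aspect of lineage can be sketched very simply: if each derived dataset records its inputs and the transformation that produced it, walking the graph backwards answers “where did this number come from?” The dataset names and transformations below are invented for illustration.

```python
# Hypothetical lineage store: each dataset lists its inputs and how it was made.
LINEAGE = {
    "sales_raw": {"inputs": [], "transform": "ingested from CRM export"},
    "region_glossary": {"inputs": [], "transform": "maintained by data stewards"},
    "sales_clean": {"inputs": ["sales_raw"],
                    "transform": "deduplicate, normalize currencies"},
    "regional_dashboard": {"inputs": ["sales_clean", "region_glossary"],
                           "transform": "aggregate by region"},
}

def upstream(dataset, graph):
    """Walk lineage backwards: every source a dataset ultimately depends on."""
    seen = []
    stack = list(graph.get(dataset, {}).get("inputs", []))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.append(d)
            stack.extend(graph.get(d, {}).get("inputs", []))
    return seen
```

The same graph, traversed forwards, supports impact analysis: before changing `sales_raw`, an engineer can see every downstream consumer that would be affected.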
Semantic tools include business glossaries, so that people can read in plain language what the data is supposed to mean. If you are looking at marketing data, for example, you will want your glossary to tell you that the term “North America” in the “regions” domain is defined as customers (or patients) within the U.S. and Canada, and that the term “South America” includes all of continental South America, plus Mexico, minus Brazil. This is a localized definition of region based on language rather than geography, and it is critical to understanding a dashboard that shows statistics for “Results in North America”.
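At its core, a glossary is just a governed lookup from (domain, term) to a plain-language definition. The sketch below encodes the localized region definitions from the example above; the structure and function names are hypothetical.

```python
# Hypothetical business glossary keyed by (domain, term). Definitions are the
# localized ones from the text, deliberately different from standard geography.
GLOSSARY = {
    ("regions", "North America"):
        "Customers (or patients) within the U.S. and Canada",
    ("regions", "South America"):
        "All of continental South America, plus Mexico, minus Brazil",
}

def define(domain, term):
    """Return the governed definition, or flag the gap for data stewards."""
    return GLOSSARY.get((domain, term), "No glossary entry - flag for data stewards")
```

The value is not the lookup itself but the governance around it: a dashboard labeled “Results in North America” can link each term to exactly one steward-approved definition.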
Other semantic tools are needed to support taxonomies and ontologies. If you intend to create sophisticated models and visualizations, you will almost certainly need an ontology to show the relationships between points of data. An ontology gives you a framework to understand and visualize the data, and the tools to tightly integrate data and business concepts.
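Where a glossary defines terms, an ontology defines relationships between them. A common representation is subject–predicate–object triples; the tiny clinical example below is invented for illustration, and real ontologies would use a standard such as OWL or RDF rather than Python structures.

```python
# Hypothetical mini-ontology as subject-predicate-object triples.
TRIPLES = [
    ("Patient", "enrolled_in", "ClinicalTrial"),
    ("ClinicalTrial", "conducted_at", "Site"),
    ("Site", "located_in", "Region"),
]

def related(entity, triples):
    """All entities directly linked to `entity`, with the linking relationship."""
    out = []
    for subject, predicate, obj in triples:
        if subject == entity:
            out.append((predicate, obj))
        elif obj == entity:
            out.append((predicate, subject))
    return out
```

Even this toy graph shows the payoff: a visualization tool can traverse the relationships to explain why patient counts roll up into regional results.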
Data protection includes physical and cyber protection, such as firewalls and attack monitoring, which this article will not cover. Other dimensions of data protection involve correct entitlement at the data-element level, correct categorization and treatment of sensitive data, data ownership, data encryption, and ethics rules and laws. New vendors are entering this space because of laws such as the GDPR (General Data Protection Regulation) and increasing media focus on data ownership and data privacy.
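Element-level entitlement can be pictured as a policy check applied field by field: each column carries a sensitivity tag, and a value is returned only to roles entitled to that tag. The tags, roles, and data below are all hypothetical, and production systems enforce this in the access layer rather than in application code.

```python
# Hypothetical sensitivity tags per column and entitlements per role.
SENSITIVITY = {"patient_id": "restricted", "diagnosis": "restricted", "region": "public"}
ENTITLEMENTS = {"analyst": {"public"}, "clinician": {"public", "restricted"}}

def redact(record, role):
    """Mask any field whose sensitivity tag the role is not entitled to see.
    Unknown fields default to restricted (fail closed)."""
    allowed = ENTITLEMENTS.get(role, set())
    return {
        field: (value if SENSITIVITY.get(field, "restricted") in allowed else "***")
        for field, value in record.items()
    }

row = {"patient_id": "P1", "diagnosis": "J45", "region": "North America"}
```

The fail-closed default matters: a newly added column is masked until someone explicitly categorizes it, which is exactly the behavior GDPR-era governance demands.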
There are a number of tools for measuring and monitoring data quality, and these are often in place as the result of developing an enterprise data management program. The best tools allow for transparency into the data streams, and the ability to see current data health for all data use cases. Although everyone says they want their data to be 100% clean and accurate, each data use case may in fact have varying requirements. Sometimes the speed of data will take precedence over accuracy. If you are trying to find out if a hurricane will ground your truck fleet, you are better off with 75% accuracy in 30 minutes than 100% accuracy in 30 days.
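The point that each use case has its own quality bar can be made explicit: let every consumer declare the completeness it needs and the staleness it can tolerate, then test a data delivery against those declarations. The use cases and thresholds below are illustrative, echoing the hurricane example above.

```python
# Hypothetical per-use-case quality requirements: the fleet-routing decision
# tolerates 75% completeness if the data is fresh; financial reporting
# demands 100% completeness but can wait up to 30 days.
REQUIREMENTS = {
    "fleet_routing":       {"min_completeness": 0.75, "max_age_minutes": 30},
    "financial_reporting": {"min_completeness": 1.00, "max_age_minutes": 30 * 24 * 60},
}

def fit_for_use(use_case, completeness, age_minutes):
    """A delivery is fit for use only if it meets both the completeness
    and freshness requirements declared for that use case."""
    req = REQUIREMENTS[use_case]
    return (completeness >= req["min_completeness"]
            and age_minutes <= req["max_age_minutes"])
```

Framing quality as fitness for a declared use, rather than a single global score, is what lets one pipeline serve both the dispatcher and the auditor.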
Learning to manage the data that exists at a company is essential. If we are going to instill the principles discussed in our first article, we need to take control of the data at our company. This can be arduous and must be done systematically, but with the right tooling any company can start to get ahead of its data problems. We think the tools mentioned here are a great start, and more will likely be added as a business identifies the specific opportunities in its space.