INSIGHTS
Article

DataOps Series — Chapter 3: A Data Stack

Arrayo spoke with industry executives and experts about the keys to success in data strategy. We talked to Chief Data Officers, Chief Digital Officers, CFOs, and experts in financial services, pharma, biotech and healthcare. Our goal is to find an answer to one simple question: What works?

The people we spoke with come from a variety of organizations: some are young, agile startups, while others are mature Fortune 500 corporations.

In our previous research projects, we discussed data culture and “data nirvana,” and how to find the right balance between data defense and data offense. In this third installment, we focus on the DataOps approach. This series of Arrayo’s cross-industry research project comprises four Chapters.

In Chapter 1, we discussed the Principles of DataOps and why the approach came to be. In Chapter 2, we discussed the tools involved in managing data. Here in Chapter 3, we discuss the various components of a Data Stack, and in the fourth and final Chapter we will discuss Use Cases that illustrate how teams can thrive with a strong data culture.

What is your data stack? Every one is different, because each enterprise has its own mix of data repositories, data sources, pipelines, legacy systems, data consumers, and data use cases. How dynamic is your data environment? It may be large and diverse, or small and fairly focused. You might see new sources of data being added rapidly, or the universe might be stable, with infrequent and well-understood changes. Regardless of the specifics of your strategic, tactical, and operational approach, you have either inherited or must define the components of your data stack to create, process (curate, transform), store, and consume data.

Data Creation

Data enters the data stack where it is created. This could be in enterprise applications, documents, spreadsheets, files, external data, or vendor data. Other sources might stream in, such as data from RFID tags or measuring devices.

Enterprise Applications

Enterprise apps and systems bundle functionality and data together. The app is there to serve a purpose, and it creates data at a fast pace. The challenge is to bring in the data, and the information it represents, seamlessly and possibly on an as-needed basis. Enterprise apps include legacy purpose-built software, vendor products, cloud-based applications, and possibly even older platforms that remain critical to the middle office. The difficulties and roadblocks may be technical, such as understanding or pulling out the components of a black-box calculator, or they may be what one specialist we spoke with calls “data MINE-ing”: an emotional response to allowing other groups access to “my” data. Often these sources are well understood by internal groups, with minimal technical challenges, and folding them into the ecosystem may not be difficult once the ball gets rolling.

Files, Spreadsheets, Documents

Having critical data that resides in spreadsheets and documents is not unusual, but it creates a unique challenge. In the past, the push was to eliminate spreadsheets and files by moving their contents into a data warehouse. Rarely can this be done quickly enough to eliminate the use of off-line sources such as files, spreadsheets, documents, or PDFs.

Newer tools exist to tame these data sources. Machine learning and advanced parsing solutions, based on controlled vocabularies and natural language processing, can be leveraged to find data that resides inside documents, files, spreadsheets, and web pages. Automating business processes that parse documents, spreadsheets, and files will make the resulting data machine readable and more manageable.
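As a concrete illustration, the sketch below (Python with pandas) shows one small automation step: pulling a table out of an ad-hoc spreadsheet so the data becomes machine readable. The file name, sheet name, and column names are hypothetical placeholders.

```python
# Minimal sketch: extract a usable table from an ad-hoc spreadsheet.
# The file name, sheet name, and column names are hypothetical placeholders.
import pandas as pd

def extract_positions(path: str) -> pd.DataFrame:
    # Read only the columns of interest from a known worksheet.
    df = pd.read_excel(path, sheet_name="Positions",
                       usecols=["Trade Id", "Notional", "Currency"])
    # Normalize headers so downstream code is not tied to spreadsheet styling.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Drop rows that are entirely blank (common in hand-maintained sheets).
    return df.dropna(how="all")

positions = extract_positions("month_end_positions.xlsx")
print(positions.head())
```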

External Data and Vendor Data

Most data ecosystems have a variety of external data that needs to be folded in. This could be vendor data, such as Bloomberg in finance. It could be the results of clinical trials, Real World or TCGA data for pharma, or scrubbed patient data. It may be external datasets that sit in the cloud. It may be weather data, or data streaming in from RFID tags. It may be PDFs on an external site, or Electronic Medical Records.

Some of this data is well curated and well understood, and it incorporates tamely into your ecosystem already. Other sources will be more difficult to manage: they may require parsers before they can be brought in, or you may need AI tools to read, extract, wrangle, and clean errors on the fly.

Storage and Processing

Once the data has entered your data stack, the middle layer is where the data becomes usable. This is where you will find the data lakes, data hubs, data repositories, data warehouses, and datamarts. On top of that, you may have virtualization layers and analytics platforms. Pipelines move your data, transform your data, and connect disparate data sources together.

Data Transformations

Now that you have located your data, it is important to make sure it ends up in the right place and that you understand exactly what it means. Most data will go through at least one transformation before it becomes usable for most of your data consumers.

One kind of transformation we will consider is data curation. Every data professional is well aware of the quality issues that exist in almost every dataset. Whether this means completely bad data or data that needs to be altered before consumption, data curation is an important step. This process must be done at least once with every dataset, and being able to remove anomalies or flag possible inaccuracies should be a target for any important dataset at your company.
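A minimal curation sketch in Python with pandas might look like the following; the dataset, key columns, and plausibility thresholds are hypothetical, and real curation rules would come from the data owners.

```python
# Minimal curation sketch: deduplicate, drop unusable rows, and flag
# suspect values instead of silently deleting them. Column names and
# thresholds are hypothetical.
import pandas as pd

def curate(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                      # remove double-loaded rows
    df = df.dropna(subset=["sample_id", "value"])  # key fields must be present
    # Flag values outside a plausible range so a reviewer can decide.
    df["suspect"] = ~df["value"].between(0.0, 1000.0)
    return df
```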

Another transformation that usually must occur is mapping of data. When we encounter a new dataset, we likely want to incorporate it with our current data sources. To do this successfully, we must map the fields from the new dataset to the existing data as best we can. This happens in almost every field and can be a time-consuming project. Whether this means standardizing new data from the ledger of an acquisition or transforming lab data into existing formats for your computational team, this step is a must.
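The sketch below illustrates such a mapping in Python with pandas; the source column names, target schema, and code translations are hypothetical.

```python
# Minimal field-mapping sketch: rename source columns to the standard
# schema and translate free-text values into expected codes.
# All names below are hypothetical.
import pandas as pd

FIELD_MAP = {"Acct Nbr": "account_id", "Amt": "amount", "Ccy": "currency"}
CURRENCY_CODES = {"US Dollar": "USD", "Euro": "EUR"}

def map_to_standard(df: pd.DataFrame) -> pd.DataFrame:
    df = df.rename(columns=FIELD_MAP)
    # Translate currency names to ISO codes; keep unmapped values as-is.
    df["currency"] = df["currency"].map(CURRENCY_CODES).fillna(df["currency"])
    return df[list(FIELD_MAP.values())]
```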

Transformations like these happen all the time and are usually made into repeatable processes. They are typically performed alongside other ETL work and form a type of pipeline, which we discuss next.

Pipelines

The term Data Pipeline applies to almost anything that moves or connects data. Pipeline technologies have improved to the point where some companies use them to support democratized data analysis across a wide swath of the ecosystem. Pipelines can collect, curate, transform, and harmonize data to enable complex analysis; scientific analytical pipelines, for example, transform and standardize biological data types to produce results that address specific scientific challenges. Many pipeline tools are open source: Python-based options such as Apache Airflow, Luigi, and petl; Nextflow and the Broad Institute’s Cromwell, built for scientific workflows; KNIME; Apache Kafka; and others.
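To make the repeatable-process idea concrete, here is a minimal Apache Airflow sketch chaining extract, transform, and load steps into a scheduled pipeline. The task bodies are placeholders, and parameter names can differ across Airflow versions.

```python
# Minimal Airflow DAG sketch: three placeholder tasks run daily in order.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull raw data from a source system

def transform():
    ...  # curate and map fields into the standard schema

def load():
    ...  # write curated data to the warehouse or lake

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_transform >> t_load
```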

Numerous offerings exist in this space, and even more vendors are expected to enter the sector. Alteryx, Databricks, Dassault, Informatica, Xplenty, and Talend are a few brand names that come to mind, but no matter what kind of data pipeline you are trying to build, there is a vendor ready and eager to help you out. There are solutions for every need, from small niche players to large enterprise-wide installations. A trend in this sector is to automate data flows and provide out-of-the-box platforms ready for self-service analytics aimed at business users. Vendors such as Dataiku and Anaconda offer frameworks that let data analysts write their own pipelines, freeing them from dependence on data engineering support.

Data Warehouses / Structured Repositories

Data Warehouses and structured data repositories have been with us for many years. They are well understood, and usually well architected and managed. Their limits become obvious when several of them must be joined to harness the real power of your data. Newer offerings address this: Snowflake, for example, is an analytic data warehouse provided as Software-as-a-Service (SaaS) that allows enterprise data to be combined with external data in near real time to deliver the fast data that analytics needs.
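As an illustration, the sketch below queries Snowflake from Python using the snowflake-connector-python package to join an internal table with externally shared data; the account, credentials, and table names are hypothetical.

```python
# Minimal Snowflake query sketch; connection parameters and table names
# are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",
    database="ENTERPRISE",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    # Join internal positions with an externally shared market-data table.
    cur.execute("""
        SELECT p.instrument_id, p.quantity, m.close_price
        FROM positions p
        JOIN external_market_data m ON m.instrument_id = p.instrument_id
    """)
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```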

Some repositories are the data stores that sit behind an enterprise app, where the structure and content of the data are bound tightly to the functionality of the app. Others have been designed with specific consumers in mind: risk data warehouses, marketing warehouses, analytics data marts, forward and reverse translational analysis. The advantage is a well-understood repository. The disadvantage is that updates and changes can be difficult to implement quickly. For example, adding a new financial product to a data warehouse is notoriously slow from start to finish, often due to necessary but time-consuming regression testing.

Data Lakes

Almost every large organization has a Data Lake someplace where its data can be retained for long periods at low cost and where data-hungry users can go to combine many disparate sources. A data lake is nothing more than a central location where many different forms of data reside in the same infrastructure. Data lakes have built-in fault tolerance: Hadoop's HDFS, for example, defaults to keeping three replicas of all data, making it extremely unlikely for information to be lost. This design also allows computation tools such as Spark to perform large and complicated calculations across distributed data at a speed not possible in the past.
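The sketch below shows the kind of distributed computation this enables, using PySpark to read Parquet files straight from a lake and aggregate across them; the paths and column names are hypothetical.

```python
# Minimal PySpark sketch: read a partitioned dataset from the lake and
# aggregate it in a distributed fashion. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-aggregation").getOrCreate()

# Read raw event files directly from lake storage (HDFS here; S3/ADLS work similarly).
events = spark.read.parquet("hdfs:///lake/raw/events/")

# Count events per day and type across however many nodes hold the data.
daily_counts = (
    events.groupBy("event_date", "event_type")
          .agg(F.count("*").alias("n_events"))
)
daily_counts.write.mode("overwrite").parquet("hdfs:///lake/curated/daily_counts/")
```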

Data Lakes also provide connectors so that data professionals can use languages such as R and Python to extract insights and perform the types of analysis that are mission-critical to your enterprise. Data Lakes have been a mature part of the data ecosystem for some time, due to the simplicity of the infrastructure, the ease of setup, and the value that companies derive from them.

Data Virtualization

Data virtualization ties multiple data sources together without having to physically move the data. Tools like Denodo can catalog and provide access to the entire data ecosystem without the need for data replication. Virtualization and integration are terms that are sometimes used interchangeably, although they are different: integration usually involves technical understanding of the location and format of the data, whereas virtualization allows data retrieval without the need to know where the data physically resides. A virtualization layer is nothing more than a logical data layer that integrates data across many locations (internal silos or a lake combined with cloud data or spreadsheets, for example). Virtualization might augment your current suite of databases and data lakes, or it may replace consumer datastores. In either case, the end result should be invisible to your data consumers; they do not care how you did it, they just want it to work.
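As a sketch of what this looks like to a consumer, the example below queries a single logical view over a standard ODBC connection (virtualization platforms typically expose SQL endpoints such as ODBC/JDBC); the DSN and view name are hypothetical.

```python
# Minimal sketch: the consumer queries one logical view; the virtualization
# layer resolves which underlying systems actually hold the data.
# The DSN and view name are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=virtual_layer")
cur = conn.cursor()
cur.execute("SELECT customer_id, total_exposure FROM v_customer_exposure")
for row in cur.fetchmany(10):
    print(row)
conn.close()
```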

Data Consumers

The final layer is your data consumers. This is where data use cases are generated. These groups include analysts, data visualization specialists, data scientists, and enterprise internal and external reporting functions. Data consumers use tools that range from analytics platforms with REST APIs serving data, to management reports in spreadsheets, to more traditional page-based reporting. A macro in an Excel spreadsheet can be considered a reporting tool, and there are niche and specialized tools such as AxiomSL, which enables and automates the preparation of regulatory reports. Analytics platforms are multifaceted and varied, from vendor products to purpose-built internal systems.
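For example, a consumer pulling a dataset from an analytics platform's REST API might look like the sketch below; the URL, endpoint, parameters, and token are hypothetical.

```python
# Minimal sketch of a data consumer calling an analytics REST API.
# The URL, endpoint, parameters, and token are hypothetical placeholders.
import requests

resp = requests.get(
    "https://analytics.example.com/api/v1/reports/daily-risk",
    headers={"Authorization": "Bearer <token>"},
    params={"as_of": "2024-06-28"},
    timeout=30,
)
resp.raise_for_status()
report = resp.json()
print(report["summary"])
```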

Visualization tools have quickly become the tools of choice for data presentation, and they are now quite mature. Leaders here are Tableau, Spotfire, and Power BI, but there are many others in this space. All will create sophisticated dashboards with a variety of capabilities. Some offer more tools to connect and transform the data before it is displayed, and others specialize in sophisticated visual displays with a rich array of functions for the consumer, including drill-downs, drill-throughs, and write-backs. No self-respecting end-user visualization tool is complete without a good set of features for creating dashboards and reports for mobile devices, and most have these. Some are geared towards giving power to your dedicated developers; others have taken the “each one, teach one” route and make it easy for your consumers to build their own data visualizations so you can concentrate on the data.

Summary

Data exists all around every modern company, and wherever that data exists, the storage and processing choices must fit each particular use case. Obviously, there are a multitude of ways for data to be stored, many of which we did not touch on. Making sure that data can move easily through your company’s data stack is essential to any DataOps practice. Access issues will continue to exist, especially in highly regulated industries, but moving towards a more modern stack can help companies identify new business opportunities or a competitive edge that can drive new profits.

*This article was written for SteepConsult Inc. dba Arrayo by Renée Colwell and John Hosmer.