Arrayo spoke with industry executives and experts about the keys to success in data strategy. We talked to Chief Data Officers, Chief Digital Officers, CFOs, and experts in financial services, pharma, biotech, and healthcare. Our goal was to answer one simple question: What works?
The people we spoke with come from a variety of organizations — some companies are young, agile startups, while others are mature Fortune 500 corporations.
In our previous research projects, we discussed data culture, “data nirvana,” and how to find the right balance between data defense and data offense. In this third installment, we focus on the DataOps approach. This third series of Arrayo’s cross-industry research project comprises four chapters.
In Chapter 1, we discussed the principles of DataOps and why the approach came about. In Chapter 2, we covered the tools used for managing and understanding a company’s data, and then the different places where data may live in a company’s data ecosystem. These storage layers are commonly referred to as a data stack.
Now that we have talked about DataOps and many of its associated technologies, we thought a couple of use cases would illustrate how these tools work together to drive success in different types of companies.
An international bank needed an analytic sandbox to support data scientists’ modeling for wealth management analytics. Once that work was completed, a secondary use emerged: the sandbox became an easy go-to source for reporting.
For the analytic sandbox, the team elected to use Dataiku, a multi-purpose pipelining and visualization tool. Curated datasets were loaded into an enterprise data lake, which fed the tool. The data lake grew to hold approximately twenty years of data, indexed by a web-based data catalog. A workflow tool was placed on top to help manage entitlements within the user group. This foundation let data scientists iterate and experiment quickly, which enabled them to succeed in their wealth management work.
In addition to the data scientists who create and upload models, internal and external reporting groups are able to use the sandbox to quickly create and demo reporting solutions. The reporting group worked closely with business clients on specifications, and the flexibility of the tooling allowed them to rapidly prototype and iterate. Because the data has been curated and managed, the level of trust is high, and reporting functionality can be built quickly on this trusted data to serve a dynamic, fast-moving environment. Once the business partners and the reporting group have agreed on the report output, the reporting group moves back to more familiar tools such as MicroStrategy, which is used for governance and long-term reporting requirements.
One of the challenges in this project has been defining and creating the shared layer of client data. This has taken more effort on the “human” versus the tech side than was anticipated.
Add-on query mechanisms to serve new groups include Impala and Hue. This has enabled the visualization team to create behavioral predictions and model success rates.
One key to success in this environment is to manage entitlements well and quickly, and to get governance in on the ground floor. When the right governance is there from the start, data quality is baked in and trust levels increase. Another key to success is conducting stellar change management. This involves understanding the impact that tech decisions have and always keeping an open door to stakeholders — especially new stakeholders you may not have considered before.
Also, do not reinvent the wheel — if other groups have world-class model frameworks, use what is in place already. This will save time and let you ramp up quickly.
A biotech company uses raw DNA and RNA sequence data from test subjects to detect cancer earlier and more cheaply. The goal is to make cancer detection as easy as a blood test, and to include it as part of everyone’s annual physical.
During a study, huge volumes of genetic sequence data (DNA reads) from test subjects are loaded as raw data into a data lake, using Databricks and Hadoop. The data in this data lake is considered “at rest” since it is the basis of all subsequent analysis.
The data scientists need to use this as the basis of their proprietary analysis on cancer detection. Each group working with the data needs specific transformations and a dedicated data warehouse that will support their research. Spark is used as the engine to drive this work, and the transformation layer is GATK (Genome Analysis Tool Kit), an industry-standard suite of programs developed to standardize genomic analysis. The transformed data is dropped into SQL data warehouses, where data scientists can do their modeling work. In some cases, additional end-user apps sit on top of the SQL data warehouses to enable fast analysis.
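In practice this transformation layer is GATK running on Spark, but the general shape of such a step — filter raw reads, then reshape the survivors into warehouse rows — can be sketched in plain Python. The read format, thresholds, and field names below are our own illustrative assumptions, not the company’s actual schema:

```python
# Hypothetical sketch of a transform step: filter raw DNA reads by
# length and average base quality, then reshape survivors into rows
# for a SQL warehouse table. Thresholds and fields are assumptions.

def passes_quality_filter(read, min_length=50, min_avg_quality=20):
    """Keep a read only if it is long enough and its mean quality
    score clears the threshold."""
    if len(read["sequence"]) < min_length:
        return False
    qualities = read["qualities"]
    return sum(qualities) / len(qualities) >= min_avg_quality

def to_warehouse_row(read):
    """Flatten a read into the (subject_id, sequence, avg_quality)
    shape a downstream warehouse table might expect."""
    avg_q = sum(read["qualities"]) / len(read["qualities"])
    return (read["subject_id"], read["sequence"], round(avg_q, 1))

def transform(reads):
    """The 'T' step: filter then reshape, analogous to a Spark
    filter-then-map over the data lake's raw reads."""
    return [to_warehouse_row(r) for r in reads if passes_quality_filter(r)]
```

In a real pipeline these functions would be Spark transformations distributed across the cluster; the logic per record is the same.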
A secondary benefit was achieved by designing a quality assurance layer using the same transformation technology that supported the data scientists. The QA process ensured that data loaded into the system did not fall outside specific tolerance levels.
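A tolerance check of this kind can be as simple as comparing batch-level metrics against allowed bands. The metric names and bands below are illustrative assumptions, not the company’s actual QA rules:

```python
# Hypothetical QA-layer sketch: flag any batch metric that falls
# outside its configured tolerance band. Metric names and bands
# are illustrative assumptions only.

TOLERANCES = {
    "mean_coverage": (25.0, 60.0),        # acceptable (min, max) band
    "duplicate_rate": (0.0, 0.15),
    "reads_loaded_millions": (100.0, 900.0),
}

def qa_violations(batch_metrics, tolerances=TOLERANCES):
    """Return (metric, value, band) tuples for every metric that is
    missing or outside its band, so the batch can be held back."""
    violations = []
    for metric, (low, high) in tolerances.items():
        value = batch_metrics.get(metric)
        if value is None or not (low <= value <= high):
            violations.append((metric, value, (low, high)))
    return violations
```

An empty result means the batch is within tolerance and can proceed into the warehouse; a non-empty one gives operations a concrete list of what went wrong.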
This use case highlights one of the main advantages that startups have when it comes to DataOps: when there is no existing data culture, it is easier to set up your data professionals for success. When these companies decide which technologies to use, it is much easier to pick the right one when switching does not mean unwinding long-standing contracts or undertaking years-long migrations. However, even if a company is just starting to modernize, we believe it will find a lot of value in all the data it has accumulated over the years.
The New York office of a European bank has come under increasing regulatory reporting scrutiny. This story will likely be familiar to many larger companies, as new reporting requirements seem to be popping up frequently. Being able to deliver on these regulatory requirements quickly and completely will help companies stay out of the spotlight and will allow analysts to get back to work.
In this case, data enters from a legacy loan accounting system and a variety of global markets trading systems. A middle layer does some flat-file preparation and enrichment. Client data is in the cloud, with some daily file feeds. The final report preparation is done in AxiomSL. All this data pipelining was in place, but the company was having a hard time resolving reporting issues and discrepancies.
The challenge was to monitor and control data that is opaque and difficult to trace. The goal was to push quality problems further “upstream” to the data stewards and operational groups, and to support a wide variety of platforms, including some primary sources that will be decommissioned in favor of newer enterprise applications.
The solution was to deploy a combination of Alteryx, Duco, and Informatica DQ to dip into the “data streams” where practicable, and to craft intelligent checkpoints where possible. Alteryx was used to bring disparate and unwieldy datasets together from a variety of files (including spreadsheets and emailed reports), structured databases, and some external data (such as public reports made available on the SEC website) to create “red flag” warnings and to conduct investigative foundational work to support IDQ.
Profiling was done in IDQ, followed by data quality rules and controls. Sets of actionable exceptions were created daily by IDQ, and these were packaged for data stewards on a daily or weekly basis. Standardized red-flag warning reports were also created to support controllers and quarterly close activities. Management dashboards were created using Tableau and Power BI. This allowed analysts to notice discrepancies and correct them before the reports reached the hands of regulators.
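IDQ’s rule engine is proprietary, but the general shape of turning rule failures into steward-ready exception packages is straightforward. A minimal sketch, where the rules, field names, and steward routing are our own assumptions for illustration:

```python
from collections import defaultdict

# Hypothetical sketch of daily exception packaging: run simple data
# quality rules over records and group failures by the data steward
# responsible. Rules and routing are illustrative assumptions only.

RULES = [
    ("missing_client_id", lambda rec: not rec.get("client_id"), "client_data_steward"),
    ("negative_notional", lambda rec: rec.get("notional", 0) < 0, "trading_ops_steward"),
]

def package_exceptions(records):
    """Return {steward: [exception, ...]}, ready to be sent out on a
    daily or weekly cadence."""
    packages = defaultdict(list)
    for rec in records:
        for rule_name, failed, steward in RULES:
            if failed(rec):
                packages[steward].append({"rule": rule_name, "record": rec["id"]})
    return dict(packages)
```

Routing exceptions by steward is what pushes quality problems “upstream”: the people who own the source data see the failures, rather than the report preparers downstream.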
In addition, line-by-line reconciliation was provided using Duco, which ensured that data quality was preserved throughout the entire pipeline.
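Duco’s matching engine is proprietary, but the essence of line-by-line reconciliation is a keyed comparison of two sides of the pipeline. A minimal sketch, with record shape and field names assumed for illustration:

```python
# Hypothetical reconciliation sketch: compare source and target record
# sets keyed by id and report breaks (missing rows and field
# mismatches). Field names are illustrative assumptions only.

def reconcile(source_rows, target_rows, key="id"):
    """Return a sorted list of human-readable breaks between sides."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    breaks = []
    for k in sorted(source.keys() | target.keys()):
        if k not in target:
            breaks.append(f"{k}: missing from target")
        elif k not in source:
            breaks.append(f"{k}: missing from source")
        elif source[k] != target[k]:
            breaks.append(f"{k}: field mismatch")
    return breaks
```

An empty break list is the signal that the pipeline preserved the data end to end; anything else becomes a work item for the operational groups upstream.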
This modernization has improved the accuracy, transparency, and cycle time for reports. While all these tools may feel like a big investment, this company is now ahead on their regulatory reporting and able to quickly deliver on whatever additional requirement comes next.
No matter the size of your company or your industry, there is a place for the principles and tools of DataOps. Good data will drive business success for the foreseeable future, and a good data culture is too important to get wrong. We have loved hearing all these data success stories, and we feel we are really starting to see companies turn the corner from data as a requirement to data as an asset.