Case Study

50x pipeline speedup in a scalable production environment

Context

Arrayo has worked with life-science-focused clients who need to scale their scientific data analytics from the research laboratory (dry and wet) to large clinical trials. When these large clinical trials include genomic data, the compute requirements can quickly outgrow the environment that was sufficient for preclinical work. In such cases, processing needs to move to distributed computing environments such as Apache Spark. Team Arrayo specializes in taking these workflows to the next level in on-prem, cloud-based, and hybrid environments.

Taking a more specific deep dive, an early-stage clinical diagnostics company needed to implement a high-throughput liquid biopsy screen to detect early stages of cancer. The assay relied on multi-omics analysis, which demanded a great deal of compute resources. With no existing compute or storage infrastructure, the client had a wide choice of architectures for developing a cost-effective, efficient, and scalable system. The anticipated number of annual samples going through the pipeline was in the millions; reducing analysis time was absolutely critical to our client's success.

The Project

Data Infrastructure

Given the large amounts of data to be stored and processed, the Arrayo team worked with the client to transfer their legacy data into a Microsoft Azure data lake. Our client was working towards clinical trials, so history and traceability of the data were paramount. With this in mind, the Arrayo team decided to work with the Delta Lake storage layer offered by Databricks. This type of change places little burden on the development team while allowing much better handling of metadata and making the system more reliable.
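
As a rough illustration of the pattern (not the client's actual code), landing legacy data in the Azure data lake as a Delta table and reading back an earlier version for traceability might look like the following PySpark sketch; the storage account, container, and paths are hypothetical.

    from pyspark.sql import SparkSession

    # Assumes a Databricks / Delta-enabled Spark runtime; paths are illustrative only.
    spark = SparkSession.builder.appName("delta-migration-sketch").getOrCreate()

    # Land legacy data into the Azure data lake as a Delta table.
    legacy_df = spark.read.parquet(
        "abfss://raw@storageaccount.dfs.core.windows.net/legacy_samples/"
    )
    legacy_df.write.format("delta").mode("overwrite").save(
        "abfss://lake@storageaccount.dfs.core.windows.net/delta/samples"
    )

    # Traceability: Delta keeps a transaction log, so an earlier version of the
    # table can be read back for audit or re-analysis ("time travel").
    v0 = (
        spark.read.format("delta")
        .option("versionAsOf", 0)
        .load("abfss://lake@storageaccount.dfs.core.windows.net/delta/samples")
    )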

Pipeline Transfer

Now that the client had a more advanced data infrastructure that could handle their data demands, it was up to the Arrayo team to build out pipelines that could perform the required computations. Using the R, Python, and Scala APIs for Apache Spark, Arrayo and the client worked closely to ensure that the computations ran within the required time tolerances.
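
A minimal sketch of the kind of distributed step such a pipeline performs, using the Python API for Spark; the table, column names, and quality thresholds are hypothetical and stand in for the client's actual bioinformatics logic.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("variant-summary-sketch").getOrCreate()

    # Hypothetical per-sample variant calls already landed in a Delta table.
    calls = spark.read.format("delta").load("/delta/variant_calls")

    # Distribute a simple QC aggregation across the cluster: per-sample call
    # counts and mean read depth, filtered to calls that pass a quality cutoff.
    summary = (
        calls.filter(F.col("quality") >= 30)
        .groupBy("sample_id")
        .agg(
            F.count("*").alias("pass_calls"),
            F.avg("read_depth").alias("mean_depth"),
        )
    )

    summary.write.format("delta").mode("overwrite").save("/delta/sample_qc_summary")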

A high-throughput Hadoop Distributed File System (HDFS) was implemented in a cloud-based service, in conjunction with a managed Apache Spark runtime. The automated analysis pipelines were executable through simple RESTful API requests exposed to laboratory information management systems (LIMS), triggering automated cluster acquisition, high-throughput bioinformatics analysis in Apache Spark, and robust output to a Delta Lake built on HDFS (a data lake layer that enforces ACID (atomicity, consistency, isolation, and durability) transactions).
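
The trigger mechanism can be sketched as a plain HTTPS call from the LIMS to the managed Spark service's jobs endpoint. The sketch below follows the general shape of the Databricks Jobs API, but the workspace URL, token, job ID, and parameter names are assumptions for illustration, not the client's actual integration.

    import requests

    # Illustrative values only; in practice these come from the LIMS configuration.
    WORKSPACE_URL = "https://adb-000000000000.0.azuredatabricks.net"
    API_TOKEN = "dapi-REDACTED"
    JOB_ID = 12345  # pre-defined Spark job wrapping the bioinformatics pipeline

    def trigger_pipeline(sample_batch_id: str) -> int:
        """Ask the managed Spark service to acquire a cluster and run the pipeline
        for one batch of samples; returns a run identifier for status polling."""
        response = requests.post(
            f"{WORKSPACE_URL}/api/2.1/jobs/run-now",
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            json={"job_id": JOB_ID, "notebook_params": {"batch_id": sample_batch_id}},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["run_id"]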

In the most burdensome pipeline, the client saw a 50x reduction in run time. Along with improving pipeline efficiency, the Arrayo team built out a testing suite to ensure that the new distributed pipelines would produce the same results as the legacy pipelines when given the same input data. This thorough concordance testing kept the team on the path of continuous integration: as soon as non-concordance is identified, previous test cases can be used to verify that new changes will not have unintended consequences on the pipelines.
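
A concordance check of this kind can be as simple as joining legacy and new pipeline outputs on a sample identifier and flagging any values that differ beyond a numeric tolerance. The sketch below assumes a hypothetical schema (a "sample_id" key and a numeric "score" column) rather than the client's actual outputs.

    from pyspark.sql import DataFrame, functions as F

    def check_concordance(legacy: DataFrame, new: DataFrame,
                          key: str = "sample_id",
                          value_col: str = "score",
                          tol: float = 1e-6) -> DataFrame:
        """Return rows where the new pipeline's output disagrees with the legacy
        pipeline beyond the allowed tolerance (an empty result means concordant)."""
        joined = (
            legacy.select(key, F.col(value_col).alias("legacy_value"))
            .join(new.select(key, F.col(value_col).alias("new_value")),
                  on=key, how="full")
        )
        return joined.filter(
            F.col("legacy_value").isNull()
            | F.col("new_value").isNull()
            | (F.abs(F.col("legacy_value") - F.col("new_value")) > tol)
        )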

A fully cloud-based, vendor-agnostic architecture yielded a portable and scalable code base that could be carried forward into any future cloud deployment. Development operations (DevOps) were handled with open-source Git repositories for source control, Jenkins for builds, and automated artifact and container storage in cloud repositories. Arrayo played a major role in the architecture, development, and implementation of the bioinformatics pipelines, data lake formation, and DevOps activities throughout the project.

Outcome

Increased the speed of pre-existing academic pipelines and new pipelines by 50x in a scalable production environment.

Established an advanced testing infrastructure to ensure results concordant with the legacy pipelines and to ensure that new pipelines would be concordant with pipelines used in previous clinical trials.

Moved to a new data infrastructure to increase transparency and traceability and to ensure that all data models were preserved over time.