Case Study

50x pipeline speedup in a scalable production environment

Context

Arrayo works with life science clients that need to scale their scientific data analytics from the research laboratory (dry and wet) to large clinical trials. When these trials include genomic data, the required compute resources can quickly outgrow the environment that was sufficient for earlier preclinical work. In such cases, the processing needs to move into a distributed computing environment such as Apache Spark. Team Arrayo specializes in taking these workflows to the next level in on-prem, cloud-based, and hybrid environments.

The Project

Data Infrastructure

Given the large amounts of data to be housed and processed, the Arrayo team worked with the client to transfer their legacy data into a Microsoft Azure data lake. Because the group was working towards clinical trials, history and traceability of any changes made to the data were paramount, which led the Arrayo team to adopt the new Delta Lake storage layer offered by Databricks. This change places little burden on the development team while allowing better handling of metadata and making the system more reliable.

Pipeline Transfer

Now that the client had data infrastructure that could handle their size demands, it was up to the Arrayo team to help build out pipelines that could perform the required computation. Using the R, Python, and Scala APIs for Apache Spark, the client worked closely with Arrayo’s team to ensure that the computation could run within reasonable time tolerances. The client's most burdensome pipeline saw a speedup of more than 50x. Along with speeding up the pipelines, the Arrayo team built a testing suite to ensure that the new distributed pipeline gives the same results as the legacy pipeline when given the same data. These thorough concordance tests keep the team on the path of continuous integration: as soon as non-concordance is identified, they can use the existing test cases to ensure that new changes don’t have unintended consequences on the pipeline.
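A concordance check of this kind can be sketched in a few lines of plain Python. This is only an illustration and not the test suite Arrayo built: the function name `find_discordant`, the sample identifiers, the `score` field, and the tolerance values are all hypothetical, and the results are assumed to have already been collected from each pipeline into dictionaries keyed by sample.

```python
import math
from typing import Dict, List

Results = Dict[str, Dict[str, float]]  # sample id -> {field name -> value}


def find_discordant(legacy: Results, distributed: Results,
                    rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> List[str]:
    """Describe every difference between legacy and distributed results.

    An empty list means the two pipelines are concordant on this input.
    The tolerances are illustrative assumptions, not values from the project.
    """
    problems: List[str] = []
    for sample in sorted(set(legacy) | set(distributed)):
        if sample not in distributed:
            problems.append(f"{sample}: missing from distributed output")
            continue
        if sample not in legacy:
            problems.append(f"{sample}: unexpected in distributed output")
            continue
        old_row, new_row = legacy[sample], distributed[sample]
        for field in sorted(set(old_row) | set(new_row)):
            if field not in old_row or field not in new_row:
                problems.append(f"{sample}.{field}: present in only one output")
            elif not math.isclose(old_row[field], new_row[field],
                                  rel_tol=rel_tol, abs_tol=abs_tol):
                problems.append(
                    f"{sample}.{field}: {old_row[field]} != {new_row[field]}")
    return problems


# Hypothetical results for the same input run through both pipelines.
legacy_results = {"sample_1": {"score": 0.25}, "sample_2": {"score": 1.5}}
distributed_results = {"sample_1": {"score": 0.25}, "sample_2": {"score": 1.5}}
assert find_discordant(legacy_results, distributed_results) == []
```

Returning the full list of differences, rather than stopping at the first one, makes it easier to tell whether a new change broke a single sample or the whole pipeline.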

Outcome

Increased speed of existing academic pipeline by more than 50x in a scalable production environment

Ensured concordant results with legacy pipeline used in prior trials

Moved to new data infrastructure to increase transparency