Case Study

Genomics Pipeline

Our client had an existing data pipeline used to connect, clean, validate, annotate, and ingest genomic data from various sources. Their data harmonization efforts faced challenges with maintenance, updates for new data source versions, troubleshooting, and scalability.
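The clean, validate, annotate, and ingest steps above can be sketched as a minimal Python example. Everything here is a hypothetical stand-in, not the client's actual schema: the record fields, the chromosome whitelist, and the annotation table are illustrative only.

```python
# Illustrative only: a minimal clean -> validate -> annotate -> ingest
# sequence for variant-like records. Field names and lookup tables are
# hypothetical stand-ins for the client's real sources.

VALID_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

# Hypothetical annotation lookup; in practice this would come from a
# reference annotation source.
GENE_TABLE = {("1", 1000): "GENE_A", ("X", 2000): "GENE_B"}

def clean(record):
    """Normalize raw fields, e.g. strip a 'chr' prefix and whitespace."""
    record = dict(record)
    record["chrom"] = record["chrom"].strip().removeprefix("chr")
    record["ref"] = record["ref"].strip().upper()
    record["alt"] = record["alt"].strip().upper()
    return record

def validate(record):
    """Reject records with an unknown chromosome or non-positive position."""
    return record["chrom"] in VALID_CHROMS and record["pos"] > 0

def annotate(record):
    """Attach a gene symbol from the (hypothetical) annotation table."""
    record["gene"] = GENE_TABLE.get((record["chrom"], record["pos"]))
    return record

def ingest(raw_records):
    """Run the full clean -> validate -> annotate sequence."""
    cleaned = (clean(r) for r in raw_records)
    return [annotate(r) for r in cleaned if validate(r)]

raw = [
    {"chrom": "chr1", "pos": 1000, "ref": "a", "alt": "t"},
    {"chrom": "chr99", "pos": 5, "ref": "G", "alt": "C"},  # dropped: bad chromosome
]
print(ingest(raw))
```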


We started by reviewing all the processes together and mapping the pipeline end to end, benchmarking it as a whole as well as per step, to (1) become familiar with the various processes and (2) understand where performance could be improved. Next, we studied potential solutions such as KNIME, Airflow, and Spark on Amazon Web Services (AWS) EMR, and benchmarked them.
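Benchmarking a staged pipeline both per step and as a whole can be sketched as follows; the stage names and toy transformations are hypothetical placeholders for the real processes.

```python
# Sketch of per-step benchmarking, assuming each pipeline stage is a
# callable that takes and returns a dataset. Stage bodies are toy
# placeholders; the point is timing each step and the whole run.
import time

def benchmark_pipeline(stages, data):
    """Run `stages` (name -> function) in order, timing each step."""
    timings = {}
    for name, step in stages.items():
        start = time.perf_counter()
        data = step(data)
        timings[name] = time.perf_counter() - start
    timings["total"] = sum(timings.values())
    return data, timings

# Hypothetical toy stages standing in for the real processes.
stages = {
    "clean": lambda d: [x.strip() for x in d],
    "validate": lambda d: [x for x in d if x],
    "annotate": lambda d: [(x, len(x)) for x in d],
}

result, timings = benchmark_pipeline(stages, [" a ", "", "bb "])
print(result)  # [('a', 1), ('bb', 2)]
```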


Using Spark, the amount of code shrank considerably, so maintenance became a minor issue, and both version updates and scalability became a matter of configuration via the UI. Performance improved notably: the first benchmarks showed a significant reduction in processing time.
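As a sketch of scalability-by-configuration, a Spark job's parallelism and resources can be tuned through settings like the following rather than through code changes; the values below are illustrative, not the client's actual configuration.

```properties
# Illustrative spark-defaults settings (hypothetical values):
# scaling the job up or down means adjusting these, not the code.
spark.executor.instances         20
spark.executor.cores             4
spark.executor.memory            8g
spark.dynamicAllocation.enabled  true
spark.sql.shuffle.partitions     400
```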

Finally, the output fed a NoSQL database that could be queried with different tools, such as notebooks for Python, R, or Scala, or Athena, depending on the profile of the end user and their preferred way of accessing the data.
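For the Athena route, a query over the harmonized output might look like the following; the table and column names are hypothetical, chosen only to illustrate the kind of ad hoc analysis an end user could run.

```sql
-- Illustrative Athena query; table and column names are hypothetical.
SELECT chrom, gene, COUNT(*) AS n_variants
FROM harmonized_variants
WHERE chrom = '1'
GROUP BY chrom, gene
ORDER BY n_variants DESC
LIMIT 10;
```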
