Production-ready Pipelining & Scientific Data Analytics
Research efficiency and effectiveness can improve tremendously when you harness the power of advanced analytics.
We combine the right skills with the agility required to address the rapidly evolving analytical needs of the pharma/biotech industry. Arrayo specializes in pipeline solutions that run on high-performance computing (HPC) systems as well as in distributed environments such as Apache Hadoop and its Hadoop Distributed File System (HDFS). Depending on a client's requirements, projected data growth, and performance targets, we have implemented analytic pipelines that perform extract-transform-load (ETL) operations directly and update data lakes or data warehouses in real time. Some projects benefit from open-source distributed compute frameworks such as Apache Spark, while others benefit from containerizing individual pipeline components and orchestrating them with Kubernetes.
We support our clients in transforming raw data into meaningful, insightful information more efficiently and effectively. The pipelines we build provide modular libraries, delivered as packages, archives, and artifacts, that perform low-level bioinformatics processes. These libraries are implemented in a variety of languages, including Scala (for Spark), Java, Python (PySpark for Python-based Spark), and R. The artifacts and modules are deployed as containerized processes that serve as application-level components of bioinformatics pipelines, which increases efficiency and promotes reuse.
In summary, we implement production-ready multi-omics analysis pipelines that demand substantial compute resources and can process millions of samples. These pipelines make it possible to extract information from vast multidimensional datasets in support of all areas of informatics research.
We leverage the value of existing data sources, both internal and external, by developing ontologies, appropriate data models, and data management practices, and by implementing relevant data processing tools. Our bioinformatics services include data curation and work with informatics ontologies (e.g., Gene Ontology and ChEBI). We have extensive experience loading and using large public data sources such as 1000 Genomes, TCGA, CCLE, and others.
Data Engineering for Scientific Platforms
Handling heterogeneous data for scientific data analytics is a huge challenge in the life sciences, pharma, and biotech. Our data engineering teams support the standardization of scientific and healthcare data, which unlocks hidden potential and opens analytics to scientific and clinical users in a self-guided manner. This eliminates the long delays between a user requesting data and a data steward fulfilling the request. The automated, repeatable methods enforced by this platform approach ensure reproducibility across data-driven analyses and support more automated, user-driven functionality.
Many complex situations and relationships can be modelled in data tables and columnar formats that are flexible across domains. We have experience implementing databases at different levels of normalization, including sixth normal form (6NF), as described by Christopher J. Date et al. Schemas in this form have only trivial join dependencies and produce tables that are considered “tall and skinny”. Formats such as PostgreSQL’s binary JSON (jsonb) column type allow for flexible schemas without overcomplicating a database with permutations of tables that all model one high-level concept. From infrastructure to data model flexibility, our team has experience implementing the right technology to fit the use case at hand.
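As a rough illustration of the “tall and skinny” layout, the sketch below stores each attribute of a sample in its own narrow table and rebuilds a wide row with joins. It is a minimal example under stated assumptions, not a description of any client schema: the table and column names (sample, sample_tissue, sample_read_count) are hypothetical, and Python's built-in sqlite3 module stands in for a production database such as PostgreSQL.

```python
import sqlite3


def build_sample_store(conn):
    """Create a 6NF-style layout: one narrow table per attribute.

    All table and column names here are hypothetical.
    """
    conn.executescript(
        """
        CREATE TABLE sample (
            sample_id INTEGER PRIMARY KEY
        );
        CREATE TABLE sample_tissue (
            sample_id INTEGER PRIMARY KEY REFERENCES sample(sample_id),
            tissue TEXT NOT NULL
        );
        CREATE TABLE sample_read_count (
            sample_id INTEGER PRIMARY KEY REFERENCES sample(sample_id),
            read_count INTEGER NOT NULL
        );
        """
    )


def add_sample(conn, sample_id, tissue=None, read_count=None):
    """Insert a sample; an unknown attribute is simply an absent row."""
    conn.execute("INSERT INTO sample (sample_id) VALUES (?)", (sample_id,))
    if tissue is not None:
        conn.execute(
            "INSERT INTO sample_tissue VALUES (?, ?)", (sample_id, tissue)
        )
    if read_count is not None:
        conn.execute(
            "INSERT INTO sample_read_count VALUES (?, ?)",
            (sample_id, read_count),
        )


def wide_view(conn):
    """Rebuild one wide row per sample by joining the narrow tables."""
    return conn.execute(
        """
        SELECT s.sample_id, t.tissue, r.read_count
        FROM sample AS s
        LEFT JOIN sample_tissue AS t ON t.sample_id = s.sample_id
        LEFT JOIN sample_read_count AS r ON r.sample_id = s.sample_id
        ORDER BY s.sample_id
        """
    ).fetchall()
```

In this layout a new attribute becomes a new narrow table rather than a new column on an existing one, and a missing value is an absent row instead of a NULL in a wide table.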
Arrayo has supported many experimental and academic algorithm optimization projects in which traditional bioinformatics techniques are simply not scalable, efficient, or practical in a commercial setting. By taking a modular, application-based approach, we help optimize these algorithms and take them to scale. Many of the bioinformatics modules we have written leverage open-source, community-vetted tools in processes implemented in Python or Java. Robust open-source biological programming frameworks such as BioPython and BioJava can enhance the processing of sample data when integrated with relational databases, which are used to construct arguments and interpret results. This moves away from the traditional approach, in which a process iterates through the contents of a directory and persists little or no error data or quality metrics in a usable store.
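The contrast with plain directory iteration can be sketched as follows: the files are still visited one by one, but every outcome, including failures, is written to a relational table that can be queried later. This is a minimal sketch under stated assumptions. The qc_result table, the run_qc and fasta_metrics helpers, and the choice of metrics are all hypothetical; SQLite stands in for a production database; and a small standard-library FASTA reader is used in place of a framework parser such as BioPython's so the example has no external dependencies.

```python
import sqlite3
from pathlib import Path


def fasta_metrics(path):
    """Return (record_count, total_bases, gc_fraction) for one FASTA file."""
    records = bases = gc = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                records += 1
                continue
            if records == 0:
                raise ValueError("sequence data before first FASTA header")
            seq = line.upper()
            bases += len(seq)
            gc += seq.count("G") + seq.count("C")
    if records == 0:
        raise ValueError("no FASTA records found")
    return records, bases, (gc / bases if bases else 0.0)


def run_qc(directory, conn):
    """Process every *.fasta file and persist metrics or the error per file."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS qc_result (
            file_name    TEXT PRIMARY KEY,
            status       TEXT NOT NULL,
            record_count INTEGER,
            total_bases  INTEGER,
            gc_fraction  REAL,
            error        TEXT
        )
        """
    )
    for path in sorted(Path(directory).glob("*.fasta")):
        try:
            count, bases, gc_fraction = fasta_metrics(path)
            row = (path.name, "ok", count, bases, gc_fraction, None)
        except (ValueError, OSError) as exc:
            # A failure is recorded as data instead of being lost in a log.
            row = (path.name, "failed", None, None, None, str(exc))
        conn.execute(
            "INSERT OR REPLACE INTO qc_result VALUES (?, ?, ?, ?, ?, ?)", row
        )
    conn.commit()
```

A downstream step could then select only the rows whose status is ok to build its argument list, while the failed rows remain available for review.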
Our pipelines and the integration of their individual modules follow the pattern of “dockerization”, which allows our clients to package analytics assets so that they are more scalable and more efficient to run.
At Arrayo, we combine industry knowledge with statistical, scientific, analytics, and technology expertise to make a profound positive impact on preclinical and clinical data analytics. We support our clients in optimizing the way they store, process, integrate, and aggregate preclinical and clinical data assets. Team Arrayo ensures excellence in scientific data analytics by using the most effective technology for each specific use case.