Context
Arrayo has partnered with a variety of clients ranging in size from small start-up biotechnology entities to large pharma and healthcare enterprises. Many of these projects have involved data systems engineering, both fully in the cloud and in hybrid on-premises/cloud solutions. One such client was a Harvard Medical School spin-out whose mission is to create antibodies and recombinant proteins for the entire repertoire of cell surface proteins expressed on human cells. We partnered with this client to architect, design, and implement a high-throughput laboratory information management system (LIMS) to manage the vast array of accumulating data. Our solution is primarily cloud based and hosted on AWS, with a few on-premises compute nodes that run laboratory instruments and robots.
The Project
Application Database
A relational database was needed to persist LIMS data and other information that must be rapidly available in a structured fashion. We used Amazon RDS to host PostgreSQL servers as the main persistence layer of our solution.
Many processes and operations in this solution needed a flexible, schema-less datastore to track the statuses of separate processes. We chose Amazon DynamoDB for its extremely low-latency read and write operations; a single DynamoDB table was sufficient to track the statuses of all long-running jobs within our LIMS solution.
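As a minimal sketch of this pattern (the table and attribute names here are hypothetical, not the client's actual schema), a single-table job-status store can be written against the AWS SDK for Java v2 like so:

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class JobStatusStore {
    private final DynamoDbClient dynamo = DynamoDbClient.create();
    private static final String TABLE = "lims-job-status"; // hypothetical table name

    /** Upsert the current status of a long-running job (e.g. QUEUED, RUNNING, DONE). */
    public void setStatus(String jobId, String status) {
        dynamo.putItem(PutItemRequest.builder()
                .tableName(TABLE)
                .item(Map.of(
                        "jobId", AttributeValue.builder().s(jobId).build(), // partition key
                        "status", AttributeValue.builder().s(status).build(),
                        "updatedAt", AttributeValue.builder().s(java.time.Instant.now().toString()).build()))
                .build());
    }

    /** Read a job's status with a single low-latency key lookup. */
    public String getStatus(String jobId) {
        var item = dynamo.getItem(GetItemRequest.builder()
                .tableName(TABLE)
                .key(Map.of("jobId", AttributeValue.builder().s(jobId).build()))
                .build()).item();
        return item.isEmpty() ? null : item.get("status").s();
    }
}
```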
Databases are placed in their own Virtual Private Cloud (VPC) for security purposes and are not accessible from the public internet (outside AWS); this eliminates a potential point of vulnerability at a critical piece of our infrastructure. Data from the databases are served through RESTful APIs architected as microservices.
Compute Environments and Hosting Custom Services
Our solution needed multiple RESTful APIs to serve data from the underlying (private VPC) databases hosted by RDS. We have found much success using a microservice architecture in similar systems to facilitate rapid development and continuous integration and deployment. Microservices allow small, incremental changes to be deployed without traditional, bulky, disruptive updates.
In this solution, we used Java's Spring Boot framework to develop and implement a multitude of individual microservices. To lower the burden of configuration, dependency management, and environmental compatibility, we used Docker as our containerization solution and AWS Elastic Container Registry (ECR) to host and version our Docker images.
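As an illustration only (the endpoint and field names are hypothetical, not the client's actual services), a Spring Boot microservice of this kind can be as small as a single annotated class:

```java
import java.util.Map;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// A self-contained Spring Boot microservice exposing one RESTful endpoint.
@SpringBootApplication
@RestController
public class SampleServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(SampleServiceApplication.class, args);
    }

    // Hypothetical read endpoint; in our solution, handlers like this query
    // the RDS-hosted PostgreSQL databases in the private VPC.
    @GetMapping("/samples/{id}")
    public Map<String, Object> getSample(@PathVariable String id) {
        return Map.of("id", id, "status", "RECEIVED");
    }
}
```

Each such service is then packaged as a Docker image and pushed to ECR, where ECS can pull and run it.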
Storing containers directly within the AWS infrastructure allows dynamic, scalable control of their deployment. AWS's Elastic Container Service (ECS) manages deploying, running, and scaling Docker containers on Elastic Compute Cloud (EC2) instances. Our ECS deployments included an Application Load Balancer (ALB) to distribute traffic across parallel containers and services. The main benefits of ECS over deploying and running services directly on EC2 include the following:
- Containers and services remain fully available, even while an updated version of a service is being deployed
- Load and usage are balanced: ECS can add parallel containers on demand and autonomously in response to higher load or rising HTTP request latency, and scale back down when demand is low to reduce operational costs (a configuration sketch follows this list)
- Log aggregation of individual microservices
- Alerting and fault-tolerance
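The scaling behavior described in the list above can be configured through AWS Application Auto Scaling. The sketch below is one plausible setup, not the client's exact configuration: the cluster name, service name, capacity bounds, and the choice of CPU as the tracked metric are all assumptions.

```java
import software.amazon.awssdk.services.applicationautoscaling.ApplicationAutoScalingClient;
import software.amazon.awssdk.services.applicationautoscaling.model.*;

public class EcsAutoScalingSetup {
    public static void main(String[] args) {
        ApplicationAutoScalingClient scaling = ApplicationAutoScalingClient.create();

        // Hypothetical identifiers: cluster "lims-cluster", service "sample-service".
        String resourceId = "service/lims-cluster/sample-service";

        // Allow ECS to run between 1 and 8 parallel copies of the container.
        scaling.registerScalableTarget(RegisterScalableTargetRequest.builder()
                .serviceNamespace(ServiceNamespace.ECS)
                .resourceId(resourceId)
                .scalableDimension(ScalableDimension.ECS_SERVICE_DESIRED_COUNT)
                .minCapacity(1)
                .maxCapacity(8)
                .build());

        // Add or remove containers automatically to hold average CPU near 60%.
        scaling.putScalingPolicy(PutScalingPolicyRequest.builder()
                .policyName("cpu-target-tracking")
                .serviceNamespace(ServiceNamespace.ECS)
                .resourceId(resourceId)
                .scalableDimension(ScalableDimension.ECS_SERVICE_DESIRED_COUNT)
                .policyType(PolicyType.TARGET_TRACKING_SCALING)
                .targetTrackingScalingPolicyConfiguration(
                        TargetTrackingScalingPolicyConfiguration.builder()
                                .targetValue(60.0)
                                .predefinedMetricSpecification(PredefinedMetricSpecification.builder()
                                        .predefinedMetricType(MetricType.ECS_SERVICE_AVERAGE_CPU_UTILIZATION)
                                        .build())
                                .build())
                .build());
    }
}
```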
Object Storage and Handling Large Files
The client's LIMS has a large component that handles next-generation sequencing (NGS) of DNA and RNA. These files can be massive, on the order of tens to hundreds of gigabytes. Their binary nature makes them ill-suited for storage directly in a database and dictates the need for a cloud-based object storage system.
Processes that operate on raw NGS read files interact with them through temporary in-memory streams that write back to the original file store. Files for actively processing jobs remain in AWS Simple Storage Service (S3), where they can be quickly retrieved over protocols such as HTTP.
Our storage engine exposed a RESTful API to serve files to processes and applications and to write new files to S3. Every file associated with the LIMS is persisted to S3, and its path and other relevant metadata are tracked in the application databases hosted in RDS.
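To sketch the write path (the bucket name and key layout are hypothetical), the API layer can persist a file to S3 with the AWS SDK for Java v2 and return the object key for the RDS metadata tables:

```java
import java.nio.file.Path;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class FileStore {
    private final S3Client s3 = S3Client.create();
    private static final String BUCKET = "lims-ngs-files"; // hypothetical bucket name

    /**
     * Upload a file to S3 and return the object key. The caller is expected to
     * persist the key and other metadata (run ID, checksum, size) in RDS.
     */
    public String store(String runId, Path localFile) {
        String key = "ngs/" + runId + "/" + localFile.getFileName();
        s3.putObject(PutObjectRequest.builder()
                        .bucket(BUCKET)
                        .key(key)
                        .build(),
                RequestBody.fromFile(localFile));
        return key;
    }
}
```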
As with our RDS services, S3 buckets are locked down and available only to AWS services. They are not exposed to the public, and we restrict all interaction with them to our RESTful API microservices. This greatly reduces the risk of files being inadvertently corrupted or mistakenly deleted.
Message Queueing and Long Running Processes
Our client had generated and accumulated a significant amount of data before the development of the LIMS. These data existed in files saved on employees' laptops and other disparate sources. We prefer to design our services to be (as Apple says) "short and chatty," meaning our design patterns favor transferring small amounts of data frequently rather than handing off large chunks for long-running processing.
We designed processes that allow historical files to be uploaded to RESTful APIs; the API consumes each file and publishes individual units of work (typically one unit of work = one row in a spreadsheet) to a message queue, where they can be picked up and efficiently processed.
We utilized AWS Simple Queue Service (SQS) to implement a lightweight message bus that takes the load of processing large files off the RESTful APIs. SQS is an alternative to the much higher-throughput message queue technologies offered by AWS, such as Amazon Managed Streaming for Apache Kafka (MSK) and Amazon Kinesis. These systems follow a very similar pattern: a "consumer" watches a logical queue and receives messages as they arrive on the bus, while the components that place data onto queues, referred to as "producers," are the point of initiation for moving messages through a pipeline.
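A minimal sketch of this producer/consumer pattern with the AWS SDK for Java v2 follows; the queue URL and message shape are hypothetical:

```java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class WorkQueue {
    private final SqsClient sqs = SqsClient.create();
    // Hypothetical queue; one message corresponds to one unit of work (one spreadsheet row).
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/lims-work-items";

    /** Producer: the API posts one unit of work per row parsed from an uploaded file. */
    public void publish(String rowJson) {
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(QUEUE_URL)
                .messageBody(rowJson)
                .build());
    }

    /** Consumer: a worker long-polls the queue, processes each unit, then deletes it. */
    public void drain(java.util.function.Consumer<String> processor) {
        for (Message m : sqs.receiveMessage(ReceiveMessageRequest.builder()
                .queueUrl(QUEUE_URL)
                .maxNumberOfMessages(10)
                .waitTimeSeconds(20) // long polling keeps the consumer cheap and "chatty"
                .build()).messages()) {
            processor.accept(m.body());
            sqs.deleteMessage(DeleteMessageRequest.builder()
                    .queueUrl(QUEUE_URL)
                    .receiptHandle(m.receiptHandle())
                    .build());
        }
    }
}
```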
Many of the processes that analyze NGS data require massive compute resources. Acquiring and operating EC2 instances to run analyses continuously would be a tremendous waste of money, as this use case saw only 2 or 3 NGS runs per week. To maximize cost efficiency, we implemented a system that uses AWS Batch to acquire EC2 resources, run the job, and release the resources on successful completion, preventing unnecessary EC2 billing from idle processes. AWS Batch integrates with ECR and dynamically scales the EC2 resources it acquires by estimating the size and resources needed to complete individual jobs as they enter the processing queue. We used AWS Lambda functions to control the starting of jobs and to pass the arguments and parameters required by the underlying Docker container managed by AWS Batch.
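As a hedged sketch of this Lambda control point (the job queue, job definition, and command-line flags are hypothetical names, not the client's actual configuration), a handler can forward parameters to AWS Batch as container command overrides:

```java
import java.util.Map;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.batch.BatchClient;
import software.amazon.awssdk.services.batch.model.ContainerOverrides;
import software.amazon.awssdk.services.batch.model.SubmitJobRequest;

// Lambda entry point that starts an NGS analysis job in AWS Batch.
public class SubmitNgsJobHandler implements RequestHandler<Map<String, String>, String> {
    private final BatchClient batch = BatchClient.create();

    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        // Hypothetical names; the job definition points at a Docker image in ECR.
        String runId = event.get("runId");
        var response = batch.submitJob(SubmitJobRequest.builder()
                .jobName("ngs-analysis-" + runId)
                .jobQueue("ngs-job-queue")
                .jobDefinition("ngs-analysis-jobdef")
                // Pass the caller's parameters through to the container's command line.
                .containerOverrides(ContainerOverrides.builder()
                        .command("analyze", "--run-id", runId)
                        .build())
                .build());
        return response.jobId();
    }
}
```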
AWS Lambda allows for the quick development of HTTP/HTTPS endpoints that accomplish small, quick tasks, such as calling another AWS managed service to drive eventing or process management. In addition to exposing control of Batch processes through Lambda, we implemented AWS Step Functions around the Batch jobs to orchestrate processes that must execute sequentially (with and without serial dependencies). Step Functions also allow containers to be sized differently at different segments of a job; this maximizes efficiency by using the minimum EC2 resources necessary to execute each step.
Security and Compliance
As mentioned before, databases and other sources of data persistence are secured by placing them in a private VPC that is not publicly accessible. Data is accessed through RESTful APIs that are exposed to potentially public sources; to that end, we apply OAuth 2.0 security principles to our RESTful APIs. OAuth 2.0 requires both an identity server (something to validate credentials such as usernames and passwords) and an authorization server that hands out "tokens," encoded values that validate a user, have a finite expiration period, and carry the user's "claims." Claims are typically role-based and can restrict users to particular endpoints hosted in the RESTful APIs. An example of claim-based authorization is the distinction between an "admin" user and a "standard" user: admin users can interact with RESTful endpoints that standard users cannot. These restrictions can trickle down to the database schemas, where sensitive fields on individual tables can be obfuscated for less-privileged users. Each resource API must have a "trust" with the authorization server, typically the authorization server's public key from an asymmetric key pair, so that tokens can be validated as authentic.
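In a Spring Boot resource service, this claim-based restriction can be expressed in a few lines. The sketch below assumes Spring Security 6 with the OAuth 2.0 resource-server starter and a `spring.security.oauth2.resourceserver.jwt.jwk-set-uri` property pointing at the authorization server's published public keys; the paths and role names are hypothetical:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
@EnableWebSecurity
public class ResourceServerConfig {

    @Bean
    SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http
            // Claim-based authorization: admin endpoints require the admin role claim.
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/admin/**").hasRole("ADMIN")
                .anyRequest().authenticated())
            // Validate incoming bearer tokens against the authorization server's
            // public keys (the "trust" described above).
            .oauth2ResourceServer(oauth2 -> oauth2.jwt(Customizer.withDefaults()));
        return http.build();
    }
}
```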
Outcome
- Digitized and automated numerous laboratory workflows, increasing overall laboratory efficiency by 30-40%.
- Created intuitive graphical user interfaces (GUIs) that reduce training time and streamline laboratory work.
- Centralized data capture into one source of truth for all laboratory data that can be accessed retrospectively, supports data analytics, and ensures transparency of project progress across the company.