Elastic Cloud Processing Engine for Genetic Data
A company that specializes in genome processing hit the limit of its computing capacity. It needed genome processing to be fully automated and moved to the cloud.
We built a scalable, cloud-native, and cloud-agnostic service for executing genome processing jobs.
Our client is an innovative genome processing startup and the world’s largest DNA testing and analysis platform. The company aims to provide end users with a platform to store, use, and understand their genetic data easily. The platform consists of components that securely upload genetic data files of any size directly into the user’s account, a genome processing engine, and various APIs for third-party integration. To accelerate growth and handle an increasing number of users, our client was looking for an established digital health technology enabler to build a high-performance, elastic cloud system for batch genome processing that would scale out automatically depending on demand.
The system requirements were well defined: the target system had to control computational resources at a fine-grained level for each genome processing job so that jobs could be treated differently according to business rules. For instance, jobs from premium clients should be given more resources and prioritized. Our client required a trusted technology partner to provide top-notch software engineering services in the cloud and distributed systems domain.
The urgency of satisfying an increasing load resulted in tight implementation timeframes and the need for exceptional technical expertise to build a high-performance, resilient, and HIPAA-compliant data processing infrastructure, incorporating a series of digital solutions and services. As a result, our client turned to Plexteq as a proven health technology expert proficient in medical software development. After a small trial project that demonstrated our strong product development and delivery management capabilities, our client entered into an extensive, long-term strategic partnership with Plexteq.
Existing genome job processing on in-lab equipment was slow and inefficient
The existing process was semi-automated and required manual intervention
The existing genome processing environment was not HIPAA compliant
Genome processing is generally a computationally heavy process. Depending on the kind of processing, a single job may consume many hours or even days of CPU time, along with substantial memory and I/O. Therefore, the main idea of our R&D project was to:
Leverage public clouds where more computational power could be allocated on demand immediately when the end-users needed to perform analytical processing
Use advanced containerization techniques with fine-grained resource control, so that computational resources were strictly partitioned between jobs and no job could hog system resources or cause other jobs to stall
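As a rough illustration of per-job resource control with containers, the sketch below assembles a `docker run` command line that caps CPU and memory for a single job. It is a minimal example under stated assumptions, not the actual implementation: the class name `JobContainerCommand` and the image name used in the usage note are hypothetical, while `--cpus`, `--memory`, and `--rm` are standard Docker CLI options.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds a `docker run` command line with hard CPU and memory limits for one job. */
final class JobContainerCommand {

    private JobContainerCommand() {}

    /**
     * @param image    container image with the processing tool (hypothetical name)
     * @param cpuCores maximum number of CPU cores the job may use
     * @param memoryMb hard memory limit in megabytes
     */
    static List<String> build(String image, double cpuCores, int memoryMb) {
        if (cpuCores <= 0 || memoryMb <= 0) {
            throw new IllegalArgumentException("CPU and memory limits must be positive");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("docker");
        cmd.add("run");
        cmd.add("--rm");                        // remove the container when the job exits
        cmd.add("--cpus=" + cpuCores);          // cap CPU usage for this job
        cmd.add("--memory=" + memoryMb + "m");  // cap memory for this job
        cmd.add(image);
        return cmd;
    }
}
```

The resulting list can be passed to `new ProcessBuilder(cmd)`, for example `JobContainerCommand.build("genome-job:latest", 2.0, 4096)` for a job limited to two cores and 4 GB of RAM.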
It was essential to build the solution vendor-lock-free, making it possible to run on any public cloud or on-premises infrastructure.
The Plexteq engineering team developed a cloud-native and cloud-agnostic turnkey service for executing genome processing jobs.
The developed solution integrated smoothly into the customer’s existing infrastructure through the REST API, enabling a fully automated genome processing pipeline. The previous solution was semi-automated: it involved manual actions and ran on resource-bound in-lab equipment.
The solution includes several modules:
Cloud scheduler – Responsible for allocating and recycling cloud resources, and ensuring that cloud costs do not exceed the daily/weekly/monthly thresholds
Cloud controller – Ensures that genome processing jobs are executed reliably and in the proper order
Node controller – Starts/stops jobs at the request of the cloud controller, allocates system resources for job execution, retries the job if it fails, and controls the execution process
In addition to the REST API, we also developed a rich web interface that allowed our client to observe jobs, system resources, and various system events, and manage execution at runtime.
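To make the ordering and prioritization idea concrete, here is a minimal sketch of a job queue in which jobs from premium clients are dispatched first and jobs of equal priority run in submission order. The class `PriorityJobQueue` and the fields `premium` and `submittedAt` are assumptions made for illustration; the source does not describe how the cloud controller orders jobs internally.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Orders jobs so that premium jobs come first, then by submission time. */
final class PriorityJobQueue {

    /** A queued job; the fields are illustrative, not taken from the real system. */
    static final class Job {
        final String id;
        final boolean premium;
        final long submittedAt;

        Job(String id, boolean premium, long submittedAt) {
            this.id = id;
            this.premium = premium;
            this.submittedAt = submittedAt;
        }
    }

    // Premium jobs sort before non-premium ones; ties are broken by submission time.
    private static final Comparator<Job> ORDER = (a, b) -> {
        if (a.premium != b.premium) {
            return a.premium ? -1 : 1;
        }
        return Long.compare(a.submittedAt, b.submittedAt);
    };

    private final PriorityQueue<Job> queue = new PriorityQueue<>(ORDER);

    void submit(Job job) {
        queue.add(job);
    }

    /** Returns the next job to dispatch, or null if the queue is empty. */
    Job next() {
        return queue.poll();
    }
}
```

A real controller would also need to handle dependencies between pipeline stages and concurrent access, both of which are left out here.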
Cloud-agnostic service with broad public cloud support
Cost-based predictor and workload optimizer
Industries: Biotech, Healthcare
Expertise: Big Data, Cloud Services
Team size: 8 engineers
Cooperation: 2016 – 2020
Java, Docker, LXC, Bash, Tomcat, PostgreSQL, Azure, Amazon, Google Cloud
Overall, the developed solution allowed our customer to:
Execute jobs in public clouds and on-premises environments
Schedule and dispatch jobs through a REST API and manually via a web interface
Manage job execution order for complex processing pipelines
Manage job priorities to process tasks from paying customers faster
Manage and monitor cloud resource utilization
Plan and manage cloud expenses in a predictable way
Adjust computing resources for every job (CPU cores and RAM)
Report errors if processing fails
Automatically recover from soft failures and restart processing without manual intervention
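Automatic recovery from soft failures can be sketched as a bounded retry loop: recoverable errors trigger another attempt, while any other error is reported immediately. The names `RetryingRunner` and `SoftFailureException` are hypothetical, and the split between soft and hard failures is an assumption for the sake of the example.

```java
import java.util.function.Supplier;

/** Re-runs a job when it fails with a recoverable ("soft") error. */
final class RetryingRunner {

    /** Hypothetical marker for failures that are worth retrying. */
    static final class SoftFailureException extends RuntimeException {
        SoftFailureException(String message) {
            super(message);
        }
    }

    private RetryingRunner() {}

    /**
     * Runs the job up to maxAttempts times. Soft failures cause a retry;
     * any other exception propagates to the caller on the first occurrence.
     */
    static <T> T run(Supplier<T> job, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SoftFailureException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return job.get();
            } catch (SoftFailureException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed: report the last soft failure
    }
}
```

A production version would typically add a delay between attempts and record each failure so that it appears in the error reports.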
The developed solution also enabled HIPAA compliance for the most business-critical component.
It's difficult to overstate the importance of speed in business. With the pace at which society progresses, companies have to do whatever it takes to stay relevant.
The platform Plexteq developed unlocks the power of big data in genomics and bioinformatics. Our product handles thousands of concurrent genome processing pipelines, delivering results exceptionally fast to end users and research labs.
Key results achieved are:
High-speed fully automated genome processing pipelines
Major user experience improvement – users started getting results around 10 times faster
A cost control and alerting system that monitors cloud computing expenses in real time
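The cost control idea, keeping spending within daily, weekly, and monthly thresholds and raising an alert when a threshold is close, could look roughly like the sketch below. `BudgetGuard`, its method names, and the use of integer cents are assumptions for illustration; resetting the counters at the end of each period is deliberately left out.

```java
import java.util.EnumMap;
import java.util.Map;

/** Tracks cloud spending against daily, weekly, and monthly limits (amounts in cents). */
final class BudgetGuard {

    enum Period { DAILY, WEEKLY, MONTHLY }

    private final Map<Period, Long> limitCents = new EnumMap<>(Period.class);
    private final Map<Period, Long> spentCents = new EnumMap<>(Period.class);

    BudgetGuard(long dailyLimit, long weeklyLimit, long monthlyLimit) {
        limitCents.put(Period.DAILY, dailyLimit);
        limitCents.put(Period.WEEKLY, weeklyLimit);
        limitCents.put(Period.MONTHLY, monthlyLimit);
        for (Period p : Period.values()) {
            spentCents.put(p, 0L);
        }
    }

    /** True if spending costCents more keeps every period within its limit. */
    boolean canSpend(long costCents) {
        for (Period p : Period.values()) {
            if (spentCents.get(p) + costCents > limitCents.get(p)) {
                return false;
            }
        }
        return true;
    }

    /** Records an expense against all periods. */
    void record(long costCents) {
        for (Period p : Period.values()) {
            spentCents.put(p, spentCents.get(p) + costCents);
        }
    }

    /** True if spending in the period has reached the given fraction of its limit. */
    boolean shouldAlert(Period period, double fraction) {
        return spentCents.get(period) >= fraction * limitCents.get(period);
    }
}
```

A scheduler could call `canSpend` before allocating a new cloud node and `shouldAlert` after each `record` to decide whether to notify an operator.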
With this solution, Plexteq is ready to address our clients’ needs to receive value from big data related to bioinformatics and genomics. Our team of professional big data engineers is in a strong position to implement genome processing platforms with custom functionality tailored for specific business needs.