2022 Blog, Blog, Featured

Sequencing of Bacterial Genomes with Research Gateway in Minutes for Revolutionizing Food Microbiology

April, 2022

In recent times, Next Generation Sequencing (NGS) has transformed from being solely a research tool to being routinely applied in many fields, including diagnostics, disease outbreak investigations, antimicrobial resistance, forensics, and food authenticity. Cloud and modern open-source tools are helping the field advance at a rapid pace, with continuous improvements in quality and reductions in cost, and this is having a major influence on food microbiology. Public health labs and food regulatory agencies globally are embracing Whole Genome Sequencing (WGS) as a revolutionary new method. In this blog, we introduce a common use case, Bacterial Genome Analysis in the cloud using our Research Gateway, and show how to run powerful tools like Bactopia, a flexible pipeline for complete analysis of bacterial genomes, in a few minutes.

What is Bactopia?
Sequencing of bacterial genomes is gathering momentum for greater adoption. Bactopia, developed by Robert A. Petit III, is a series of pipelines built using the Nextflow workflow software to provide efficient comparative genomic analyses for bacterial species or genera. This pipeline has more advanced features than many others in a similar space.

The image below shows the High Level Components of Bactopia Pipeline.

What Makes Bactopia More Powerful Compared to Other Similar Solutions?
The following data shared by the authors of this pipeline highlights the key strengths.

Setting up a secure environment and getting access to data, big compute, and analytics tools usually takes researchers significant effort. With Research Gateway built on AWS, we make it extremely simple to get started.

An Introduction to Running Bactopia on AWS Cloud
Bactopia is a software pipeline for the complete analysis of bacterial genomes, based on the Nextflow bioinformatics workflow software. Research Gateway makes it easy to run Nextflow-based pipelines, and we will show how the same can be achieved with Bactopia.

Steps for Running Bactopia Pipeline on AWS Cloud
Step-1: Using the publicly available Bactopia repository on GitHub, a new AWS AMI is created by installing the Bactopia software on the Nextflow Advanced product available in the Research Gateway standard catalog. This step is needed because Bactopia integrates a large number of specialized tools and embeds Nextflow internally for its execution. Once the new Bactopia AMI is ready, it is added to AWS Service Catalog and imported into Research Gateway for use by researchers. The product is available in the standard products category and can be launched with 1-Click, as shown below.

Step-2: The Bactopia product is ordered using a simple screen, as shown above. In about 10 minutes, the setup with all the tools, Nextflow, and Nextflow Tower is provisioned and ready to be used. The user can log in to the Bactopia server using the SSH key pair, available from within the Portal UI through the “SSH/RDP Connect” action, as shown below.

Step-3: Copy the samples to be processed to the Bactopia server and start the workflow as per the available documentation. In our case, we tried a small set of sample data sets, and it took 15 minutes to run the pipeline and view outputs in the console window.

Step-4: While the pipeline is being executed, the user can view job details and key metrics through Nextflow Tower from within Research Gateway by selecting the “Monitor Pipeline” action. The complexity of the different tools integrated on the platform is invisible to the user, making it a seamless experience.

Step-5: The outputs generated by the Bactopia pipeline can be viewed from within the Portal using the “View Outputs” action, which lets users browse the outputs in a simple browser and open them with specialized tools like Integrative Genomics Viewer (IGV) or MultiQC reports.
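
Before copying data in Step-3, it can help to check that every sample has both of its read files. The sketch below is only an illustration and is not part of Bactopia or Research Gateway. It assumes paired-end files named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, and the helper name pair_fastqs is hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming convention (not from Bactopia): <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_fastqs(filenames):
    """Group FASTQ file names by sample.

    Returns (pairs, unpaired): pairs maps sample -> (R1 file, R2 file),
    unpaired lists samples for which only one read file was found.
    """
    reads = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match:  # files that do not follow the convention are ignored
            reads[match.group("sample")][match.group("read")] = name

    pairs, unpaired = {}, []
    for sample, found in sorted(reads.items()):
        if "1" in found and "2" in found:
            pairs[sample] = (found["1"], found["2"])
        else:
            unpaired.append(sample)
    return pairs, unpaired


pairs, unpaired = pair_fastqs(
    ["a_R1.fastq.gz", "a_R2.fastq.gz", "b_R1.fastq.gz", "notes.txt"]
)
```

Here sample "a" is reported as a complete pair and sample "b" as unpaired, so incomplete samples can be fixed before any compute time is spent.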

All the products used in Research Gateway are automatically tagged and tracked for cost purposes. Total consumption can be easily verified by project, researcher, and product type, providing a powerful cost management and budget tracking tool.
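
To make the idea of tag-based cost tracking concrete, here is a minimal sketch of rolling up tagged cost records by project, researcher, or product. The record layout, field names, and amounts are invented for illustration; this is not Research Gateway's actual data model or API.

```python
from collections import defaultdict


def total_cost_by(records, key):
    """Sum the cost of all records that share the same value for `key`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return dict(totals)


# Hypothetical tagged cost records (all values are made up for illustration).
records = [
    {"project": "food-safety", "researcher": "alice", "product": "Bactopia", "cost_usd": 2.50},
    {"project": "food-safety", "researcher": "bob", "product": "RStudio", "cost_usd": 1.25},
    {"project": "training", "researcher": "alice", "product": "RStudio", "cost_usd": 0.75},
]

by_project = total_cost_by(records, "project")
by_researcher = total_cost_by(records, "researcher")
by_product = total_cost_by(records, "product")
```

Because every resource carries the same set of tags, one list of records can answer all three questions (by project, by researcher, by product type) without any extra bookkeeping.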

Summary
As the adoption of Genomics grows and new use cases emerge for leveraging the power of this technology, food safety is a growing area of focus, with a need for Bacterial Genome analysis using advanced pipelines that are widely available in the open source community. To help researchers use such powerful tools with speed on the cloud and focus on science without getting into the complexity of infrastructure, networks, and security, we have demonstrated in this blog how to use Research Gateway to run your first pipeline in less than 60 minutes.

To know more about how you can start your Bacterial Genome analysis pipelines on the AWS Cloud in less than 60 minutes using our solution at https://research.rlcatalyst.com, feel free to contact marketing@relevancelab.com

References
An introduction to running Bactopia on Amazon Web Services (May 2021)
Using AWS Batch to process 67,000 genomes with Bactopia (December 2020)
Accelerating Genomics and High Performance Computing on AWS with Relevance Lab Research Gateway Solution

Jumpstart Virtual Cloud Labs – Data Analytics Cloud Lab Setup in Minutes with Self-Service Access, Security, and Cost Controls

By RL Admin on March, 2022

The Research Gateway SaaS solution from Relevance Lab provides a next-generation cloud-based platform for collaborative scientific research on AWS with access to research tools, data sets, processing pipelines, and analytics workbenches in a frictionless manner. It takes less than 30 minutes to launch a “MyResearchCloud” working environment for Principal Investigators and Researchers with security, scalability, and cost governance. The Software as a Service (SaaS) model is a preferable option for consuming functionality, but in the area of scientific research, it is equally critical to have tight control over data security, privacy, and regulatory compliance.

One of the growing needs from customers is to use the solution for their online training needs and specialized use cases on Bioinformatics courses. With the pandemic, there is tremendous new interest among students in pursuing life sciences courses and specializing in Bioinformatics streams. At the same time, education institutions are struggling to move their internal Training Labs infrastructure from data centers to the cloud. As an AWS specialized partner for Higher Education, we are working with a number of universities to understand their needs better and provide solutions that address them in an easy and cost-effective manner.

The Top-5 use cases shared by customers to set up their Virtual Cloud Labs for courses like Bioinformatics are the following:

Enterprise Needs: The ability to move easily from data center based physical labs to cloud-based Virtual Labs using their Corporate Cloud accounts, without compromising on security or tight cost controls, and with a self-service portal for Instructors and Students. Enterprise-grade controls are needed on Budget, Students/Instructors Access, Data Security, and Approved Products Catalog.
Business Needs: The setup of a new Virtual Training Lab should support the key learning and research needs of the students.

Programs that give students lab access based on calendar programs for the duration of the full semester.
Longer-term projects and programs with lab access based on research grants and associated budget/time constraints.

IT Department Needs: University Corporate IT should be able to allow specific departments (like Bioinformatics) to have their own Programs and Projects with self-service, without compromising on Enterprise Security and Compliance needs.
Curriculum Department Needs: Department Heads (like Bioinformatics) and Instructors should be able to define the learning curriculum and associated training programs with access to Classroom and Research Labs. Departments also need tight control over budgets and student access management.
Student Needs: The ability for students to access cloud-based Training Labs in a very easy and simple manner, without requiring deep cloud knowledge, along with pre-built solutions for basic needs covering Analytics Tools like RStudio/Jupyter, access to secure data repositories, open-source tools/containers, and a collaboration portal.

The following picture describes the basic organization and roles setup in a university.

To balance the needs of speed with compliance, we have designed a hybrid model that allows Universities to “Bring your own License” while leveraging the benefits of SaaS. Our solution provides a “Gateway Model” with a Hub-n-Spoke design, where we provide and operate the “Hub” while enabling universities and their departments to connect their own AWS Research accounts as a “Spoke” and get started within 30 minutes with full access to a complete Classroom Toolkit. A sample of out-of-the-box Bioinformatics Lab tools available as a standard catalog is shown below.

Professors can add more tools to the standard catalog by importing their own AMIs using AWS Service Catalog. It is also very simple to create new course material and support additional tools using the base building blocks provided out-of-the-box.

Currently, it is not easy for universities, their IT staff, professors, students, and research groups to leverage the cloud for their scientific research. On-premise data centers come with constraints, and while these institutions have access to Cloud accounts, converting a basic account into an environment with a secure network, secure access, the ability to create and publish a product/tools catalog, ingress and egress of data, sharing of analysis, and tight budget control is a non-trivial task that diverts attention away from education to infrastructure.

Based on our discussions with stakeholders, it was clear that users want something that is as easy to consume as other consumer-oriented activities like e-shopping, consumer banking, etc. This led to the simplified process of creating a “My-Bioinformatics-Cloud-Lab” with the following basic steps:

1. A university can decide to sign up with Research Gateway (SaaS) so that its different departments can use the software for online training and research needs. For such university-level adoption, we recommend the enterprise version of the software (hosted by us or by the university itself), used by different departments (called Organizations or Business Units).
2. A simpler way is for a particular department to use our hosted version of Research Gateway and create a tenant, with no overhead of maintaining a university-specific deployment.
3. A Head of Department (HOD) can sign up to create a new Tenant on Research Gateway and configure their own AWS Billing account to create Projects. Each Project can then invite other professors to be part of the online Training Labs. Projects can be aligned with semester-based classroom lab needs or can be part of ongoing research projects. Each project has a budget assigned along with associated professors and students, who have access to the project. The figure below shows typical department projects inside the portal.

4. Once the professor selects the project they can see standard “available products” in the Project. This project is used as a basic setup for a Training Lab. The figure below shows the sample screen for the available set of tools Professors can access by default. They can also add new products to the Lab Catalog.

For every Project (Lab), shared infrastructure is made available by default in the form of Project Storage, where curriculum-related data and information can be stored and made available to all students. Also, the necessary security aspects such as SSL connection, VPC, and IAM roles are set up by default to make sure the Cloud Training Lab has a well-architected design.

5. A professor can control basic parameters for the Lab, such as adding/deleting users and managing budgets, and can take actions like “Pausing” a Project (no new products can be created, while existing ones can still be used) or “Stopping” the project (all running machines are force-stopped and no new ones can be created; however, data on the storage remains accessible to students). The figure below shows how to manage project-level users and budget controls.

6. A professor can track the consumption of the lab resources by all users including professors and students as shown in the figure below.

7. Once a student logs into the project and accesses the lab resources, they can create their own workspaces like RStudio and interact with them from within the Portal. Once they are done with their work, they can stop the machine and log out to ensure no costs are incurred while the systems are not being used. When a researcher or student logs in, they can interact with active products and project storage as shown in the figure below.

8. The students can interact with their tools like RStudio from within the portal and connect to the same in a secure manner with a single click as shown in the figure below.

9. Clicking the “Open Link” action opens the familiar RStudio environment, where students can log in and learn as per their curriculum needs. The figure below shows the standard RStudio environment.
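
The “Pausing” and “Stopping” actions described in step 5 can be read as a small set of rules about what is still allowed in each project state. The sketch below encodes those rules as we read them from the description; the class and function names are hypothetical, and the behaviour of an active project (everything allowed) is our assumption rather than something stated above.

```python
from enum import Enum


class ProjectState(Enum):
    ACTIVE = "active"    # assumed: normal operation, everything allowed
    PAUSED = "paused"    # no new products; existing ones can still be used
    STOPPED = "stopped"  # running machines force-stopped; storage still readable


def allowed_actions(state):
    """Return which actions a lab user can take in the given project state."""
    return {
        "create_product": state is ProjectState.ACTIVE,
        "use_running_product": state in (ProjectState.ACTIVE, ProjectState.PAUSED),
        "read_project_storage": True,  # data stays accessible in every state
    }


paused = allowed_actions(ProjectState.PAUSED)
stopped = allowed_actions(ProjectState.STOPPED)
```

In this reading, pausing only freezes new spend, while stopping also shuts down what is running. In both cases students keep access to the data in project storage.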

Summary
The new solution from Relevance Lab makes Scientific Research and Training in Cloud very easy for use cases like Bioinformatics. It provides flexibility, cost management, and secure collaborations to truly unlock the potential of the Cloud. For Higher Education Universities, this provides a fully functional Training Lab accessible by professors and students in less than 30 minutes.

If this seems exciting and you would like to know more or try this out, do write to us at marketing@relevancelab.com

References
University in a Box – Mission with Speed
Leveraging AWS HPC for Accelerating Scientific Research on Cloud
Enabling Frictionless Scientific Research in the Cloud with a 30 Minutes Countdown Now!

Complex Genomics Analysis Pipelines made Simple with NextFlow & Research Gateway integrated with Cost Tracking and Security

By RL Admin on March, 2022

As a researcher, do you want to get started in minutes with any complex genomics pipeline on large data sets, without spending hours setting up the environment or worrying about the availability and storage of large data sets, the security of your cloud infrastructure, and, most of all, unknown expenses? RLCatalyst makes your life simpler. In this blog, we cover how easy it is to run publicly available Genomics pipelines from nf-co.re using Nextflow on your AWS Cloud environment.

There are a number of open-source tools available for researchers, driving re-use. However, what Research Institutions and Genomics companies are looking for is the right balance across three key dimensions before adopting the cloud at scale for internal use:

Cost and Budget Governance: Strong focus on Cost Tracking of Cloud resources to track, analyze, control, and optimize budget spends.
Research Data & Tools Easy Collaboration: Principal Investigators and researchers need to focus on data management, governance, and privacy along with analysis and collaboration in real-time without worrying about Cloud complexity.
Security and Compliance: Research requires a strong focus on security and compliance covering Identity management, data privacy, audit trails, encryption, and access management.

To make sure the above requirements do not keep researchers from focusing on science due to infrastructure complexity, Research Gateway provides a reliable solution by automating cost and budget tracking with safeguards and providing a simple self-service model for collaboration. We will demonstrate in this blog how researchers can use a vast set of publicly available tools, pipelines, and data easily on this platform with tight budget controls. Here is a quick video of the ease with which researchers can get started in a frictionless manner.

nf-co.re is a community effort to collect a curated set of analysis pipelines built using Nextflow. A key aspect of these pipelines is that they adhere to strict guidelines that ensure they can be reused extensively. These pipelines have the following advantages:

Cloud-Ready – Pipelines are tested on AWS after every release. You can even browse results live on the website and use outputs for your own benchmarking.
Portable and reproducible – Pipelines follow best practices to ensure maximum portability and reproducibility. The large community makes the pipelines exceptionally well tested and easy to run.
Packaged software – Pipeline dependencies are automatically downloaded and handled using Docker, Singularity, Conda, or others. No need for any software installations.
Stable releases – nf-core pipelines use GitHub releases to tag stable versions of the code and software, making pipeline runs totally reproducible.
CI testing – Every time a change is made to the pipeline code, nf-core pipelines use continuous integration testing to ensure that nothing has broken.
Documentation – Extensive documentation covering installation, usage, and description of output files ensures that you won’t be left in the dark.

The table below shows a sample of commonly used pipelines that are supported out of the box in Research Gateway and can be run with a few clicks for important genomic analyses. While publicly available repositories are easily accessible, Research Gateway also allows private repositories and custom pipelines to run with ease.

Pipeline Name	Description	Commonly used for
Sarek	Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling, and annotation) from Whole Genome Sequencing (WGS) / targeted sequencing	Variant Analysis – workflow designed to detect variants on whole genome or targeted sequencing data
RNA-Seq	RNA-Sequencing analysis pipeline using STAR, RSEM, HISAT2, or Salmon with gene/isoform counts and extensive quality control	Common basic analysis for RNA-Sequencing with a reference genome and annotation
Dual RNA-Seq	Analysis of Dual RNA-Seq data – an experimental method for interrogating host-pathogen interactions through simultaneous RNA-Seq	Specifically used for the analysis of Dual RNA-Seq data, interrogating host-pathogen interactions through simultaneous RNA-Seq
Bactopia	Bactopia is a flexible pipeline for complete analysis of bacterial genomes	Bacterial Genomic Analysis with focus on Food Safety
Viralrecon	Assembly and intrahost/low-frequency variant calling for viral samples	Supports metagenomics and amplicon sequencing data derived from the Illumina sequencing platform

*The above samples can be launched in less than 5 minutes and cost less than $5 to run with test data, with productivity gains of 80%.

The figure below shows the building block of this solution on AWS Cloud.

Steps for running nf-core pipeline with Nextflow on AWS Cloud

Steps	Details	Time Taken
1.	Log into RLCatalyst Research Gateway with a Principal Investigator or Researcher profile. Select the project for running Genomics Pipelines and, if this is the first time, create a new Nextflow Advanced Product.	5 min
2.	Select the Input Data location, output data location, pipeline to run (from nf-co.re), and provide parameters (container path, data pattern to use, etc.). Default parameters are already suggested for use of AWS Batch with Spot instances and all other AWS complexities abstracted from end-user for simplicity.	5 min to provision new Nextflow & Nextflow Tower Server on AWS with AWS Batch setup completed with 1-Click
3.	Execute the pipeline (using the UI or by SSH into the head node) on the Nextflow Server. You can run new pipelines, monitor status, and review outputs from within the Portal UI.	Pipelines can take some time to run depending on the size of data and complexity
4.	Monitor live pipelines with the 1-Click launch of Nextflow Tower integrated with the portal. Also, view outputs of the pipeline in outputs S3 bucket from within the Portal. Use specialized tools like MultiQC, IGV, and RStudio for further analysis.	5 min
5.	All costs related to User, Product, and Pipelines are automatically tagged and can be viewed in the Budgets screen to know the Cloud spend for pipeline execution, which includes all resources, including AWS Batch HPC instances dynamically provisioned. Once the pipelines are executed, the existing Nextflow Server can be stopped or terminated to reduce ongoing costs.	5 min
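
For readers who run the pipeline over SSH on the head node (step 3), the sketch below shows roughly what a Nextflow invocation of an nf-core pipeline looks like. It only assembles and prints the command. The -r and -profile options are standard Nextflow options and --outdir is a common nf-core parameter, but the pipeline, release tag, profile names, and output directory are example values, not the ones Research Gateway uses. The helper name nfcore_command is hypothetical.

```python
import shlex


def nfcore_command(pipeline, revision, outdir, profiles=("test", "docker")):
    """Build a `nextflow run` command line for an nf-core pipeline.

    revision pins a tagged release so that reruns use the same code;
    profiles selects configuration presets (example values only).
    """
    return [
        "nextflow", "run", f"nf-core/{pipeline}",
        "-r", revision,
        "-profile", ",".join(profiles),
        "--outdir", outdir,
    ]


# Example values only: replace the tag with one from the pipeline's releases page.
cmd = nfcore_command("rnaseq", "3.6", "results")
print(shlex.join(cmd))
```

Pinning the release with -r ties in with the "Stable releases" advantage listed earlier: the same tag should give the same pipeline code on every run.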

The figure below shows the Nextflow Architecture on AWS.

Summary
The nf-co.re community is constantly striving to make Genomics Research in the Cloud simpler. While these pipelines are easily available, running them on AWS Cloud with proper cost tracking, collaboration, data management, and an integrated workbench was missing, a gap that is now addressed by Research Gateway. Relevance Lab, in partnership with AWS, has addressed this need with their Genomics Cloud solution to make scientific research frictionless.

To know more about how you can start your Nextflow nf-co.re pipelines on the AWS Cloud in 30 minutes using our solution at https://research.rlcatalyst.com, feel free to contact marketing@relevancelab.com

References
Enabling Researchers with Next-Generation Sequencing (NGS) Leveraging Nextflow and AWS
Pipelining GATK with WDL and Cromwell on AWS Cloud
Genomics Cloud on AWS with RLCatalyst Research Gateway
Health Informatics and Genomics on AWS with RLCatalyst Research Gateway
Accelerating Genomics and High Performance Computing on AWS with Relevance Lab Research Gateway Solution

2022 Blog, SWB Blog, Blog, Featured

Helping Customers with Onboarding, Customization and Ongoing Support for Scientific Research Leveraging Open-Source Solutions

By RL Admin on March, 2022

Relevance Lab has launched its professional services for Service Workbench on AWS (SWB), available to customers through AWS Marketplace. SWB is a cloud-based open-source solution that caters to the needs of the scientific research community by empowering both researchers and research IT teams.

Relevance Lab is a preferred partner for SWB to help customers adopt this open-source solution seamlessly. We have deep expertise and can help in assessment, planning, deployment, training, customization and ongoing managed services support in a cost effective manner.

Highlights of Professional Services Offering

Service Workbench on AWS, an open-source solution, is fully supported with deep competence across the Plan-Build-Run lifecycle
Provide assessment, planning, deployment, training, customization and ongoing managed services support
Offer cost-effective and flexible engagement models

With Relevance Lab’s professional services for SWB, IT teams are able to deliver secure, repeatable, and federated access control to data, tooling, and compute power to researchers, driving frictionless scientific research on the cloud.

Key Offerings

Assessment, Implementation and Training for new and existing setup
Advanced Setup & Premium Support including underlying infrastructure with special needs on Security, Compliance, Data Protection and Scalability
Ongoing Managed Services & Support including Upgrades, Monitoring and Incident Management
SWB Code and new feature customization, enhancement services for custom catalog like RStudio on ALB

What Does It Mean for the Scientific Research Community?
Relevance Lab’s Professional Services Offering for Service Workbench on AWS is a solution that enables IT teams to provide secure, repeatable, and federated control of access to data, tooling, and compute power that researchers need. With Service Workbench, researchers no longer have to worry about navigating cloud infrastructure. They can focus on achieving research missions and completing essential work in minutes, not months, in configured research environments.

Frequently Asked Questions

Question-1  How to get started using SWB and RStudio with ALB?
Answer:  We have a dedicated landing page, sign-up page and support model

Question-2  What is a typical customer end-to-end journey?
Answer:  Most customers look for the following support for the adoption lifecycle.

One time on-boarding
Product customization services
On-Going managed services and support
T&M services for anything additional

Question-3  How long does onboarding take, and what does it cost?
Answer:  A standard onboarding for a new customer takes about 2 weeks covering initial assessment, installation, configurations, training, and basic functionality demonstration for a new setup. It costs about US $10,000.

Question-4  What sort of support is available post onboarding?
Answer:  Following are the common support activities requested:

L0 – Monitoring and Diagnostics
L1 – Technical Queries on how to use the product effectively
L2 – Ongoing upgrades, troubleshooting, configurations
L3 – Customization, enhancements (typically for less than 40-hour changes per request)
Project Engagement – for typically 40+ hours of enhancements/customization work

Question-5 What is the engagement model for ongoing support or customizations?
Answer: Two models of support are offered – Basic and Premium. In case of customizations, both models of project-based and Time & Material engagement are possible.

Looking Ahead
SWB is available as an open-source solution and provides useful functionality to enable a self-service portal for research customers. However, without a dedicated partner to support them through the complete lifecycle, adoption can be a daunting exercise for customers and an overhead for their internal IT teams. Based on the feedback from early adopters and in partnership with AWS, we are happy to launch specialized professional services on AWS Marketplace to make adoption by customers frictionless. Keeping the open source nature in mind, the services are optimized to be cost-effective and flexible, with a goal to make scientific research in the cloud faster, cheaper, and better.

To learn more about Relevance Lab’s professional services for Service Workbench, feel free to write to marketing@relevancelab.com

References
Relevance Lab Open-Source Collaboration with Service Workbench on AWS
Service Workbench Template on Github

Pipelining GATK with WDL and Cromwell on AWS Cloud

By RL Admin on February, 2022

Developed in the Data Sciences Platform at the Broad Institute, the Genome Analysis Toolkit (GATK) offers a wide variety of tools with a primary focus on variant discovery and genotyping. Relevance Lab is pleased to offer researchers the ability to run their GATK pipelines on AWS, which was missing so far, using our Genomics Cloud solution and a 1-click model.

GATK is making scientific research simpler for Genomics by providing best practices workflows and Docker containers. The workflows are written in Workflow Description Language (WDL), a user-friendly scripting language maintained by the OpenWDL community. Cromwell is an open-source workflow execution engine that supports WDL as well as CWL, the Common Workflow Language, and can be run on a variety of different platforms, both local and cloud-based. RLCatalyst Research Gateway has added support for the Cromwell engine, which enables researchers to run any popular workflow on AWS seamlessly. Some of the popular workflows that are available for a quick start are the following:

The figure below shows the building block of this solution on AWS Cloud.

Steps for running GATK with WDL and Cromwell on AWS Cloud

Steps	Details	Time Taken
1.	Log into RLCatalyst Research Gateway with a Principal Investigator or Researcher profile. Select the project for running Genomics Pipelines and, if this is the first time, create a new Cromwell Advanced Product.	5 min
2.	Select the Input Data location, output data location, pipeline to run (from GATK), and provide parameters (input.json). Default parameters are already suggested for the use of AWS Batch with Spot instances and all other AWS complexities, abstracted from the end-user, for simplicity.	5 min to provision new Cromwell Server on AWS with AWS Batch setup completed with 1-Click
3.	Execute the pipeline (using the UI or by SSH into the head node) on the Cromwell Server. You can run new pipelines, monitor status, and review outputs from within the Portal UI.	Pipelines can take some time to run depending on the size of data and complexity
4.	View outputs of the Pipeline in Outputs S3 bucket from within the Portal. Use specialized tools like MultiQC, Integrative Genomics Viewer (IGV), and RStudio for further analysis.	5 min
5.	All costs related to User, Product, and Pipelines are automatically tagged and can be viewed in the budgets screen to know the cloud spend for pipeline execution that consists of all resources, including AWS Batch HPC instances dynamically provisioned. Once the pipelines are executed, the existing Cromwell Server can be stopped or terminated to reduce ongoing costs.	5 min
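
Step 2 above mentions providing parameters through an input.json file. As a rough illustration, the sketch below builds such a file's contents using the WDL convention of prefixing each input with the workflow name, and assembles a Cromwell command line in its run mode. The workflow name, input names, S3 paths, and file names are all hypothetical placeholders, and the exact way Research Gateway invokes Cromwell is not shown here.

```python
import json


def build_inputs(workflow_name, values):
    """Prefix each input with the workflow name, as WDL inputs files expect."""
    return {f"{workflow_name}.{key}": value for key, value in values.items()}


# Hypothetical workflow and inputs, for illustration only.
inputs = build_inputs(
    "ExampleWorkflow",
    {
        "input_bam": "s3://example-bucket/sample.bam",
        "ref_fasta": "s3://example-bucket/reference.fasta",
    },
)
inputs_json = json.dumps(inputs, indent=2, sort_keys=True)

# Cromwell's run mode takes a workflow file and an inputs file.
# The jar and file names below are placeholders.
cromwell_cmd = [
    "java", "-jar", "cromwell.jar",
    "run", "workflow.wdl",
    "--inputs", "inputs.json",
]
```

Keeping the inputs in a JSON file separate from the WDL means the same workflow can be rerun on new samples by changing only the data locations.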

The figure below shows the ability to select Cromwell Advanced to provision and run any pipeline.

The following picture shows the architecture of Cromwell on AWS.

Summary
The GATK community is constantly striving to make Genomics Research in the cloud simpler. So far, support for AWS Cloud was still missing and was a key ask from multiple online research communities. Relevance Lab, in partnership with AWS, has addressed this need with their Genomics Cloud solution to make scientific research frictionless.

To know more about how you can start your GATK pipelines with WDL and Cromwell on the AWS Cloud in just 30 minutes using our solution at https://research.rlcatalyst.com, feel free to write to marketing@relevancelab.com

References
Accelerating Analytics for the Future of Genomics
Cromwell on AWS
Leveraging AWS HPC for Accelerating Scientific Research on Cloud
Cromwell Documentation
Artificial Intelligence, Machine Learning and Genomics
Accelerating Genomics and High Performance Computing on AWS with Relevance Lab Research Gateway Solution

Genomics Cloud on AWS with RLCatalyst Research Gateway

By RL Admin on February, 2022

The worldwide pandemic has highlighted the need to advance human health faster and to accelerate new drug discovery for precision medicine leveraging Genomics. We are building a Genomics Cloud on AWS that leverages the convergence of Big Compute, Large Data Sets, AI/ML Analytics engines, and high-performance workflows to make drug discovery more efficient, combining cloud and open source with our products.

Relevance Lab (RL) has been collaborating with AWS Partnership teams over the last year to create Genomics Cloud. This is one of the dominant use cases for scientific research in the cloud, driven by healthcare and life sciences groups exploring ways to make Genomics analysis better, faster, and cheaper so that researchers can focus on science and not complex infrastructure.

RL offers a product, RLCatalyst Research Gateway, that facilitates Scientific Research with easier access to big compute infrastructure, large data sets, powerful analytics tools, a secure research environment, and the ability to drive self-service research with tight cost and budget controls.

The top use cases for AWS Genomics in the Cloud are implemented by this product and provide an out-of-the-box solution, significantly saving cost and effort for customers.

Key Building Blocks for Genomics Cloud Architecture
To make the Genomics Cloud easy to use, the solution provides the following key components so that researchers, scientists, developers, and analysts can run their experiments efficiently without deep expertise in the backend computing capabilities.

Genomics Pipeline Processing Engine
The research community uses popular open-source workflow tools like Nextflow and Cromwell to process large data sets on HPC systems, with these tools managing the orchestration layer.

Nextflow is a bioinformatics workflow manager that enables the development of portable and reproducible workflows. It supports deploying workflows on a variety of execution platforms, including local, HPC schedulers, AWS Batch, Google Cloud Life Sciences, and Kubernetes.

Cromwell is a workflow execution engine that simplifies the orchestration of computing tasks needed for Genomics analysis. Cromwell enables Genomics researchers, scientists, developers, and analysts to efficiently run their experiments without the need for deep expertise in the backend computing capabilities.

Many organizations also use commercial tools like Illumina DRAGEN and NVIDIA Parabricks for similar solutions. These are more optimized for reducing processing time but come at a price.

Open Source Repositories for Common Genomics Workflows
The solution needs to allow researchers to leverage work done by different communities and tools to reuse existing available workflows and containers easily. Researchers can leverage any of the existing pipelines & containers or can also create their own implementations by leveraging existing standards.

GATK4 is a Genome Analysis Toolkit for Variant Discovery in High-Throughput Sequencing Data. Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

BioContainers – A community-driven project to create and manage bioinformatics software containers.

Dockstore – A free and open source platform for sharing reusable and scalable analytical tools and workflows. It is developed by the Cancer Genome Collaboratory and used by the GA4GH.

nf-core Pipelines – A community effort to collect a curated set of analysis pipelines built using Nextflow.

Workflow Description Language (WDL) is a way to specify data processing workflows with a human-readable and -writeable syntax.

AWS Batch for High Performance Computing
AWS has many services that can be used for Genomics. In this solution, the core architecture is built on AWS Batch, a managed service that is itself built on top of other AWS services, such as Amazon EC2 and Amazon Elastic Container Service (ECS). Security is enforced with roles defined in AWS Identity and Access Management (IAM), a service that helps you control who is authenticated (signed in) and authorized (has permissions) to use AWS resources.
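As a rough illustration of how a single pipeline step might be handed to AWS Batch, the sketch below assembles the keyword arguments for a SubmitJob request. The queue name, job definition name, and command are hypothetical placeholders, not values from Research Gateway, and the actual submission call is left as a comment because it needs boto3 and AWS credentials.

```python
from typing import Dict, List


def build_batch_job_request(
    job_name: str,
    job_queue: str,
    job_definition: str,
    command: List[str],
    vcpus: int,
    memory_mib: int,
) -> Dict[str, object]:
    """Assemble keyword arguments for an AWS Batch SubmitJob request."""
    if vcpus < 1 or memory_mib < 1:
        raise ValueError("vcpus and memory_mib must be positive")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": command,
            # Batch expects resource values as strings.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }


request = build_batch_job_request(
    job_name="variant-calling-sample-001",
    job_queue="genomics-spot-queue",        # hypothetical queue name
    job_definition="gatk-haplotypecaller",  # hypothetical job definition
    command=["gatk", "--version"],
    vcpus=4,
    memory_mib=16384,
)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("batch").submit_job(**request)
```

The IAM role attached to the job definition, rather than anything in this request, determines what the container is allowed to access.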

Large Data Sets Storage and Access to Open Data Sets
AWS cloud is leveraged to deal with the needs of large data sets for storage, processing, and analytics using the following key products.

Amazon S3 for high-throughput data ingestion, cost-effective storage options, secure access, and efficient searching

AWS DataSync for secure, automated, and accelerated movement of data between on-premises and AWS storage services

AWS Open Datasets Program for openly available data, with 40+ open Life Sciences data repositories

Outputs Analysis and Monitoring Tools
Genomic data analysis needs access to common tools such as the following, which are integrated into the solution.

MultiQC reports: MultiQC searches a given directory for analysis logs and compiles an HTML report. It’s a general-use tool, perfect for summarising the output from numerous bioinformatics tools.

IGV (Integrative Genomics Viewer) is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data.

RStudio for Genomics: R is one of the most widely used and powerful programming languages in bioinformatics. R especially shines where a variety of statistical tools are required (e.g., RNA-Seq, population Genomics, etc.) and in the generation of publication-quality graphs and figures.

Genomics Data Lake
An AWS Data Lake is used to create a Genomics data lake for tertiary processing. Once secondary analysis generates outputs, typically in Variant Call Format (VCF), that data needs to move into a Genomics Data Lake for tertiary processing. Leveraging standard AWS tools and solution frameworks, a Genomics Data Lake is implemented and integrated with the end-to-end sequencing processing pipeline.

The Variant Call Format specification is used in bioinformatics for storing gene sequence variations, typically in a compressed text file. According to the VCF specification, a VCF file has meta-information lines, a header line, and data lines. Compressed VCF files are indexed for fast data retrieval (random access) of variants from a range of positions.
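The three kinds of lines can be told apart by their prefix: meta-information lines start with `##`, the header line starts with a single `#`, and everything else is a data line. The following is a minimal sketch on a small made-up VCF snippet; it handles uncompressed text only and ignores sample columns, so it is not a replacement for a real VCF library.

```python
from typing import Dict, List, Tuple

# Illustrative VCF content (not real study data); columns are tab-separated.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
##source=illustrative-example
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\trs367896724\tA\tAC\t50\tPASS\tDP=14;AF=0.5
chr1\t10352\t.\tT\tTA\t.\tPASS\tDP=9
"""


def parse_vcf(text: str) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Split VCF text into meta-information lines, header columns, and data records."""
    meta: List[str] = []
    header: List[str] = []
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):
            meta.append(line[2:])                 # meta-information line
        elif line.startswith("#"):
            header = line[1:].split("\t")         # single header line
        else:
            records.append(dict(zip(header, line.split("\t"))))  # data line
    return meta, header, records


meta, header, records = parse_vcf(SAMPLE_VCF)
for record in records:
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```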

VCF files, though popular in bioinformatics, are a mixed file type that includes a metadata header and a more structured table-like body. Converting VCF files into the columnar Parquet format makes them work well in distributed contexts like a Data Lake.
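To show what the conversion involves, the sketch below turns already-parsed VCF data rows into a column-oriented dictionary, expanding selected INFO keys into their own columns. The column names and the choice of INFO keys are assumptions made for illustration; the final Parquet write is shown only as a comment because it depends on pyarrow being installed.

```python
from typing import Dict, List, Optional


def parse_info(info: str) -> Dict[str, str]:
    """Split 'DP=14;AF=0.5' into key/value pairs; bare flags map to 'true'."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        fields[key] = value if value else "true"
    return fields


def to_columns(rows: List[Dict[str, str]], info_keys: List[str]) -> Dict[str, list]:
    """Pivot row-oriented VCF records into one list per column."""
    columns: Dict[str, list] = {"chrom": [], "pos": [], "ref": [], "alt": []}
    for key in info_keys:
        columns["info_" + key.lower()] = []
    for row in rows:
        info = parse_info(row["INFO"])
        columns["chrom"].append(row["CHROM"])
        columns["pos"].append(int(row["POS"]))
        columns["ref"].append(row["REF"])
        columns["alt"].append(row["ALT"])
        for key in info_keys:
            value: Optional[str] = info.get(key)  # None when the key is absent
            columns["info_" + key.lower()].append(value)
    return columns


rows = [
    {"CHROM": "chr1", "POS": "10177", "REF": "A", "ALT": "AC", "INFO": "DP=14;AF=0.5"},
    {"CHROM": "chr1", "POS": "10352", "REF": "T", "ALT": "TA", "INFO": "DP=9"},
]
columns = to_columns(rows, info_keys=["DP", "AF"])

# If pyarrow is available, the columns could then be written as Parquet:
#   import pyarrow as pa
#   import pyarrow.parquet as pq
#   pq.write_table(pa.table(columns), "variants.parquet")
```

Storing each field as its own column is what lets distributed query engines read only the columns a query needs.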

Cost Analysis of Workflows
One of the biggest concerns for users of the Genomics Cloud is control over budget and cost. RLCatalyst Research Gateway provides this by tracking spend across Projects, Researchers, and Workflow runs at a granular level, and it allows spend to be optimized with a mix of Spot Instances and on-demand compute. Built-in guardrails provide appropriate controls and corrective actions. Users can run sequencing workflows in their own AWS accounts, allowing for transparent control and visibility.
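The idea of granular tracking with guardrails can be sketched as follows. This is not Research Gateway's implementation: the record layout, the 80% warning threshold, and the "ok"/"warn"/"stop" states are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (project, researcher, workflow_run, cost_usd): illustrative records only
RunCost = Tuple[str, str, str, float]


def spend_by_project(runs: Iterable[RunCost]) -> Dict[str, float]:
    """Sum the cost of all workflow runs per project."""
    totals: Dict[str, float] = defaultdict(float)
    for project, _researcher, _run, cost in runs:
        totals[project] += cost
    return dict(totals)


def budget_status(spent: float, budget: float, warn_at: float = 0.8) -> str:
    """Return 'ok', 'warn', or 'stop' depending on how much budget is consumed."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spent / budget
    if ratio >= 1.0:
        return "stop"   # guardrail: block new runs
    if ratio >= warn_at:
        return "warn"   # guardrail: notify the project owner
    return "ok"


runs = [
    ("project-a", "alice", "run-1", 12.5),
    ("project-a", "bob", "run-2", 30.0),
    ("project-b", "carol", "run-3", 7.5),
]
budgets = {"project-a": 50.0, "project-b": 100.0}

totals = spend_by_project(runs)
statuses = {name: budget_status(totals[name], budgets[name]) for name in totals}
print(statuses)
```

The same aggregation could be keyed by researcher or by workflow run to get the other views mentioned above.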

Summary
To make large-scale genomic processing in the cloud easier for institutions, principal investigators, and researchers, we provide the fundamental building blocks for Genomics Cloud. The integrated product covers large data sets access, support for popular pipeline engines, access to open source pipelines & containers, AWS HPC environments, analytics tools, and cost tracking that takes away the pains of managing infrastructure, data, security, and costs to enable researchers to focus on science.

To learn more about how you can start your Genomics Cloud on AWS in 30 minutes using our solution at https://research.rlcatalyst.com, contact marketing@relevancelab.com.

2022 Blog, Blog, Featured

Architecting a Cloud-based Application with AWS Best Practices

By RL Admin on January, 2022

Software architecture provides a high-level overview of what a software system looks like. At the very minimum, it shows the various logical pieces of the overall solution and the interaction between those pieces. (See C4 Model for architecture diagramming). The software architecture is like a map of the terrain for anybody who must deal with the system. Contrary to what many might think, software architecture is important even for non-engineering functions like sales, as many customers like to review the architecture to see how well it fits within their enterprise and whether it could introduce future issues by its adoption.

Goals of the Architecture
It is important to determine the goals for the system when deciding on the architecture. This should include both short-term and long-term goals.

Some of our important goals for RLCatalyst Research Gateway are:
1. Ease of Use
The basic question in our mind is always “How would customers like to use this system?”. Our product is targeted to researchers and academics who want to use the scalability and elasticity of the AWS cloud for ad-hoc and high-performance computing needs. These users are not experts at using the AWS console. So, we made things extremely simple for the user. Researchers can order products with a single click, and the portal sets up their resources without the user needing to understand any of the underlying complexities. Users can also interact with the products through the portal, eliminating the need to set up anything outside the portal (though they always have that option).

We also kept in mind the administrators of the system, for whom this might be just one of many systems that they must manage. Thus, we made it easy for the administrator to add AWS accounts, create Organizational Units, and integrate Identity Providers. Our goal was for administrators to get the system up and running in less than 30 minutes.

2. Scalability, performance, and reliability
We followed the best practices recommended by AWS, and where possible, used standardized architecture models so that users would find it easy as well as familiar. For example, we deploy our system into a VPC with public and private subnets. The subnets are spread across multiple Availability Zones to guard against the possibility of one availability zone going down. The computing instances are deployed in the private subnet to prevent unauthorized access. We also use auto-scaling groups for the system to be able to pull in additional compute instances when the load is higher.
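The subnet layout described here can be planned with a few lines of Python. The sketch below carves a VPC address range into one public and one private subnet per Availability Zone; the CIDR block, the /24 subnet size, and the zone names are illustrative assumptions, not the values used by Research Gateway.

```python
import ipaddress
from typing import Dict, List


def plan_subnets(vpc_cidr: str, zones: List[str], new_prefix: int = 24) -> List[Dict[str, str]]:
    """Assign one public and one private subnet per Availability Zone."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # iterator over consecutive subnets
    plan: List[Dict[str, str]] = []
    for tier in ("public", "private"):
        for zone in zones:
            plan.append({"tier": tier, "zone": zone, "cidr": str(next(blocks))})
    return plan


plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
for subnet in plan:
    print(subnet["tier"], subnet["zone"], subnet["cidr"])
```

Compute instances would be placed in the private subnets, with only load balancers exposed through the public ones.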

3. Time to market
One of our main goals was to be able to bring the product to market quickly and put it in front of the customers to gain early and valuable feedback. Developing the product as a partner of AWS was a great help since we were able to use many AWS services for some of the common application needs without spending time developing our own components for well-known use cases. For example, RLCatalyst Research Gateway does its user management via Amazon Cognito, which provides the facility to create users, roles, and groups as well as the ability to interface with other Identity Provider systems.

Similarly, we use Amazon DocumentDB (with MongoDB compatibility) as our database. This allows developers to use a local MongoDB instance, while QA and production systems use Amazon DocumentDB with highly available multi-AZ clusters and automated backups via AWS Backup and snapshots.
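Because the two databases share an API, switching between them can be as small as choosing a different connection string per environment. The following is a minimal sketch under assumptions: the environment names, host, and credentials are hypothetical, and the query options shown are the ones commonly recommended for Amazon DocumentDB clusters (a real deployment would also point the driver at the Amazon CA bundle).

```python
from urllib.parse import quote_plus


def database_uri(environment: str, host: str = "localhost", user: str = "", password: str = "") -> str:
    """Return a MongoDB-style connection string for the given environment."""
    if environment == "development":
        # Local MongoDB instance on the developer's machine, no TLS or auth.
        return f"mongodb://{host}:27017/"
    # QA and production: Amazon DocumentDB cluster endpoint.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
    options = "tls=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
    return f"mongodb://{credentials}{host}:27017/?{options}"


print(database_uri("development"))
print(database_uri("production", host="docdb.example.internal", user="app", password="p@ss"))
```

Application code that talks to the database through a MongoDB driver stays the same in both cases; only the URI changes.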

4. Cost efficiency
This is one of the key concerns for every administrator. RLCatalyst Research Gateway uses a scalable architecture that not only lets the system scale up when the load is high but also scale down when the load is low to optimize cost. We deploy our solution on EKS clusters and use Amazon DocumentDB clusters, which lets us choose the size and instance type according to cost considerations.

We have also brought in features like the automatic shutdown of resources so that idle compute instances, which are not running any jobs, can shut down after a 15-minute idle time. Additionally, even resources like ALBs are de-provisioned when the last compute instance behind them is de-provisioned.
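The idle-shutdown rule can be expressed as a small decision function. This sketch only illustrates the logic under stated assumptions: the instance IDs and the two input dictionaries are hypothetical, and the product's actual mechanism for detecting idleness is not described here.

```python
from datetime import datetime, timedelta
from typing import Dict, List

IDLE_LIMIT = timedelta(minutes=15)


def instances_to_stop(
    last_job_finished: Dict[str, datetime],
    running_jobs: Dict[str, int],
    now: datetime,
) -> List[str]:
    """Return instances with no running jobs that have been idle for 15 minutes or more."""
    idle: List[str] = []
    for instance_id, finished_at in last_job_finished.items():
        if running_jobs.get(instance_id, 0) > 0:
            continue  # still busy, never stop
        if now - finished_at >= IDLE_LIMIT:
            idle.append(instance_id)
    return sorted(idle)


now = datetime(2022, 1, 10, 12, 0)
last_job_finished = {
    "i-aaa": datetime(2022, 1, 10, 11, 40),  # idle for 20 minutes
    "i-bbb": datetime(2022, 1, 10, 11, 50),  # idle for 10 minutes
    "i-ccc": datetime(2022, 1, 10, 11, 0),   # old timestamp, but a job is running
}
running_jobs = {"i-ccc": 1}

print(instances_to_stop(last_job_finished, running_jobs, now))
```

A scheduler could run this check every few minutes and pass the resulting IDs to whatever API stops the instances.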

We provide a robust cost governance dashboard that gives users insight into their usage and budget consumption.

5. Security
Our target customers are in the research and scientific computing area, where data security is a key concern. We are frequently asked, “Will the system be secure? Can it help me meet regulatory requirements and compliances?”. RLCatalyst Research Gateway architecture is developed with security in mind at each level. The use of SSL certificates, encryption of data at rest, and the ability to initiate action at a distance are some of the architecture considerations.

Map of AWS Services

AWS Service	Purpose	Benefits
Amazon EC2, Auto Scaling	Elastic Compute	Provides easily managed compute resources without the need to manage hardware. Integrates well with Infrastructure as Code (IaC).
Amazon Virtual Private Cloud (VPC)	Networking	Provides isolation of resources and easy management and isolation of traffic.
Application Load Balancer, AWS Certificate Manager	Load balancer, Secure endpoint	Provides a single endpoint that can route traffic to multiple target groups. Integrates with AWS Certificate Manager to provide SSL support.
AWS Cost Explorer, AWS Budgets	Cost and Governance	Provides fine-grained cost and usage data. Sends notifications when budget thresholds are reached.
AWS Service Catalog	Catalog of approved IT Services on AWS	Provides control over what resources can be used in an AWS account.
AWS WAF (Web Application Firewall)	Application Firewall	Helps manage malicious traffic.
Amazon Route 53	DNS (Domain Name System) Services	Provides hosted zones and API access to manage them.
Amazon CloudFront	CDN (Content Delivery Network)	Caches content closest to end-users to reduce latency and improve customer experience.
Amazon Cognito	User Management	Authentication and authorization.
AWS Identity and Access Management (IAM)	Identity Management	Provides support for granular control based on policies and roles.
Amazon DocumentDB	NoSQL database	MongoDB-compatible API.

Validation of the Solution
It is always good to validate your solution with an external review from the experts. AWS offers such an opportunity to all its partners by way of the AWS Foundational Technical Review. The review is valid for two years and is free of cost to partners. Looking at our design through the FTR Lens enabled us to see where our design could get better in terms of using the best practices (especially in the areas of security and cost-efficiency). Once these changes were implemented, we earned the “Reviewed by AWS” badge.

Summary
Relevance Lab developed the RLCatalyst Research Gateway in close partnership with AWS. One of the excellent tools available from AWS for any software architecture team is the AWS Well-Architected Framework with its six pillars of Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. Working within this framework greatly facilitates the development of a robust architecture that serves not only current but also future goals.

To learn more about the RLCatalyst Research Gateway architecture, write to marketing@relevancelab.com.
