2024 Blog, Blog, Featured

With growing complexity across cloud infrastructure, application deployments, and data pipelines, large enterprises have a critical need for effective Site Reliability Engineering (SRE) solutions. Although the need is common, many companies attempt to address it through multiple siloed efforts. These are some of the common problems we have observed:

  1. A data-center-oriented approach to SRE adoption that focuses on separate layers across Storage, Compute, Network, Apps, and Databases. This approach creates silos, with different teams working on different tracks.
  2. Significant focus on reactive models for Observability, with multiple tools and overlapping monitoring coverage across on-prem and cloud systems. This creates alert fatigue, false alarms, long diagnostic times, and slow recovery cycles.
  3. Long planning and analysis cycles to define how to get started on an “SRE Transformation Program”, with multiple groups, approaches, and discovery cycles.
  4. Lack of organizational clarity on who should drive the SRE program across Infra, Apps, DevOps, Data, and Service Delivery groups.
  5. Undefined roadmap for maturity and for how to leverage Cloud, Automation, and AIOps to roll out SRE programs at scale across the enterprise.
  6. Federated models of IT and Business Units with shared responsibility across global operations, and the question of how to balance standardization against self-service flexibility.
  7. Missing information on current problems and faults affecting end users (slow response times, surprise outages, unpredictable performance), and no view of real-time Business Performance Metrics (SLAs).
  8. Lack of mature Critical Incident Management and Incident Intelligence.
  9. Custom approaches to solutions that cannot be built into a common framework and scaled across different units.
  10. Need for Machine Learning Observability, including data collection and alerting, and monitoring of data growth, data drift, and consumption.
  11. Difficulty tracking platform costs across businesses, regions, and projects.

With a growing cloud footprint, these issues have been amplified, along with concerns about cost and security. In the absence of mature SRE models, they slow down digital transformation efforts.

To fix these issues in a prescriptive manner, Relevance Lab has worked with some large customers to evolve a “Platform Centric” model for SRE adoption. This leverages common tools and open-source technologies that can speed up SRE implementation, saving significant time, cost, and effort. Also, with a rapid deployment model, the rollout can be done across a global enterprise using automation-driven templates.

The figure below explains the Command Centre SRE Platform from Relevance Lab.



Building Blocks of SRE Platform

  1. Application Centric Design
    • The first step towards building a mature SRE implementation is an application-centric view aligned to business services. By using platforms like ServiceNow, we can build relationships, or service maps, between infrastructure and application services. This is crucial because it helps identify the root cause during an outage.
    • Once all assets are identified, they are segregated by application type, tagged with their business services, and managed centrally.
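
The service-map idea above can be illustrated with a small dependency graph. This is only a sketch under stated assumptions: the item names and the `DEPENDS_ON` structure are hypothetical, and it does not use the ServiceNow API or data model. It shows how, given a failed infrastructure item, one could walk the map upwards to find the application and business services that may be affected.

```python
from collections import defaultdict, deque

# Hypothetical service map: each entry reads "key depends on every item in the list".
DEPENDS_ON = {
    "online-banking": ["payments-api", "web-frontend"],
    "payments-api": ["orders-db", "app-server-1"],
    "web-frontend": ["app-server-2"],
    "orders-db": ["storage-array-1"],
}


def impacted_services(failed_item, depends_on):
    """Return every item that directly or transitively depends on failed_item."""
    # Invert the map so we can walk from infrastructure up to business services.
    dependents = defaultdict(list)
    for item, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(item)

    seen = set()
    queue = deque([failed_item])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)
```

Calling `impacted_services("storage-array-1", DEPENDS_ON)` walks from the storage array through the database and API up to the business service. During an outage the same map can be read in the other direction, from a degraded business service down to candidate root causes.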

  2. Monitoring
    • The next step is to enable monitoring sensors for all business-critical systems. How the sensors are enabled varies with the type of resource, as described below:
      • Systems Monitoring: This is typically infrastructure and network monitoring. It can be enabled using native cloud services such as AWS CloudWatch and Azure Monitor, or third-party tools such as SolarWinds and Zabbix.
      • Applications or Logs Monitoring: Application monitoring involves both performance monitoring and log monitoring. It can also be achieved using cloud-native tools such as AWS X-Ray and Azure Application Insights, or third-party tools such as AppDynamics, ELK, and Splunk.
      • Jobs Monitoring: For monitoring scheduled jobs, tools like New Relic, Dynatrace, and Control-M are used.

  3. SRE Approach with Event Management
    • Once the monitoring sensors are enabled, they generate a large number of alerts, and most of these are noise, including false alarms and duplicate alerts. Relevance Lab algorithms de-duplicate, aggregate, and correlate these alerts, thereby reducing alert fatigue.
    • Golden Signals: The golden signals, namely latency, traffic, errors, and saturation, are defined and configured at this stage to detect abnormalities. By integrating these with the standard Incident Management and Problem Management processes and ITSM platforms, application stability and reliability mature over time.
    • Observability Dashboards: A single-pane-of-glass view across your environment gives you visibility into your business apps. The Relevance Lab SRE implementation includes the following dashboards out of the box:
      • Infrastructure Dashboard
      • Application Dashboard
      • Program Dashboard (Grafana)
      • Program Dashboard (ServiceNow)
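
To make the alert de-duplication and aggregation step more concrete, here is a minimal sketch. It is not the Relevance Lab algorithm, which the post does not describe. The `Alert` fields, the five-minute window, and the function names are all assumptions chosen for illustration. The idea is simply that repeated alerts for the same resource and check within a time window collapse into one, and the survivors are grouped by resource for review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str      # monitoring tool that raised the alert (hypothetical field)
    resource: str    # affected host or service
    check: str       # what was measured, e.g. "cpu" or "latency"
    timestamp: int   # seconds since epoch


def deduplicate(alerts, window_seconds=300):
    """Keep one alert per (resource, check) pair within a time window."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.resource, alert.check)
        previous = last_kept.get(key)
        # Keep the alert only if nothing was kept for this key inside the window.
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


def aggregate_by_resource(alerts):
    """Group alert checks by resource so related symptoms are reviewed together."""
    groups = {}
    for alert in alerts:
        groups.setdefault(alert.resource, []).append(alert.check)
    return groups
```

In this sketch the same CPU alert reported by two different tools within the window counts as one alert, because the source is deliberately left out of the de-duplication key. Real correlation would typically also use topology or service-map information.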

The figure below shows the SRE Dashboard in detail.



How can new customers benefit from our SRE Platform?
In today’s fast-paced and technology-driven world, organizations need robust and efficient IT operations to stay ahead of the competition. Relevance Lab’s SRE solution provides the necessary tools and frameworks to unlock operational excellence, ensuring high availability, scalability, and reliability of critical business systems. With our SRE solution, organizations can focus on innovation and growth, confident in the knowledge that their IT infrastructure is well-managed and optimized for exceptional performance.

Summary
Relevance Lab is a specialist in SRE implementation and helps organizations achieve reliability and stability with SRE execution. While enterprises can try to build some of these solutions themselves, doing so is time-consuming and error-prone, and needs a specialist partner. We realize that each large enterprise has a different context-culture-constraint model covering organization structures, team skills/maturity, technology, and processes. Hence the right model for any organization has to be created collaboratively, with Relevance Lab acting as an advisor to Plan, Build, and Run the SRE model.

For more details, please feel free to reach out to marketing@relevancelab.com



2024 Blog, Blog, Featured

Our goal at Relevance Lab (RL) is to make scientific research in the cloud ridiculously simple for researchers and principal investigators. Cloud is driving major advancements in both the Healthcare and Higher Education sectors. Research on the cloud is being rapidly adopted by organizations across these sectors, in both commercial and public sector segments, and is improving day-to-day lives through drug discoveries, healthcare breakthroughs, innovation of sustainable solutions, development of smart and safe cities, etc.

Powering these innovations, public cloud provides an infrastructure with more accessible and useful research-specific products that speed time to insights. Customers get more secure and frictionless collaboration capabilities across large datasets. However, setting up and getting started with complex research workloads can be time-consuming. Researchers often look for simple and efficient ways to run their workloads.

RL addresses this issue with Research Gateway, a self-service cloud portal that allows customers to run secure and scalable research on the public clouds without the heavy lifting of setup. In this blog, we explore different use cases in which Research Gateway simplifies researchers' workloads and accelerates their outcomes. We also elaborate on specific use cases from the healthcare and higher education sectors for the adoption of the Research Gateway Software as a Service (SaaS) model.

Who Needs Scientific Research in the Cloud?
The entire scientific community is trying to speed up research for better human lives. While scientists want to focus on “science” and not “infrastructure”, it is not always easy to have a collaborative, secure, self-service, cost-effective, and on-demand research environment. While most customers have traditionally used on-premise infrastructure for research, limited resources are always a key constraint on scaling up. The following are some common challenges we have heard from our customers:

  • We have tremendous growth of data for research and are not able to manage with existing on-premise storage.
  • Our ability to start new research programs despite securing grants is severely limited by a lack of scale with existing setups.
  • We have tried the cloud but, especially with High Performance Computing (HPC) systems, are not confident enough about total spend and budget controls to adopt it.
  • We have ordered additional servers, but for months, we have been waiting for the hardware to be delivered.
  • We can easily try new cloud accounts but bringing together Large Datasets, Big Compute, Analytics Tools, and Orchestration workflows is a complex effort.
  • We have built on-premise systems for research with Slurm, Singularity Containers, Cromwell/Nextflow, and custom pipelines, and do not have the bandwidth to migrate to the cloud with updated tools and architecture.
  • We want to give researchers their own ephemeral research tools and environments with budget controls but do not know how to leverage the cloud.
  • We are scaling up online classrooms and training labs for a large set of students but do not know how to build secure and cost-effective self-service environments like on-premise training labs.
  • We need a data portal for sharing research data across multiple institutions with the right governance and controls on the cloud.
  • We need the ability to run Genomics Secondary Analysis for multiple domains like Bacterial Research and Precision Medicine at scale, with cost-effective per-sample runs, without worrying about tools, infrastructure, software, and ongoing support.

Keeping the above common needs in perspective, Research Gateway is solving the problems for the following key customer segments:

  • Education Universities
  • Healthcare Providers
    • Hospitals and Academic Medical Centers for Genomics Research
  • Drug Discovery Companies
  • Not-for-Profit Companies
    • Primarily across health, education, and policy research
  • Public Sector Companies
    • Looking into Food Safety, National Supercomputing centers, etc.

The primary solutions these customers are seeking from Research Gateway are listed below:

  1. Analytics Workbench with tools like RStudio and SageMaker
  2. Bioinformatics Containers and Tools from the standard catalog, plus bring-your-own tools
  3. Genomics Secondary Analysis in Cloud with 1-Click models using open source orchestration engines like Nextflow and Cromwell, and specialized tools like DRAGEN, Parabricks, and Sentieon
  4. Virtual Training Labs in Cloud
  5. High Performance Computing Infrastructure with specialized tools and large datasets
  6. Research and Collaboration Portal
  7. Training and Learning for Quantum Computing

The figure below shows the customer segments and their top use cases.



How Is Research Gateway Powering Frictionless Outcomes?
Research Gateway allows researchers to conduct just-in-time research with 1-click access to research-specific products, provision pipelines in a few steps, and take control of the budget. This helps in the acceleration of discoveries and enables a modern study environment with projects and virtual classrooms.

Case Study 1: Accelerating Virtual Cloud Labs for the Bioinformatics Department of a Singapore-based Higher Education University
During interaction with the university, the following needs were highlighted to the RL team by the university’s bioinformatics department:

Classroom Needs: Primary use case to enable Student Classrooms and Groups for learning Analytics, Genomics Workloads, and Docker-based tools

Research Needs: Used by a small group of researchers pursuing higher degrees in the bioinformatics space

Addressing the Virtual Classroom and Research Needs with Research Gateway
The SaaS model of Research Gateway is used with a hub-and-spoke architecture that allows customers to configure their own AWS accounts for projects to control users, products, and budgets seamlessly.

The primary solution includes:

  • Professors set up classrooms and assign students for projects based on semester needs
  • Usage of basic tools like RStudio, EC2 with Docker, MySQL, and SageMaker
  • A special request for port forwarding, so that a local RStudio IDE could connect to shared data on the cloud, was also successfully put to use
  • End-of-day automated reports to students and professors on servers that are “still running”, for cost optimization
  • Ability to create multiple projects in a single AWS Account + Region for flexibility
  • Ability to assign and enforce student-level budget controls to avoid overspending
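
The end-of-day report and the student-level budget controls in the list above can be sketched as follows. This is an illustrative sketch only: the `Workspace` record, its fields, and both function names are hypothetical and are not part of Research Gateway or of any AWS API. A real implementation would pull running state and spend from the cloud provider rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Workspace:
    student: str
    product: str        # e.g. "RStudio"
    running: bool
    spent_usd: float
    budget_usd: float


def end_of_day_report(workspaces):
    """Return one report line for each workspace that is still running."""
    return [
        f"{w.student}: {w.product} is still running "
        f"(${w.spent_usd:.2f} of ${w.budget_usd:.2f} used)"
        for w in workspaces
        if w.running
    ]


def over_budget(workspaces):
    """Return the students whose spend has reached or exceeded their budget."""
    return sorted({w.student for w in workspaces if w.spent_usd >= w.budget_usd})
```

The report lines could be emailed to students and professors, while the `over_budget` list could drive an enforcement action such as stopping the affected workspaces. Both of those follow-up steps are outside this sketch.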

Case Study 2: Driving Genomics Processing for Cancer Research of an Australian Academic Medical Center
While the existing research infrastructure is set up on-premise due to security and privacy needs, the team is facing serious challenges with growing data and the influx of new genomics samples to be processed at scale. A team of researchers is taking the lead in evaluating AWS Cloud to solve the issues related to scale and drive faster research in the cloud with built-in security and governance guardrails.

Addressing Genomic Research Cloud Needs with Research Gateway
RL addressed the genomics workload migration needs of the hospital with the Research Gateway SaaS model, using the hub-and-spoke architecture that allows the customer to have exclusive access to their data and research infrastructure by bringing their own AWS account. The software is deployed in the Sydney region, complying with in-country data norms as per governance standards. Users can easily configure AWS accounts for genomics workload projects. They also get 1-click access to genomic research-related products, along with seamless budget tracking and pausing.

The following primary solution patterns were delivered:

  • Migration of existing HPC system using Slurm Workload Manager and Singularity Containers
  • Using Cromwell for Large Scale Genomic Samples Processing
  • Using complex pipelines with a mix of custom and public WDL pipelines like RNA-Seq
  • Large Sample and Reference Datasets
  • AWS Batch HPC leveraged for cost-effective and scalable computing
  • Specific Data and Security needs met with country-level data safeguards & compliance
  • Large set of custom tools and packages

The workload currently operates in an on-premise HPC environment, using Slurm as the orchestrator and Singularity containers. The migration involves converting the Singularity containers to Docker containers so that they can be used with AWS Batch. The pipelines are based on Cromwell, one of the leading workflow orchestrators, available from the Broad Institute. The following picture shows the existing on-premise system and contrasts it with the target cloud-based system.



Case Study 3: Secure Research Environments for a US-based Academic Medical Centre
Secure Research Environments (SRE) provide researchers with timely and secure access to sensitive research data, computation systems, and common analytics tools for speeding up scientific research in the cloud. Researchers are given access to approved data, enabling them to collaborate, analyze data, and share results with proper controls and audit trails. Research Gateway provides this secure data platform with analytical and orchestration tools to support researchers in conducting their work. Their results can then be exported safely, with proper workflows for submission reviews and approvals.

Addressing Secure Research Needs for Sensitive Data with Ingress/Egress Controls
RL addressed the SRE needs of a US-based Academic Medical Centre with HIPAA-compliant research for its Health Sciences group. The solution has the following key building blocks:

  • Data Ingress/Egress
  • Researcher Workflows & Collaborations with cost controls
  • Ongoing Researcher Tools Updates
  • Software Patching & Security Upgrades
  • Healthcare (or other sensitive) Data Compliance
  • Security Monitoring, Audit Trail, Budget Controls, User Access & Management
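
The egress side of these building blocks, where results leave the environment only after submission, review, and approval, can be sketched as a small state machine with an audit trail. This is an assumption-laden illustration rather than the Research Gateway implementation: the state names, the allowed transitions, and the rule that researchers cannot review their own requests are all hypothetical choices.

```python
from enum import Enum


class EgressState(Enum):
    DRAFT = "draft"
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXPORTED = "exported"


# Hypothetical transition table: which states may follow each state.
ALLOWED = {
    EgressState.DRAFT: {EgressState.SUBMITTED},
    EgressState.SUBMITTED: {EgressState.APPROVED, EgressState.REJECTED},
    EgressState.REJECTED: {EgressState.SUBMITTED},
    EgressState.APPROVED: {EgressState.EXPORTED},
    EgressState.EXPORTED: set(),
}


class EgressRequest:
    """A request to export results out of the secure environment."""

    def __init__(self, researcher, dataset):
        self.researcher = researcher
        self.dataset = dataset
        self.state = EgressState.DRAFT
        self.audit_trail = []  # (actor, from_state, to_state) tuples

    def transition(self, new_state, actor):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot move from {self.state.value} to {new_state.value}"
            )
        # Assumed rule: the review decision must come from someone else.
        is_review = new_state in (EgressState.APPROVED, EgressState.REJECTED)
        if is_review and actor == self.researcher:
            raise PermissionError("a researcher cannot review their own request")
        self.audit_trail.append((actor, self.state.value, new_state.value))
        self.state = new_state
```

Because every successful transition is recorded with its actor, the audit trail shows who submitted, who approved, and who exported. An export attempted before approval fails with an error instead of silently succeeding.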

The figure below shows the implementation of the SRE solution with Research Gateway.



Conclusion
Relevance Lab, in partnership with public cloud providers, is driving frictionless outcomes by enabling secure and scalable research with Research Gateway across various use cases. By making it possible to set up and run research workloads in just 30 minutes, with self-service access and cost control, the solution enables the creation of internal virtual labs, accelerates complex genomic workloads, and meets the needs of Secure Research Environments with ingress/egress controls.

To know more about virtual Cloud Analytics training labs and launching Genomics Research in less than 30 minutes, explore the solution at https://research.rlcatalyst.com or feel free to write to marketing@relevancelab.com



