Principal Cloud Platform Engineer

RCH Solutions • United States

Remote

AI Summary

RCH Solutions seeks a Principal Cloud Platform Engineer with deep expertise in Kubernetes-based infrastructure to join our Cloud Engineering team. This role is ideal for individuals who take pride in designing, operating and evolving large-scale multi-tenant AI Platforms enabling real-world data and AI applications in the life sciences domain. The ideal candidate will own the reliability, scalability, and operational excellence of shared infrastructure supporting RAG-based AI workloads.

Key Responsibilities

Design, operate, and continuously improve production-grade K8s clusters at the platform level

Lead complex cluster lifecycle management

Build and maintain highly reliable, scalable, multi-tenant infrastructure

Implement and enforce RBAC and access control models

Operate and optimize vector database systems in production environments

Technical Skills Required

Kubernetes, GKE, Terraform, GitOps, ArgoCD, Flux, GCP IAM, Workload Identity, Secret Manager, GCP Storage, BigQuery, GCS, Firestore, Prometheus, Grafana, Vector databases, Weaviate, Distributed Systems, GCP Architecture

Benefits & Perks

Competitive salary and bonus package based on experience

Comprehensive health and wellness benefits

Company-provided Life and Long-Term Disability Insurance

Company-sponsored 401(k) Plan

Company-provided continuing education benefit

Nice to Have

Experience with Elasticsearch, OpenSearch, Azure AI Search or similar distributed search systems

Experience with vector databases other than Weaviate: Milvus, Pinecone, Qdrant, or pgvector

Experience in designing, building and maintaining end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith

Job Description

About Us

RCH Solutions is a rapidly growing global provider of computational science expertise within Life Sciences and Healthcare. At RCH, our team rallies around a culture crafted for learning and achieving. We’re relentless in our pursuit of innovation and demanding of ourselves to deliver a ground-breaking computing experience for our clients, so that they can deliver life-saving science to humanity.

Core Values

At RCH, our Core Values are more than just words; they are the threads that weave together the fabric of our culture. We use them as a guide when interviewing new team members, as a barometer when evaluating our performance as individuals and teams, and even when deciding which customers to work with. RCH’s Values embody the behaviors by which we measure our success and create a framework for our growth as people and professionals.

Our Core Values:

Embrace Excellence: We strive for best-in-class delivery of innovation and service
Be Accountable: Integrity, ownership and accountability are non-negotiable
Adventure Together: We are committed to fostering a culture that embraces continuous improvement
Succeed as a Team: We believe harnessing the power of a team drives outcomes not achievable by individuals
Boundaries and Balance: Work-life balance is a core facet of our culture

If you share in our core values, then we encourage you to continue reading this posting as you may have found a great home for your career.

Job Description

RCH Solutions is seeking a Principal Cloud Platform Engineer with deep expertise in Kubernetes-based infrastructure to join our Cloud Engineering team. This role is ideal for individuals who take pride in designing, operating and evolving large-scale multi-tenant AI Platforms enabling real-world data and AI applications in the life sciences domain. This role is focused on platform-level engineering. You will own the reliability, scalability, and operational excellence of shared infrastructure supporting RAG-based AI workloads, with a strong emphasis on Kubernetes cluster operations and vector database systems. You'll collaborate closely with Data Engineers and AI Engineers and support them by providing a cloud-hosted scalable multi-tenant infrastructure platform.

Key Responsibilities:

Platform & Infrastructure Engineering

Design, operate, and continuously improve production-grade K8s clusters at the platform level
Lead complex cluster lifecycle management, including:

Version upgrades and dependency coordination
Failure recovery and incident resolution
Non-trivial maintenance and system evolution

Build and maintain highly reliable, scalable, multi-tenant infrastructure

Build and maintain end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith, covering performance, latency, token usage, and alerting

Multi-Tenant Platform Architecture

Architect and operate shared infrastructure across multiple teams and use cases
Implement and enforce:

RBAC and access control models
Tenant isolation and security boundaries
Resource management and fairness at scale

Ensure platform stability under diverse and competing workloads

Vector Retrieval & AI Infrastructure

Operate and optimize vector database systems (Weaviate preferred) in production environments
Support and scale Retrieval-Augmented Generation (RAG) systems
Drive improvements in:

Query performance and latency
Cluster tuning and resource efficiency
Operational stability of retrieval pipelines

Production Ownership & Reliability

Take technical ownership of production systems over time
Build and maintain strong practices in:

Observability (metrics, logs, tracing)
Incident response and root cause analysis
Long-term system health and resilience

Proactively identify and resolve reliability risks

Cross-Functional Collaboration

Work closely with backend and GenAI engineers to ensure seamless integration with the platform
Contribute to a balanced team structure, with a strong infrastructure core and targeted application-layer support

Essential Qualifications:

5+ years of hands-on experience in high-scale platform engineering (internal platforms, PaaS, or shared infrastructure)

Deep Kubernetes Platform Expertise

Hands-on experience with GKE:

Cluster upgrades, node pool management, autoscaling
Managing failures, disruptions, and complex maintenance scenarios
RBAC, namespaces, network policies

GCP IAM, Workload Identity, Secret Manager
GCP Storage: BigQuery, GCS, Firestore

Terraform and IaC experience with GitOps workflows (ArgoCD, Flux, or equivalent)
Strong observability practices using:

Google Cloud Operations Suite (Stackdriver)
Prometheus / Grafana

Hands-on experience operating vector databases in production, ideally Weaviate:

Query performance tuning
Cluster stability and scaling behavior

Distributed Systems & GCP Architecture

Solid understanding of distributed systems design and failure modes
Multi-zone / regional architectures
Google Cloud Load Balancing

Preferred Qualifications:

Experience with Elasticsearch, OpenSearch, Azure AI Search or similar distributed search systems
Experience with vector databases other than Weaviate: Milvus, Pinecone, Qdrant, or pgvector
Experience in designing, building and maintaining end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith: performance, latency, token usage, and alerting
Exposure to GenAI platforms and LLM-based applications
Experience in the life sciences domain

Additional Information:

Great talent should benefit from a great work environment. If you join our team, you’ll have access to:

A competitive salary and bonus package based on experience
Comprehensive health and wellness benefits, including Medical, Dental, and Vision Insurance
Company-provided Life and Long-Term Disability Insurance
Company-sponsored 401(k) Plan
Company-provided continuing education benefit
Team-focused culture and unlimited opportunity for advancement

This is a remote position, and the candidate is expected to be able to work on a US East Coast time schedule.

This role is open only to applicants who do not need sponsorship now or in the future. No third parties, please.

Job Overview

Posted Date: May 12, 2026

Employment Type: Full-time

Experience Level: Mid-Senior level

Location: United States

Category: DevOps

Company: RCH Solutions
