RCH Solutions seeks a Principal Cloud Platform Engineer with deep expertise in Kubernetes-based infrastructure to join our Cloud Engineering team. This role is ideal for individuals who take pride in designing, operating and evolving large-scale multi-tenant AI Platforms enabling real-world data and AI applications in the life sciences domain. The ideal candidate will own the reliability, scalability, and operational excellence of shared infrastructure supporting RAG-based AI workloads.
About Us
RCH Solutions is a rapidly growing global provider of computational science expertise within Life Sciences and Healthcare. At RCH, our team rallies around a culture crafted for learning and achieving. We’re relentless in our pursuit of innovation and demand of ourselves that we deliver a ground-breaking computing experience for our clients, so that they can deliver life-saving science to humanity.
Core Values
At RCH, our Core Values are more than just words: they represent the threads that weave together the fabric of our culture. We use them as a guide when interviewing new team members, as a barometer when evaluating our performance as individuals and teams, and even when deciding which customers to work with. RCH’s Values embody the behaviors by which we measure our success and create a framework for our growth as people and professionals.
Our Core Values:
- Embrace Excellence: We strive for best-in-class delivery of innovation and service
- Be Accountable: Integrity, ownership and accountability are non-negotiable
- Adventure Together: We are committed to fostering a culture that embraces continuous improvement
- Succeed as a Team: We believe harnessing the power of a team drives outcomes not achievable by individuals
- Boundaries and Balance: Work-life balance is a core facet of our culture
Job Description
RCH Solutions is seeking a Principal Cloud Platform Engineer with deep expertise in Kubernetes-based infrastructure to join our Cloud Engineering team. This role is ideal for individuals who take pride in designing, operating, and evolving large-scale, multi-tenant AI platforms that enable real-world data and AI applications in the life sciences domain. The role is focused on platform-level engineering. You will own the reliability, scalability, and operational excellence of shared infrastructure supporting RAG-based AI workloads, with a strong emphasis on Kubernetes cluster operations and vector database systems. You'll collaborate closely with Data Engineers and AI Engineers, supporting them with a scalable, multi-tenant, cloud-hosted infrastructure platform.
Key Responsibilities:
- Platform & Infrastructure Engineering
  - Design, operate, and continuously improve production-grade K8s clusters at the platform level
  - Lead complex cluster lifecycle management, including:
    - Version upgrades and dependency coordination
    - Failure recovery and incident resolution
    - Non-trivial maintenance and system evolution
  - Build and maintain highly reliable, scalable, multi-tenant infrastructure
  - Build and maintain end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith, covering performance, latency, token usage, and alerting
- Multi-Tenant Platform Architecture
  - Architect and operate shared infrastructure across multiple teams and use cases
  - Implement and enforce:
    - RBAC and access control models
    - Tenant isolation and security boundaries
    - Resource management and fairness at scale
  - Ensure platform stability under diverse and competing workloads
- Vector Retrieval & AI Infrastructure
  - Operate and optimize vector database systems (Weaviate preferred) in production environments
  - Support and scale Retrieval-Augmented Generation (RAG) systems
  - Drive improvements in:
    - Query performance and latency
    - Cluster tuning and resource efficiency
    - Operational stability of retrieval pipelines
- Production Ownership & Reliability
  - Take technical ownership of production systems over time
  - Build and maintain strong practices in:
    - Observability (metrics, logs, tracing)
    - Incident response and root cause analysis
    - Long-term system health and resilience
  - Proactively identify and resolve reliability risks
- Cross-Functional Collaboration
  - Work closely with backend and GenAI engineers to ensure seamless integration with the platform
  - Contribute to a balanced team structure, with a strong infrastructure core and targeted application-layer support
Technical Skills Required:
- 5+ years of hands-on experience in high-scale platform engineering (internal platforms, PaaS, or shared infrastructure)
- Deep Kubernetes Platform Expertise
  - Hands-on experience with GKE:
    - Cluster upgrades, node pool management, autoscaling
    - Managing failures, disruptions, and complex maintenance scenarios
    - RBAC, namespaces, network policies
  - GCP IAM, Workload Identity, Secret Manager
  - GCP Storage: BigQuery, GCS, Firestore
  - Terraform and IaC experience with GitOps workflows (ArgoCD, Flux, or equivalent)
  - Strong observability practices using:
    - Google Cloud Operations Suite (Stackdriver)
    - Prometheus / Grafana
  - Hands-on experience operating vector databases in production, ideally Weaviate:
    - Query performance tuning
    - Cluster stability and scaling behavior
- Distributed Systems & GCP Architecture
  - Solid understanding of distributed systems design and failure modes
  - Multi-zone / regional architectures
  - Google Cloud Load Balancing
Nice to Have:
- Experience with Elasticsearch, OpenSearch, Azure AI Search, or similar distributed search systems
- Experience with vector databases other than Weaviate: Milvus, Pinecone, Qdrant, or pgvector
- Experience in designing, building and maintaining end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith: performance, latency, token usage, and alerting
- Exposure to GenAI platforms and LLM-based applications
- Experience in the life sciences domain
Benefits & Perks:
Great talent should benefit from a great work environment. If you join our team, you’ll have access to:
- A competitive salary and bonus package based on experience
- Comprehensive health and wellness benefits, including Medical, Dental, and Vision Insurance
- Company-provided Life and Long-Term Disability Insurance
- Company-sponsored 401(k) Plan
- Company-provided continuing education benefit
- Team-focused culture and unlimited opportunity for advancement
This role is open only to applicants who do not need sponsorship now or in the future. No third parties, please.