Principal Cloud Platform Engineer

RCH Solutions, United States
Remote
AI Summary

RCH Solutions seeks a Principal Cloud Platform Engineer with deep expertise in Kubernetes-based infrastructure to join our Cloud Engineering team. This role is ideal for individuals who take pride in designing, operating and evolving large-scale multi-tenant AI Platforms enabling real-world data and AI applications in the life sciences domain. The ideal candidate will own the reliability, scalability, and operational excellence of shared infrastructure supporting RAG-based AI workloads.

Key Responsibilities
Design, operate, and continuously improve production-grade K8s clusters at the platform level
Lead complex cluster lifecycle management
Build and maintain highly reliable, scalable, multi-tenant infrastructure
Implement and enforce RBAC and access control models
Operate and optimize vector database systems in production environments
Technical Skills Required
Kubernetes, GKE, Terraform, GitOps, ArgoCD, Flux, GCP IAM, Workload Identity, Secret Manager, GCP Storage, BigQuery, GCS, Firestore, Prometheus, Grafana, Vector databases, Weaviate, Distributed Systems, GCP Architecture
Benefits & Perks
Competitive salary and bonus package based on experience
Comprehensive health and wellness benefits
Company-provided Life and Long-Term Disability Insurance
Company-sponsored 401(k) Plan
Company-provided continuing education benefit
Nice to Have
Experience with Elasticsearch, OpenSearch, Azure AI Search or similar distributed search systems
Experience with vector databases other than Weaviate: Milvus, Pinecone, Qdrant, or pgvector
Experience in designing, building and maintaining end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith

Job Description


About Us

RCH Solutions is a rapidly growing global provider of computational science expertise within Life Sciences and Healthcare. At RCH, our team rallies around a culture crafted for learning and achieving. We’re relentless in our pursuit of innovation and demand of ourselves that we deliver a ground-breaking computing experience for our clients, so that they can deliver life-saving science to humanity.

Core Values

At RCH, our Core Values are more than just words: they represent the threads that weave together the fabric of our culture. Used as a guide when interviewing new team members, as a barometer when evaluating our performance as individuals and teams, and even when deciding which customers to work with, RCH’s Values embody the behaviors by which we measure our success and create a framework for our growth as people and professionals.

Our Core Values:

  • Embrace Excellence: We strive for best-in-class delivery of innovation and service
  • Be Accountable: Integrity, ownership and accountability are non-negotiable
  • Adventure Together: We are committed to fostering a culture that embraces continuous improvement
  • Succeed as a Team: We believe harnessing the power of a team drives outcomes not achievable by individuals
  • Boundaries and Balance: Work-life balance is a core facet of our culture

If you share our core values, we encourage you to continue reading this posting, as you may have found a great home for your career.

Job Description

RCH Solutions is seeking a Principal Cloud Platform Engineer with deep expertise in Kubernetes-based infrastructure to join our Cloud Engineering team. This role is ideal for individuals who take pride in designing, operating, and evolving large-scale multi-tenant AI platforms that enable real-world data and AI applications in the life sciences domain. The role is focused on platform-level engineering. You will own the reliability, scalability, and operational excellence of shared infrastructure supporting RAG-based AI workloads, with a strong emphasis on Kubernetes cluster operations and vector database systems. You'll collaborate closely with Data Engineers and AI Engineers, supporting them with a cloud-hosted, scalable, multi-tenant infrastructure platform.

Key Responsibilities:

  • Platform & Infrastructure Engineering
    • Design, operate, and continuously improve production-grade K8s clusters at the platform level
    • Lead complex cluster lifecycle management, including:
      • Version upgrades and dependency coordination
      • Failure recovery and incident resolution
      • Non-trivial maintenance and system evolution
    • Build and maintain highly reliable, scalable, multi-tenant infrastructure
    • Build and maintain end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith — covering performance, latency, token usage, and alerting
  • Multi-Tenant Platform Architecture
    • Architect and operate shared infrastructure across multiple teams and use cases
    • Implement and enforce:
      • RBAC and access control models
      • Tenant isolation and security boundaries
      • Resource management and fairness at scale
    • Ensure platform stability under diverse and competing workloads
  • Vector Retrieval & AI Infrastructure
    • Operate and optimize vector database systems (Weaviate preferred) in production environments
    • Support and scale Retrieval-Augmented Generation (RAG) systems
    • Drive improvements in:
      • Query performance and latency
      • Cluster tuning and resource efficiency
      • Operational stability of retrieval pipelines
  • Production Ownership & Reliability
    • Take technical ownership of production systems over time
    • Build and maintain strong practices in:
      • Observability (metrics, logs, tracing)
      • Incident response and root cause analysis
      • Long-term system health and resilience
    • Proactively identify and resolve reliability risks
  • Cross-Functional Collaboration
    • Work closely with backend and GenAI engineers to ensure seamless integration with the platform
    • Contribute to a balanced team structure, with a strong infrastructure core and targeted application-layer support

Essential Qualifications:

  • 5+ years of hands-on experience in high-scale platform engineering (internal platforms, PaaS, or shared infrastructure)
  • Deep Kubernetes Platform Expertise
    • Hands-on experience with GKE:
      • Cluster upgrades, node pool management, autoscaling
      • Managing failures, disruptions, and complex maintenance scenarios
      • RBAC, namespaces, network policies
  • GCP IAM, Workload Identity, Secret Manager
  • GCP Storage: BigQuery, GCS, Firestore
  • Terraform and IaC experience with GitOps workflows (ArgoCD, Flux, or equivalent)
  • Strong observability practices using:
    • Google Cloud Operations Suite (formerly Stackdriver)
    • Prometheus / Grafana
  • Hands-on experience operating vector databases in production, ideally Weaviate:
    • Query performance tuning
    • Cluster stability and scaling behavior
  • Distributed Systems & GCP Architecture
    • Solid understanding of distributed systems design and failure modes
    • Multi-zone / regional architectures
    • Google Cloud Load Balancing

Preferred Qualifications:

  • Experience with Elasticsearch, OpenSearch, Azure AI Search or similar distributed search systems
  • Experience with vector databases other than Weaviate: Milvus, Pinecone, Qdrant, or pgvector
  • Experience in designing, building and maintaining end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith: performance, latency, token usage, and alerting
  • Exposure to GenAI platforms and LLM-based applications
  • Experience in the life sciences domain

Additional Information:

Great talent should benefit from a great work environment. If you join our team, you’ll have access to:

  • A competitive salary and bonus package based on experience
  • Comprehensive health and wellness benefits, including Medical, Dental, and Vision Insurance
  • Company-provided Life and Long-Term Disability Insurance
  • Company-sponsored 401(k) Plan
  • Company-provided continuing education benefit
  • Team-focused culture and unlimited opportunity for advancement

This is a remote position, and the candidate is expected to be able to work on a US East Coast time schedule.

  • This role is open only to applicants who do not need sponsorship now or in the future. No third parties, please.
