Lead the architecture and modernization of large-scale compute platforms supporting AI/ML and scientific workloads. Design and optimize job scheduling strategies using Slurm, enable GPU-aware container orchestration, and optimize MPI-based distributed workloads. Implement centralized observability and integrate Kubernetes with traditional schedulers.
Job Title: Senior HPC & Kubernetes Architect
📍Location: Albany, NY (Relocation Required)
Full-Time | Hybrid/Onsite
Job Description:
- We are seeking a hands-on Senior Architect with deep expertise in High Performance Computing (HPC) and Kubernetes to lead the architecture and modernization of large-scale compute platforms supporting AI/ML and scientific workloads.
- This role requires demonstrated production experience designing Slurm-based HPC clusters, integrating GPU-enabled workloads with Kubernetes, and optimizing MPI-driven applications in hybrid cloud environments.
- This is not a general DevOps or Kubernetes administration role. Candidates must have production HPC cluster architecture experience.
Core Responsibilities:
- Architect hybrid environments integrating dedicated HPC clusters with Kubernetes-based container platforms.
- Design and optimize job scheduling strategies using Slurm (required), including priority queues, gang scheduling, and deterministic resource allocation.
- Enable GPU-aware container orchestration using Docker and Kubernetes with near bare-metal performance.
- Optimize MPI-based distributed workloads with low-latency networking (InfiniBand preferred).
- Design infrastructure automation pipelines using Terraform and Ansible.
- Architect secure multi-tenant environments with RBAC and TLS across cluster and container layers.
- Implement centralized observability using Elasticsearch, Logstash, and Kibana for large-scale job monitoring.
- Integrate Kubernetes with traditional schedulers to support AI/ML and high-throughput compute workloads.
- Support AWS-based HPC workloads including EKS, EC2 GPU instances, AWS Batch, and FSx for Lustre.
Required Technical Expertise:
- 5+ years of hands-on experience architecting and operating production HPC clusters.
- Deep experience with Slurm scheduler (configuration, partitioning, job queues, resource management).
- Strong knowledge of MPI (OpenMPI/MPICH) and distributed compute models.
- Experience with GPU clusters (NVIDIA CUDA) and GPU scheduling.
- Advanced Kubernetes architecture knowledge (cluster networking, resource quotas, performance tuning).
- Experience running containers in HPC environments (Docker; NVIDIA runtime experience strongly preferred).
- Proven experience with Infrastructure-as-Code (Terraform and/or Ansible).
- Strong Linux systems engineering background.
- Experience with AWS HPC services (EKS, EC2 GPU, FSx for Lustre, AWS Batch).
Strongly Preferred:
- Experience with NVIDIA Enroot and/or Pyxis.
- Exposure to InfiniBand networking.
- Experience integrating Kubernetes-native batch schedulers (Volcano).
- ELK Stack architecture at enterprise scale.
- CKA or equivalent certification.
Candidate Profile:
Ideal candidates will have backgrounds in:
- Research computing environments
- National laboratories
- AI/ML infrastructure platforms
- Financial modeling or quantitative computing
- Scientific simulation platforms
Important
- Applicants without direct Slurm or production HPC cluster experience will not be considered.