Kubernetes Architect (AI / GPU Platforms)
Seeking a Kubernetes Architect to design and deliver GPU-accelerated container platforms for AI/ML and HPC workloads. This customer-facing role involves full solution lifecycle ownership, from discovery to optimization. It requires deep expertise in Kubernetes, the NVIDIA GPU ecosystem, and high-performance infrastructure.
Job Description
Location: Dallas, TX (Hybrid – 3/2) | Relocation available
Type: Direct Hire
• $175K–$250K base + performance bonus
• 100% company-paid benefits
Overview
We are seeking a Kubernetes Architect to lead the design and delivery of GPU-accelerated container platforms supporting next-generation AI, machine learning, and high-performance computing workloads.
This organization operates at the forefront of large-scale compute infrastructure, building platforms that power scientific research, advanced simulation, and data-intensive innovation. This role sits at the intersection of Kubernetes, HPC, and GPU infrastructure, driving architecture decisions that directly impact performance, scalability, and multi-tenant platform efficiency.
This is a customer-facing architecture role with ownership across the full solution lifecycle, from early discovery and requirements definition through design, proof-of-concept, deployment, and long-term optimization. You will serve as a trusted advisor to both internal stakeholders and external users, shaping how GPU-based Kubernetes platforms are built and scaled across complex environments.
Key Responsibilities
Architecture & Customer Engagement
• Serve as the primary architectural lead for GPU-accelerated Kubernetes platforms supporting HPC and AI/ML workloads
• Translate complex workload requirements into scalable, production-ready reference architectures
• Lead discovery sessions, technical design workshops, and performance benchmarking engagements
• Guide customers through platform adoption, integration, and long-term optimization strategies
• Present architectural solutions and act as a subject matter expert in Kubernetes-based HPC environments
Kubernetes & GPU Platform Engineering
• Design and optimize Kubernetes clusters for GPU-intensive workloads in on-prem and hybrid environments
• Implement and tune NVIDIA ecosystem components, including GPU Operator, DCGM, MIG, and device plugins
• Optimize GPU scheduling and utilization through Kubernetes extensions (Volcano, Slurm integration, scheduler plugins)
• Develop and extend Kubernetes operators and controllers (Go/Python) to automate infrastructure services
Infrastructure Integration (Compute, Storage, Network)
• Architect end-to-end platform integration across compute, storage, and networking layers
• Integrate high-performance storage solutions (Lustre, GPFS, Ceph, VAST) into Kubernetes environments
• Design and support high-performance networking and GPU interconnects (InfiniBand, RDMA, RoCE, NVLink) for distributed workloads
• Define multi-tenant architectures with strong isolation, security, and resource governance (RBAC, OPA/Gatekeeper)
Observability, Automation & Performance
• Implement observability frameworks using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
• Drive workload profiling, benchmarking, and performance tuning across distributed compute environments
• Support GitOps-based deployment models using ArgoCD, FluxCD, Helm, and Kustomize
• Partner with HPC and ML teams to validate performance and scalability at production scale
Ecosystem & Product Collaboration
• Collaborate with internal product, engineering, and operations teams to influence platform roadmap
• Engage with key ecosystem partners (NVIDIA, networking and storage vendors) to integrate emerging technologies
• Provide forward-looking guidance on GPU architectures, interconnect evolution, and orchestration trends
Required Experience
• Extensive experience designing and operating Kubernetes platforms in HPC or GPU-intensive environments
• Deep expertise with the NVIDIA GPU ecosystem (GPU Operator, MIG, DCGM, device plugins)
• Strong understanding of Kubernetes internals, including CRDs, RBAC, scheduling, and custom controllers
• Experience integrating distributed storage systems for high-performance workloads
• Strong knowledge of high-performance networking (InfiniBand, RDMA, RoCE) in containerized environments
• Proven ability to design scalable, secure, and highly available distributed compute platforms
• Proficiency in Go or Python for infrastructure automation or operator development
• Experience with workload benchmarking, profiling, and performance optimization
• Strong communication skills with the ability to translate complex technical concepts into actionable solutions
Preferred Experience
• Experience delivering end-to-end customer solutions from design through deployment and adoption
• Familiarity with HPC workload orchestration and container tooling (Slurm, Kubernetes schedulers, Apptainer/Singularity)
• Exposure to GitOps and infrastructure-as-code practices in Kubernetes environments
• Contributions to open-source Kubernetes or GPU ecosystem projects
• Experience advising on long-term platform strategy and emerging technology adoption
• Relevant certifications such as CKA, CKAD, CKS, or cloud architecture certifications (AWS, Azure)
Why This Role
• High-impact role shaping next-generation AI and HPC infrastructure
• Direct influence on the architecture, performance, and scalability of large-scale platforms
• Strong visibility across engineering, product, and customer environments
• Backed by significant investment and long-term growth in AI compute platforms