Kubernetes Architect (AI / GPU Platforms)

gtn technical staffing | Dallas-Fort Worth Metroplex
Relocation
AI Summary

Seeking a Kubernetes Architect to design and deliver GPU-accelerated container platforms for AI/ML and HPC workloads. This customer-facing role involves full solution lifecycle ownership, from discovery to optimization. Requires deep expertise in Kubernetes, the NVIDIA GPU ecosystem, and high-performance infrastructure.

Key Highlights
Lead the design and delivery of GPU-accelerated Kubernetes platforms for AI/ML and HPC.
Full solution lifecycle ownership, acting as a trusted advisor to stakeholders and users.
Requires deep expertise in Kubernetes internals, NVIDIA GPU ecosystem, and high-performance infrastructure.
Technical Skills Required
Kubernetes, NVIDIA GPU Operator, NVIDIA DCGM, NVIDIA MIG, NVIDIA device plugins, Volcano, Slurm, Go, Python, Lustre, GPFS, Ceph, VAST, InfiniBand, RDMA, RoCE, NVLink, RBAC, OPA/Gatekeeper, Prometheus, Grafana, DCGM Exporter, OpenTelemetry, ArgoCD, FluxCD, Helm, Kustomize, Apptainer/Singularity

Job Description


Kubernetes Architect (AI / GPU Platforms)

Location: Dallas, TX (Hybrid – 3/2) | Relocation available

Type: Direct Hire

• $175K–$250K base + performance bonus

• 100% company-paid benefits

Overview

We are seeking a Kubernetes Architect to lead the design and delivery of GPU-accelerated container platforms supporting next-generation AI, machine learning, and high-performance computing workloads.

This organization operates at the forefront of large-scale compute infrastructure, building platforms that power scientific research, advanced simulation, and data-intensive innovation. This role sits at the intersection of Kubernetes, HPC, and GPU infrastructure, driving architecture decisions that directly impact performance, scalability, and multi-tenant platform efficiency.

This is a customer-facing architecture role with ownership across the full solution lifecycle, from early discovery and requirements definition through design, proof-of-concept, deployment, and long-term optimization. You will serve as a trusted advisor to both internal stakeholders and external users, shaping how GPU-based Kubernetes platforms are built and scaled across complex environments.

Key Responsibilities

Architecture & Customer Engagement

• Serve as the primary architectural lead for GPU-accelerated Kubernetes platforms supporting HPC and AI/ML workloads

• Translate complex workload requirements into scalable, production-ready reference architectures

• Lead discovery sessions, technical design workshops, and performance benchmarking engagements

• Guide customers through platform adoption, integration, and long-term optimization strategies

• Present architectural solutions and act as a subject matter expert in Kubernetes-based HPC environments

Kubernetes & GPU Platform Engineering

• Design and optimize Kubernetes clusters for GPU-intensive workloads in on-prem and hybrid environments

• Implement and tune NVIDIA ecosystem components, including GPU Operator, DCGM, MIG, and device plugins

• Optimize GPU scheduling and utilization through Kubernetes extensions (Volcano, Slurm integration, scheduler plugins)

• Develop and extend Kubernetes operators and controllers (Go/Python) to automate infrastructure services

Infrastructure Integration (Compute, Storage, Network)

• Architect end-to-end platform integration across compute, storage, and networking layers

• Integrate high-performance storage solutions (Lustre, GPFS, Ceph, VAST) into Kubernetes environments

• Design and support high-performance networking (InfiniBand, RDMA, RoCE, NVLink) for distributed workloads

• Define multi-tenant architectures with strong isolation, security, and resource governance (RBAC, OPA/Gatekeeper)

Observability, Automation & Performance

• Implement observability frameworks using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry

• Drive workload profiling, benchmarking, and performance tuning across distributed compute environments

• Support GitOps-based deployment models using ArgoCD, FluxCD, Helm, and Kustomize

• Partner with HPC and ML teams to validate performance and scalability at production scale

Ecosystem & Product Collaboration

• Collaborate with internal product, engineering, and operations teams to influence platform roadmap

• Engage with key ecosystem partners (NVIDIA, networking and storage vendors) to integrate emerging technologies

• Provide forward-looking guidance on GPU architectures, interconnect evolution, and orchestration trends

Required Experience

• Extensive experience designing and operating Kubernetes platforms in HPC or GPU-intensive environments

• Deep expertise with the NVIDIA GPU ecosystem (GPU Operator, MIG, DCGM, device plugins)

• Strong understanding of Kubernetes internals, including CRDs, RBAC, scheduling, and custom controllers

• Experience integrating distributed storage systems for high-performance workloads

• Strong knowledge of high-performance networking (InfiniBand, RDMA, RoCE) in containerized environments

• Proven ability to design scalable, secure, and highly available distributed compute platforms

• Proficiency in Go or Python for infrastructure automation or operator development

• Experience with workload benchmarking, profiling, and performance optimization

• Strong communication skills with the ability to translate complex technical concepts into actionable solutions

Preferred Experience

• Experience delivering end-to-end customer solutions from design through deployment and adoption

• Familiarity with HPC workload orchestration tools (Slurm, Kubernetes schedulers, Apptainer/Singularity)

• Exposure to GitOps and infrastructure-as-code practices in Kubernetes environments

• Contributions to open-source Kubernetes or GPU ecosystem projects

• Experience advising on long-term platform strategy and emerging technology adoption

• Relevant certifications such as CKA, CKAD, CKS, or cloud architecture certifications (AWS, Azure)

Why This Role

• High-impact role shaping next-generation AI and HPC infrastructure

• Direct influence on the architecture, performance, and scalability of large-scale platforms

• Strong visibility across engineering, product, and customer environments

• Backed by significant investment and long-term growth in AI compute platforms

