Kubernetes Architect (AI / GPU Platforms)

gtn technical staffing • Dallas-Fort Worth Metroplex

AI Summary

Seeking a Kubernetes Architect to design and deliver GPU-accelerated container platforms for AI/ML and HPC workloads. This customer-facing role involves full solution lifecycle ownership, from discovery to optimization. Requires deep expertise in Kubernetes, NVIDIA GPU ecosystem, and high-performance infrastructure.

Key Highlights

Lead the design and delivery of GPU-accelerated Kubernetes platforms for AI/ML and HPC.

Full solution lifecycle ownership, acting as a trusted advisor to stakeholders and users.

Requires deep expertise in Kubernetes internals, NVIDIA GPU ecosystem, and high-performance infrastructure.

Technical Skills Required

Kubernetes, NVIDIA GPU Operator, NVIDIA DCGM, NVIDIA MIG, NVIDIA device plugins, Volcano, Slurm, Go, Python, Lustre, GPFS, Ceph, VAST, InfiniBand, RDMA, RoCE, NVLink, RBAC, OPA/Gatekeeper, Prometheus, Grafana, DCGM Exporter, OpenTelemetry, ArgoCD, FluxCD, Helm, Kustomize, Apptainer/Singularity

Job Description

Kubernetes Architect (AI / GPU Platforms)

Location: Dallas, TX (Hybrid – 3/2) | Relocation available

Type: Direct Hire

• $175K–$250K base + performance bonus

• 100% company-paid benefits

Overview

We are seeking a Kubernetes Architect to lead the design and delivery of GPU-accelerated container platforms supporting next-generation AI, machine learning, and high-performance computing workloads.

This organization operates at the forefront of large-scale compute infrastructure, building platforms that power scientific research, advanced simulation, and data-intensive innovation. This role sits at the intersection of Kubernetes, HPC, and GPU infrastructure, driving architecture decisions that directly impact performance, scalability, and multi-tenant platform efficiency.

This is a customer-facing architecture role with ownership across the full solution lifecycle, from early discovery and requirements definition through design, proof-of-concept, deployment, and long-term optimization. You will serve as a trusted advisor to both internal stakeholders and external users, shaping how GPU-based Kubernetes platforms are built and scaled across complex environments.

Key Responsibilities

Architecture & Customer Engagement

• Serve as the primary architectural lead for GPU-accelerated Kubernetes platforms supporting HPC and AI/ML workloads

• Translate complex workload requirements into scalable, production-ready reference architectures

• Lead discovery sessions, technical design workshops, and performance benchmarking engagements

• Guide customers through platform adoption, integration, and long-term optimization strategies

• Present architectural solutions and act as a subject matter expert in Kubernetes-based HPC environments

Kubernetes & GPU Platform Engineering

• Design and optimize Kubernetes clusters for GPU-intensive workloads in on-prem and hybrid environments

• Implement and tune NVIDIA ecosystem components, including GPU Operator, DCGM, MIG, and device plugins

• Optimize GPU scheduling and utilization through Kubernetes extensions (Volcano, Slurm integration, scheduler plugins)

• Develop and extend Kubernetes operators and controllers (Go/Python) to automate infrastructure services

Infrastructure Integration (Compute, Storage, Network)

• Architect end-to-end platform integration across compute, storage, and networking layers

• Integrate high-performance storage solutions (Lustre, GPFS, Ceph, VAST) into Kubernetes environments

• Design and support high-performance networking (InfiniBand, RDMA, RoCE, NVLink) for distributed workloads

• Define multi-tenant architectures with strong isolation, security, and resource governance (RBAC, OPA/Gatekeeper)

Observability, Automation & Performance

• Implement observability frameworks using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry

• Drive workload profiling, benchmarking, and performance tuning across distributed compute environments

• Support GitOps-based deployment models using ArgoCD, FluxCD, Helm, and Kustomize

• Partner with HPC and ML teams to validate performance and scalability at production scale

Ecosystem & Product Collaboration

• Collaborate with internal product, engineering, and operations teams to influence platform roadmap

• Engage with key ecosystem partners (NVIDIA, networking and storage vendors) to integrate emerging technologies

• Provide forward-looking guidance on GPU architectures, interconnect evolution, and orchestration trends

Required Experience

• Extensive experience designing and operating Kubernetes platforms in HPC or GPU-intensive environments

• Deep expertise with the NVIDIA GPU ecosystem (GPU Operator, MIG, DCGM, device plugins)

• Strong understanding of Kubernetes internals, including CRDs, RBAC, scheduling, and custom controllers

• Experience integrating distributed storage systems for high-performance workloads

• Strong knowledge of high-performance networking (InfiniBand, RDMA, RoCE) in containerized environments

• Proven ability to design scalable, secure, and highly available distributed compute platforms

• Proficiency in Go or Python for infrastructure automation or operator development

• Experience with workload benchmarking, profiling, and performance optimization

• Strong communication skills with the ability to translate complex technical concepts into actionable solutions

Preferred Experience

• Experience delivering end-to-end customer solutions from design through deployment and adoption

• Familiarity with HPC workload orchestration tools (Slurm, Kubernetes schedulers, Apptainer/Singularity)

• Exposure to GitOps and infrastructure-as-code practices in Kubernetes environments

• Contributions to open-source Kubernetes or GPU ecosystem projects

• Experience advising on long-term platform strategy and emerging technology adoption

• Relevant certifications such as CKA, CKAD, CKS, or cloud architecture certifications (AWS, Azure)

Why This Role

• High-impact role shaping next-generation AI and HPC infrastructure

• Direct influence on the architecture, performance, and scalability of large-scale platforms

• Strong visibility across engineering, product, and customer environments

• Backed by significant investment and long-term growth in AI compute platforms

Job Overview

Posted Date: May 07, 2026

Employment Type: Full-time

Experience Level: Mid-Senior level

Location: Dallas-Fort Worth Metroplex

Annual Salary: 175,000 - 250,000 USD

Category: DevOps

Company: gtn technical staffing
