Senior Infrastructure Engineer, AI/ML Platform

claven Uzbekistan
Remote
AI Summary

AI-native team seeking an experienced Infrastructure Engineer to own and evolve the Kubernetes-based GPU cluster and model-serving stack. Key responsibilities include operating, scaling, and securing GPU infrastructure, integrating AI/ML serving technologies, and building robust observability. Requires strong Kubernetes, Linux, networking fundamentals, and proficiency in IaC, observability, and CI/CD tools.

Key Highlights
Operate and evolve Kubernetes-based GPU clusters for AI/ML workloads.
Integrate and tune model-serving stacks (vLLM, Triton) for performance and cost.
Build observability for GPU utilization, workload performance, and cluster health.
Technical Skills Required
Kubernetes, Go, Python, Rust, Terraform, Helm, Argo CD, GitHub Actions, Prometheus, Grafana, OpenTelemetry, DCGM, CUDA, MIG, NCCL, NVIDIA Triton, vLLM
Benefits & Perks
Ownership of a critical layer of the platform from day one
Direct work with founders and a small, AI-native engineering team
Fully remote, globally
Competitive compensation

Job Description


How We Work

We are an AI-native team. We use Claude Code daily, alongside other agentic AI tools and workflows, to ship faster. This is not a perk; it is how the work happens here.


If you join us, we expect that:

  • You actively use Claude Code, Cursor, or similar agentic AI in your daily workflow — not as a curiosity, but as your default way of working.
  • When handed an unfamiliar system (HAMi, MIG partitioning, vLLM internals, a new operator), you go deep on it with AI as your accelerator and come back with answers, prototypes, and opinions.
  • You take ownership. You don't wait to be told what to do once a problem is in your lane.
  • You maintain curiosity and learning velocity.


What You'll Own

You will own the infrastructure and orchestration layer that runs AI/ML workloads on our GPU clusters. Specifically:

  • Operate and evolve our Kubernetes-based GPU cluster: scheduling, autoscaling, GPU partitioning (MIG, HAMi, time-slicing), and workload isolation.
  • Integrate and tune the model-serving stack (vLLM, NVIDIA Triton, llm-d, and similar) for cost, latency, and throughput.
  • Build observability for GPU utilization, workload performance, and cluster health (Prometheus, Grafana, OpenTelemetry, DCGM).
  • Own infrastructure-as-code and CI/CD for the platform (Terraform, Helm, Argo CD, GitHub Actions).
  • Implement multi-tenancy, network policy, and cluster-level security in support of our data sovereignty and air-gapped operating modes.
  • Partner with our backend engineer on the contracts where the infrastructure layer meets the portal and APIs.
  • Help diagnose and resolve production issues across the stack as we onboard pilot customers.


What We're Looking For

We care more about what you can actually do than how many years it took you to get there. The bar is:

  • You have operated production Kubernetes end-to-end — upgrades, autoscaling, debugging incidents — not just deployed workloads to a cluster someone else runs.
  • Strong Linux, networking, and systems fundamentals.
  • Comfortable in at least one of Go, Python, or Rust for tooling, controllers, or operators.
  • Hands-on with infrastructure-as-code, observability tooling, and CI/CD pipelines.
  • Strong written English and clear, low-ego communication.
  • 4+ years in infrastructure, platform, or SRE roles is a useful soft floor, not a hard gate.


Bonus Points

None of these are required. They will accelerate your ramp-up, but we are happy to hire someone with strong fundamentals and zero GPU experience who is excited to learn.

  • Hands-on with the NVIDIA stack (CUDA, MIG, NCCL, Triton, DCGM).
  • Experience with GPU partitioning or virtualization (HAMi, MIG, time-slicing, MPS).
  • Serving LLMs in production (vLLM, llm-d, TGI, SGLang).
  • Multi-tenancy or compliance work in regulated industries.
  • Background in ML platforms, developer infrastructure, or internal PaaS-style products.
  • Familiarity with usage-based metering and quota systems.


What You Get

  • Ownership of a critical layer of the platform from day one.
  • Direct work with the founders and a small, AI-native engineering team — no bureaucracy, no political layers.
  • A real customer pipeline in regulated markets — pilots, not vaporware.
  • Fully remote, globally. We hire for talent and attitude. Working hours should overlap meaningfully with the team.
  • Compensation is discussed during the hiring process and is competitive for the candidate's market. 


