Senior Infrastructure Engineer, AI/ML Platform

claven • Uzbekistan

Remote

AI Summary

An AI-native team is seeking an experienced Infrastructure Engineer to own and evolve its Kubernetes-based GPU cluster and model-serving stack. Key responsibilities include operating, scaling, and securing GPU infrastructure, integrating AI/ML serving technologies, and building robust observability. The role requires production Kubernetes experience, strong Linux and networking fundamentals, and hands-on proficiency with IaC, observability, and CI/CD tools.

Key Highlights

Operate and evolve Kubernetes-based GPU clusters for AI/ML workloads.

Integrate and tune model-serving stacks (vLLM, Triton) for performance and cost.

Build observability for GPU utilization, workload performance, and cluster health.

Technical Skills Required

Kubernetes, Go, Python, Rust, Terraform, Helm, Argo CD, GitHub Actions, Prometheus, Grafana, OpenTelemetry, DCGM, CUDA, MIG, NCCL, NVIDIA Triton, vLLM

Benefits & Perks

Ownership of a critical layer of the platform from day one

Direct work with founders and a small, AI-native engineering team

Fully remote, globally

Competitive compensation

Job Description

How We Work

We are an AI-native team. We use Claude Code daily, alongside other agentic AI tools and workflows, to ship faster. This is not a perk; it is how the work happens here.

If you join us, we expect that:

You actively use Claude Code, Cursor, or similar agentic AI in your daily workflow — not as a curiosity, but as your default way of working.
When handed an unfamiliar system (HAMi, MIG partitioning, vLLM internals, a new operator), you go deep on it with AI as your accelerator and come back with answers, prototypes, and opinions.
You take ownership. You don't wait to be told what to do once a problem is in your lane.
You maintain curiosity and learning velocity.

What You'll Own

You will own the infrastructure and orchestration layer that runs AI/ML workloads on our GPU clusters. Specifically:

Operate and evolve our Kubernetes-based GPU cluster: scheduling, autoscaling, GPU partitioning (MIG, HAMi, time-slicing), and workload isolation.
Integrate and tune the model-serving stack (vLLM, NVIDIA Triton, llm-d, and similar) for cost, latency, and throughput.
Build observability for GPU utilization, workload performance, and cluster health (Prometheus, Grafana, OpenTelemetry, DCGM).
Own infrastructure-as-code and CI/CD for the platform (Terraform, Helm, Argo CD, GitHub Actions).
Implement multi-tenancy, network policy, and cluster-level security in support of our data sovereignty and air-gapped operating modes.
Partner with our backend engineer on the contracts where the infrastructure layer meets the portal and APIs.
Help diagnose and resolve production issues across the stack as we onboard pilot customers.

What We're Looking For

We care more about what you can actually do than how many years it took you to get there. The bar is:

You have operated production Kubernetes end-to-end — upgrades, autoscaling, debugging incidents — not just deployed workloads to a cluster someone else runs.
Strong Linux, networking, and systems fundamentals.
Comfortable in at least one of Go, Python, or Rust for tooling, controllers, or operators.
Hands-on with infrastructure-as-code, observability tooling, and CI/CD pipelines.
Strong written English and clear, low-ego communication.
4+ years in infrastructure, platform, or SRE roles is a useful soft floor, not a hard gate.

Bonus Points

None of these are required. They will accelerate your ramp-up, and we are happy to hire someone with strong fundamentals and zero GPU experience who is excited to learn.

Hands-on with the NVIDIA stack (CUDA, MIG, NCCL, Triton, DCGM).
Experience with GPU partitioning or virtualization (HAMi, MIG, time-slicing, MPS).
Serving LLMs in production (vLLM, llm-d, TGI, SGLang).
Multi-tenancy or compliance work in regulated industries.
Background in ML platforms, developer infrastructure, or internal PaaS-style products.
Familiarity with usage-based metering and quota systems.

What You Get

Ownership of a critical layer of the platform from day one.
Direct work with the founders and a small, AI-native engineering team — no bureaucracy, no political layers.
A real customer pipeline in regulated markets — pilots, not vaporware.
Fully remote, globally. We hire for talent and attitude. Working hours should overlap meaningfully with the team.
Compensation is discussed during the hiring process and is competitive for the candidate's market.

Job Overview

Posted Date: May 26, 2026

Employment Type: Full-time

Experience Level: Mid-Senior level

Location: Uzbekistan

Category: DevOps

Company: claven
