AI-native team seeking an experienced Infrastructure Engineer to own and evolve the Kubernetes-based GPU cluster and model-serving stack. Key responsibilities include operating, scaling, and securing GPU infrastructure, integrating AI/ML serving technologies, and building robust observability. Requires strong Kubernetes, Linux, networking fundamentals, and proficiency in IaC, observability, and CI/CD tools.
Job Description
How We Work
We are an AI-native team. We use Claude Code daily, alongside other agentic AI tools and workflows, to ship faster. This is not a perk; it is how the work happens here.
If you join us, we expect that:
- You actively use Claude Code, Cursor, or similar agentic AI in your daily workflow — not as a curiosity, but as your default way of working.
- When handed an unfamiliar system (HAMi, MIG partitioning, vLLM internals, a new operator), you go deep on it with AI as your accelerator and come back with answers, prototypes, and opinions.
- You take ownership. You don't wait to be told what to do once a problem is in your lane.
- You maintain curiosity and learning velocity.
What You'll Own
You will own the infrastructure and orchestration layer that runs AI/ML workloads on our GPU clusters. Specifically:
- Operate and evolve our Kubernetes-based GPU cluster: scheduling, autoscaling, GPU partitioning (MIG, HAMi, time-slicing), and workload isolation.
- Integrate and tune the model-serving stack (vLLM, NVIDIA Triton, llm-d, and similar) for cost, latency, and throughput.
- Build observability for GPU utilization, workload performance, and cluster health (Prometheus, Grafana, OpenTelemetry, DCGM).
- Own infrastructure-as-code and CI/CD for the platform (Terraform, Helm, Argo CD, GitHub Actions).
- Implement multi-tenancy, network policy, and cluster-level security in support of our data sovereignty and air-gapped operating modes.
- Partner with our backend engineer on the contracts where the infrastructure layer meets the portal and APIs.
- Help diagnose and resolve production issues across the stack as we onboard pilot customers.
What We're Looking For
We care more about what you can actually do than how many years it took you to get there. The bar is:
- You have operated production Kubernetes end-to-end — upgrades, autoscaling, debugging incidents — not just deployed workloads to a cluster someone else runs.
- Strong Linux, networking, and systems fundamentals.
- Comfortable in at least one of Go, Python, or Rust for tooling, controllers, or operators.
- Hands-on with infrastructure-as-code, observability tooling, and CI/CD pipelines.
- Strong written English and clear, low-ego communication.
- 4+ years in infrastructure, platform, or SRE roles is a useful soft floor, not a hard gate.
Bonus Points
None of these are required. They will accelerate your ramp-up, and we are happy to hire someone with strong fundamentals and zero GPU experience who is excited to learn.
- Hands-on with the NVIDIA stack (CUDA, MIG, NCCL, Triton, DCGM).
- Experience with GPU partitioning or virtualization (HAMi, MIG, time-slicing, MPS).
- Serving LLMs in production (vLLM, llm-d, TGI, SGLang).
- Multi-tenancy or compliance work in regulated industries.
- Background in ML platforms, developer infrastructure, or internal PaaS-style products.
- Familiarity with usage-based metering and quota systems.
What You Get
- Ownership of a critical layer of the platform from day one.
- Direct work with the founders and a small, AI-native engineering team — no bureaucracy, no political layers.
- A real customer pipeline in regulated markets — pilots, not vaporware.
- Fully remote, globally. We hire for talent and attitude. Working hours should overlap meaningfully with the team.
- Compensation is discussed during the hiring process and is competitive for the candidate's market.