Machine Learning Manager | AI Cloud Infrastructure | Kubernetes | AI / Quantum Computing Start-up
Location: Munich, Germany (Hybrid)
Join a well-funded, fast-growing deep-tech company at the forefront of innovation. The company is recognised as a leader in quantum software within Europe and has been highlighted as one of the most promising AI companies globally. Its vibrant, multicultural, and international team is rapidly expanding and currently comprises over 150 dedicated professionals.
We are looking for an experienced Senior Systems Engineer with a strong background in cloud infrastructure to lead a critical initiative within the Platform Engineering team.
This role sits squarely at the intersection of High-Performance Computing (HPC), Kubernetes Internals, and Bare Metal Systems Engineering.
Key Responsibilities
- Building the Control Plane: Design and develop the core software layer (APIs, Controllers, Agents) that automates the entire lifecycle of bare-metal AI infrastructure.
- Massive-Scale Orchestration: Architect sophisticated scheduling solutions for distributed training jobs across enormous GPU clusters (NVIDIA H200/B200/B300), focusing on efficient bin-packing and gang scheduling.
- Optimising the Fabric: Deeply tune the software-defined networking layer to ensure ultra-low-latency interconnects (InfiniBand/RDMA/RoCEv2) crucial for multi-node training performance.
- Kubernetes Customisation: Extend Kubernetes by developing custom Operators and CRDs to abstract complex hardware realities (topology awareness, GPU partitioning) for data scientists.
- Deep Systems Debugging: Troubleshoot and resolve complex issues at the kernel and hardware level, from PCIe bus errors and NCCL timeouts to kernel panics on bare-metal nodes.
- Defining Standards: Create the "Golden Image" for AI workloads, mastering driver, firmware, and OS optimisations to extract maximum performance from the hardware.
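To give a flavour of the orchestration work above, here is a minimal sketch of gang scheduling combined with best-fit bin-packing. It is a toy illustration, not the company's actual scheduler; node names, GPU counts, and the function itself are all hypothetical.

```python
# Toy illustration of gang scheduling with best-fit bin-packing:
# a job's workers are placed all-or-nothing across GPU nodes.

def gang_schedule(job_workers, gpus_per_worker, free_gpus):
    """free_gpus: dict of node -> free GPU count. Returns one node
    per worker, or None if the full gang cannot be placed."""
    free = dict(free_gpus)
    placement = []
    for _ in range(job_workers):
        # Best fit: pick the node with the least free capacity that
        # still fits, preserving large blocks for future big jobs.
        candidates = [n for n, g in free.items() if g >= gpus_per_worker]
        if not candidates:
            return None  # gang semantics: place everything or nothing
        node = min(candidates, key=lambda n: free[n])
        free[node] -= gpus_per_worker
        placement.append(node)
    return placement

cluster = {"node-a": 8, "node-b": 4, "node-c": 2}
print(gang_schedule(2, 4, cluster))  # ['node-b', 'node-a']
print(gang_schedule(4, 4, cluster))  # None: only 3 workers fit
```

Production schedulers such as Volcano or Kueue add queueing, preemption, and topology awareness on top of this basic all-or-nothing idea.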
Required Skills
- Software Engineering Expertise: 10+ years of software engineering experience. Strong proficiency in Python is a must. Experience with Go (Golang) is a plus. You must be comfortable building system agents, APIs, and CLI tools.
- OpenStack Expertise: Deep architectural and operational knowledge of OpenStack. You must be proficient with core components (Nova, Neutron, Keystone, Glance) and understand how to manage large-scale deployments.
- Deep Kubernetes Knowledge: You understand K8s internals beyond simple deployment. Experience with Custom Resource Definitions (CRDs), Operators, and the Kubernetes API server architecture.
- GPU Ecosystem Experience: Hands-on experience managing NVIDIA GPU clusters. Familiarity with NVIDIA drivers, CUDA toolkit, and the container runtime (NVIDIA Container Toolkit).
- Linux Internals: Deep understanding of the Linux kernel, cgroups, namespaces, and system performance tuning.
- Infrastructure as Code: Mastery of declarative infrastructure tools (Terraform, Ansible) but with a focus on provisioning physical hardware rather than just cloud VMs.
- Problem Solving: A proven track record of debugging complex distributed systems where the root cause could be code, network, or silicon.
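As an illustration of the Operator/CRD work described above, here is a minimal sketch of a manifest for a hypothetical `TrainingJob` custom resource that hides hardware details from data scientists. The API group, kind, and field names are illustrative assumptions, not the company's actual API.

```python
# Hypothetical "TrainingJob" custom resource: the kind of CRD that
# abstracts GPU topology and partitioning away from data scientists.
# All group/kind/field names below are illustrative assumptions.

def training_job_manifest(name, workers, gpus_per_worker,
                          topology="nvlink-island"):
    """Build a manifest dict for a hypothetical TrainingJob CRD."""
    return {
        "apiVersion": "ai.example.com/v1alpha1",
        "kind": "TrainingJob",
        "metadata": {"name": name},
        "spec": {
            "workers": workers,
            "gpusPerWorker": gpus_per_worker,
            # A controller would translate this hint into node
            # affinity and gang-scheduling constraints.
            "topologyHint": topology,
        },
    }

manifest = training_job_manifest("llm-pretrain", workers=16,
                                 gpus_per_worker=8)
print(manifest["spec"]["gpusPerWorker"])  # 8
```

In practice a custom controller watches resources like this via the Kubernetes API server and reconciles them into pods, device allocations, and network attachments.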
Preferred Qualifications
- Multi-Tenant Cloud Development: Proven experience architecting and developing secure multi-tenant cloud environments using OpenStack.
- HPC Background: Experience working with traditional supercomputing schedulers (Slurm, PBS) or modern batch schedulers (Volcano, Kueue, Ray).
- Bare Metal Provisioning: Experience with OpenStack Ironic or similar tools like Cluster API (CAPI), Metal3, Tinkerbell, or Canonical MaaS.
- High-Speed Networking: Knowledge of RDMA, InfiniBand, GPUDirect, and how to expose these technologies to containerized workloads.
- AI/ML Familiarity: Understanding of how distributed training works (e.g., PyTorch Distributed, Megatron-Bridge, DeepSpeed) and the infrastructure requirements of Large Language Models (LLMs).
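The core step of data-parallel distributed training mentioned above can be sketched in a few lines: each worker computes gradients on its own data shard, then the gradients are averaged across workers (the role NCCL's all-reduce plays across GPUs in frameworks like PyTorch Distributed or DeepSpeed). This is a pure-Python toy, not how the collective is actually implemented.

```python
# Toy sketch of the all-reduce-mean at the heart of data-parallel
# training: average per-worker gradients element-wise so every
# worker applies the same update. Illustration only.

def allreduce_mean(per_worker_grads):
    """Average gradient vectors element-wise across workers."""
    n = len(per_worker_grads)
    return [sum(vals) / n for vals in zip(*per_worker_grads)]

grads = [[1.0, 2.0], [3.0, 4.0]]  # two workers, two parameters
print(allreduce_mean(grads))  # [2.0, 3.0]
```

At cluster scale, the interesting infrastructure work is making this collective fast: ring or tree algorithms over NVLink and InfiniBand, with GPUDirect RDMA avoiding host-memory copies.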
Perks & Compensation
- Competitive Package: Indefinite contract, guaranteed equal pay, and a variable performance bonus.
- Immediate Incentives: Signing bonus and relocation package (where applicable).
- Support & Flexibility: Private health insurance, generous educational budget, hybrid working opportunity, and flexible hours.
- The Environment: Join a high-performance, collaborative team operating at pace on the absolute cutting edge of AI and deep-tech infrastructure.
Interested? Apply directly through LinkedIn, or send your CV to george@eu-recruit.com
By applying to this role you understand that we may collect your personal data and store and process it on our systems. For more information, please see our Privacy Notice: https://eu-recruit.com/wp-content/uploads/2024/07/European-Tech-Recruit-Privacy-Notice-2024.pdf