Machine Learning Manager | AI Cloud Infrastructure | Kubernetes | AI / Quantum Computing Start-up
Location: Munich, Germany (Hybrid)
Join a well-funded, fast-growing deep-tech company at the forefront of innovation. The company is recognised as a leader in quantum software within Europe and has been highlighted as one of the most promising AI companies globally. Its vibrant, multicultural, and international team is rapidly expanding and currently comprises over 150 dedicated professionals.
We are looking for an experienced Senior Systems Engineer with a strong background in cloud infrastructure to lead a critical initiative within the Platform Engineering team.
This role sits squarely at the intersection of High-Performance Computing (HPC), Kubernetes Internals, and Bare Metal Systems Engineering.
Key Responsibilities
- Building the Control Plane: Design and develop the core software layer (APIs, Controllers, Agents) that automates the entire lifecycle of bare-metal AI infrastructure.
- Massive-Scale Orchestration: Architect sophisticated scheduling solutions for distributed training jobs across enormous GPU clusters (NVIDIA H200/B200/B300), focusing on efficient bin-packing and gang scheduling.
- Optimising the Fabric: Deeply tune the software-defined networking layer to ensure ultra-low-latency interconnects (InfiniBand/RDMA/RoCEv2) crucial for multi-node training performance.
- Kubernetes Customisation: Extend Kubernetes by developing custom Operators and CRDs to abstract complex hardware realities (topology awareness, GPU partitioning) for data scientists.
- Deep Systems Debugging: Troubleshoot and resolve complex issues at the kernel and hardware level, from PCIe bus errors and NCCL timeouts to kernel panics on bare-metal nodes.
- Defining Standards: Create the "Golden Image" for AI workloads, mastering driver, firmware, and OS optimisations to extract maximum performance from the hardware.
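To give a flavour of the orchestration work above, here is a minimal sketch of gang scheduling combined with best-fit bin-packing. It is a toy illustration, not the company's actual scheduler; node names, GPU counts, and the function itself are all hypothetical.

```python
# Toy illustration of gang scheduling with best-fit bin-packing:
# a job's workers are placed all-or-nothing across GPU nodes.

def gang_schedule(job_workers, gpus_per_worker, free_gpus):
    """free_gpus: dict of node -> free GPU count. Returns one node
    per worker, or None if the full gang cannot be placed."""
    free = dict(free_gpus)
    placement = []
    for _ in range(job_workers):
        # Best fit: pick the node with the least free capacity that
        # still fits, preserving large blocks for future big jobs.
        candidates = [n for n, g in free.items() if g >= gpus_per_worker]
        if not candidates:
            return None  # gang semantics: place everything or nothing
        node = min(candidates, key=lambda n: free[n])
        free[node] -= gpus_per_worker
        placement.append(node)
    return placement

cluster = {"node-a": 8, "node-b": 4, "node-c": 2}
print(gang_schedule(2, 4, cluster))  # ['node-b', 'node-a']
print(gang_schedule(4, 4, cluster))  # None: only 3 workers fit
```

Production schedulers such as Volcano or Kueue add queueing, preemption, and topology awareness on top of this basic all-or-nothing idea.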
Required Skills
- Software Engineering Expertise: 10+ years of software engineering experience. Strong proficiency in Python is a must. Experience with Go (Golang) is a plus. You must be comfortable building system agents, APIs, and CLI tools.
- OpenStack Expertise: Deep architectural and operational knowledge of OpenStack. You must be proficient with core components (Nova, Neutron, Keystone, Glance) and understand how to manage large-scale deployments.
- Deep Kubernetes Knowledge: You understand K8s internals beyond simple deployment. Experience with Custom Resource Definitions (CRDs), Operators, and the Kubernetes API server architecture.
- GPU Ecosystem Experience: Hands-on experience managing NVIDIA GPU clusters. Familiarity with NVIDIA drivers, CUDA toolkit, and the container runtime (NVIDIA Container Toolkit).
- Linux Internals: Deep understanding of the Linux kernel, cgroups, namespaces, and system performance tuning.
- Infrastructure as Code: Mastery of declarative infrastructure tools (Terraform, Ansible) but with a focus on provisioning physical hardware rather than just cloud VMs.
- Problem Solving: A proven track record of debugging complex distributed systems where the root cause could be code, network, or silicon.
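As an illustration of the Operator/CRD work described above, here is a minimal sketch of a manifest for a hypothetical `TrainingJob` custom resource that hides hardware details from data scientists. The API group, kind, and field names are illustrative assumptions, not the company's actual API.

```python
# Hypothetical "TrainingJob" custom resource: the kind of CRD that
# abstracts GPU topology and partitioning away from data scientists.
# All group/kind/field names below are illustrative assumptions.

def training_job_manifest(name, workers, gpus_per_worker,
                          topology="nvlink-island"):
    """Build a manifest dict for a hypothetical TrainingJob CRD."""
    return {
        "apiVersion": "ai.example.com/v1alpha1",
        "kind": "TrainingJob",
        "metadata": {"name": name},
        "spec": {
            "workers": workers,
            "gpusPerWorker": gpus_per_worker,
            # A controller would translate this hint into node
            # affinity and gang-scheduling constraints.
            "topologyHint": topology,
        },
    }

manifest = training_job_manifest("llm-pretrain", workers=16,
                                 gpus_per_worker=8)
print(manifest["spec"]["gpusPerWorker"])  # 8
```

In practice a custom controller watches resources like this via the Kubernetes API server and reconciles them into pods, device allocations, and network attachments.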
Preferred Qualifications
- Multi-Tenant Cloud Development: Proven experience architecting and developing secure multi-tenant cloud environments using OpenStack.
- HPC Background: Experience working with traditional supercomputing schedulers (Slurm, PBS) or modern batch schedulers (Volcano, Kueue, Ray).
- Bare Metal Provisioning: Experience with OpenStack Ironic or similar tools like Cluster API (CAPI), Metal3, Tinkerbell, or Canonical MaaS.
- High-Speed Networking: Knowledge of RDMA, InfiniBand, GPUDirect, and how to expose these technologies to containerized workloads.
- AI/ML Familiarity: Understanding of how distributed training works (e.g., PyTorch Distributed, Megatron-Bridge, DeepSpeed) and the infrastructure requirements of Large Language Models (LLMs).
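The core step of data-parallel distributed training mentioned above can be sketched in a few lines: each worker computes gradients on its own data shard, then the gradients are averaged across workers (the role NCCL's all-reduce plays across GPUs in frameworks like PyTorch Distributed or DeepSpeed). This is a pure-Python toy, not how the collective is actually implemented.

```python
# Toy sketch of the all-reduce-mean at the heart of data-parallel
# training: average per-worker gradients element-wise so every
# worker applies the same update. Illustration only.

def allreduce_mean(per_worker_grads):
    """Average gradient vectors element-wise across workers."""
    n = len(per_worker_grads)
    return [sum(vals) / n for vals in zip(*per_worker_grads)]

grads = [[1.0, 2.0], [3.0, 4.0]]  # two workers, two parameters
print(allreduce_mean(grads))  # [2.0, 3.0]
```

At cluster scale, the interesting infrastructure work is making this collective fast: ring or tree algorithms over NVLink and InfiniBand, with GPUDirect RDMA avoiding host-memory copies.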
Perks & Compensation
- Competitive Package: Indefinite contract, guaranteed equal pay, and a variable performance bonus.
- Immediate Incentives: Signing bonus and relocation package (where applicable).
- Support & Flexibility: Private health insurance, generous educational budget, hybrid working opportunity, and flexible hours.
- The Environment: Join a high-performance, collaborative team operating at pace on the absolute cutting edge of AI and deep-tech infrastructure.
Interested? Apply directly through LinkedIn, or send your CV to george@eu-recruit.com
By applying to this role you understand that we may collect your personal data and store and process it on our systems. For more information, please see our Privacy Notice: https://eu-recruit.com/wp-content/uploads/2024/07/European-Tech-Recruit-Privacy-Notice-2024.pdf