Machine Learning Manager - AI Cloud Infrastructure

European Tech Recruit • Germany
Relocation
This job is no longer active. This position is no longer accepting applications.

Job Description


Machine Learning Manager | AI Cloud Infrastructure | Kubernetes | AI / Quantum Computing Start-up


Location: Munich, Germany (Hybrid)


Join a well-funded, fast-growing deep-tech company at the forefront of innovation. The company is recognised as a leader in quantum software within Europe and has been highlighted as one of the most promising AI companies globally. Its vibrant, multicultural, international team is expanding rapidly and currently comprises over 150 dedicated professionals.


We are looking for an experienced Senior Systems Engineer with a strong background in Cloud Infrastructure to lead a critical initiative within the Platform Engineering team.


This role sits squarely at the intersection of High-Performance Computing (HPC), Kubernetes Internals, and Bare Metal Systems Engineering.


Key Responsibilities

  • Building the Control Plane: Design and develop the core software layer (APIs, Controllers, Agents) that automates the entire lifecycle of bare-metal AI infrastructure.
  • Massive-Scale Orchestration: Architect sophisticated scheduling solutions for distributed training jobs across enormous GPU clusters (NVIDIA H200/B200/B300), focusing on efficient bin-packing and gang scheduling.
  • Optimising the Fabric: Deeply tune the software-defined networking layer to ensure ultra-low-latency interconnects (InfiniBand/RDMA/RoCEv2) crucial for multi-node training performance.
  • Kubernetes Customisation: Extend Kubernetes by developing custom Operators and CRDs to abstract complex hardware realities (topology awareness, GPU partitioning) for data scientists.
  • Deep Systems Debugging: Troubleshoot and resolve complex issues at the kernel and hardware level, from PCIe bus errors and NCCL timeouts to kernel panics on bare-metal nodes.
  • Defining Standards: Create the "Golden Image" for AI workloads, mastering driver, firmware, and OS optimisations to extract maximum performance from the hardware.


Required Skills

  • Software Engineering Expertise: 10+ years of software engineering experience. Strong proficiency in Python is a must. Experience with Go (Golang) is a plus. You must be comfortable building system agents, APIs, and CLI tools.
  • OpenStack Expertise: Deep architectural and operational knowledge of OpenStack. You must be proficient with core components (Nova, Neutron, Keystone, Glance) and understand how to manage large-scale deployments.
  • Deep Kubernetes Knowledge: You understand K8s internals beyond simple deployment. Experience with Custom Resource Definitions (CRDs), Operators, and the Kubernetes API server architecture.
  • GPU Ecosystem Experience: Hands-on experience managing NVIDIA GPU clusters. Familiarity with NVIDIA drivers, CUDA toolkit, and the container runtime (NVIDIA Container Toolkit).
  • Linux Internals: Deep understanding of the Linux kernel, cgroups, namespaces, and system performance tuning.
  • Infrastructure as Code: Mastery of declarative infrastructure tools (Terraform, Ansible), with a focus on provisioning physical hardware rather than just cloud VMs.
  • Problem Solving: A proven track record of debugging complex distributed systems where the root cause could be code, network, or silicon.


Preferred Qualifications

  • Multi-Tenant Cloud Development: Proven experience architecting and developing secure multi-tenant cloud environments using OpenStack.
  • HPC Background: Experience working with traditional supercomputing schedulers (Slurm, PBS) or modern batch schedulers (Volcano, Kueue, Ray).
  • Bare Metal Provisioning: Experience with OpenStack Ironic or similar tools like Cluster API (CAPI), Metal3, Tinkerbell, or Canonical MaaS.
  • High-Speed Networking: Knowledge of RDMA, InfiniBand, GPUDirect, and how to expose these technologies to containerised workloads.
  • AI/ML Familiarity: Understanding of how distributed training works (e.g., PyTorch Distributed, Megatron-Bridge, DeepSpeed) and the infrastructure requirements of Large Language Models (LLMs).


Perks & Compensation

  • Competitive Package: Indefinite contract, guaranteed equal pay, and a variable performance bonus.
  • Immediate Incentives: Signing bonus and relocation package (where applicable).
  • Support & Flexibility: Private health insurance, generous educational budget, hybrid working opportunity, and flexible hours.
  • The Environment: Join a high-performance, collaborative team operating at pace on the absolute cutting edge of AI and deep-tech infrastructure.


Interested? Apply directly through LinkedIn, or send your CV to george@eu-recruit.com.


By applying to this role, you understand that we may collect your personal data and store and process it on our systems. For more information, please see our Privacy Notice: https://eu-recruit.com/wp-content/uploads/2024/07/European-Tech-Recruit-Privacy-Notice-2024.pdf

