Join a well-funded, fast-growing deep-tech company as a Machine Learning Manager to lead a critical initiative within the Platform Engineering team. Design and develop the core software layer for AI infrastructure, architect sophisticated scheduling solutions, and deeply tune the software-defined networking layer.
Job Description
Machine Learning Manager | AI Cloud Infrastructure | Kubernetes | AI / Quantum Computing Start-up
Location: Munich, Germany (Hybrid)
Join a well-funded, fast-growing deep-tech company that's at the forefront of innovation. The company is recognised as a leader in quantum software within Europe and has been highlighted as one of the most promising AI companies globally. The vibrant, multicultural, and international team is rapidly expanding and currently comprises over 150 dedicated professionals.
We are looking for a skilled Senior Systems Engineer with strong experience in cloud infrastructure to lead a critical initiative within the Platform Engineering team.
This role sits squarely at the intersection of High-Performance Computing (HPC), Kubernetes Internals, and Bare Metal Systems Engineering.
Key Responsibilities
- Building the Control Plane: Design and develop the core software layer (APIs, Controllers, Agents) that automates the entire lifecycle of bare-metal AI infrastructure.
- Massive-Scale Orchestration: Architect sophisticated scheduling solutions for distributed training jobs across enormous GPU clusters (NVIDIA H200/B200/B300), focusing on efficient bin-packing and gang scheduling.
- Optimising the Fabric: Deeply tune the software-defined networking layer to ensure ultra-low-latency interconnects (InfiniBand/RDMA/RoCEv2) crucial for multi-node training performance.
- Kubernetes Customisation: Extend Kubernetes by developing custom Operators and CRDs to abstract complex hardware realities (topology awareness, GPU partitioning) for data scientists.
- Deep Systems Debugging: Troubleshoot and resolve complex issues at the kernel and hardware level, from PCIe bus errors and NCCL timeouts to kernel panics on bare-metal nodes.
- Defining Standards: Create the "Golden Image" for AI workloads, mastering driver, firmware, and OS optimisations to extract maximum performance from the hardware.
Required Skills
- Systems Programming Mastery (10+ Years): Proven software engineering experience with deep proficiency in Go (Golang), C++, or Rust. Must be adept at building robust system agents, APIs, and CLI tools.
- Kubernetes Internals: An understanding of K8s that goes beyond standard deployments. Experience with Custom Resource Definitions (CRDs), Operators, and the Kubernetes API server architecture is critical.
- GPU Ecosystem Expertise: Direct, hands-on experience managing and optimising NVIDIA GPU clusters, including familiarity with drivers, CUDA toolkit, and the NVIDIA Container Toolkit.
- Linux Deep Dive: Deep understanding of the Linux kernel, including cgroups, namespaces, and system performance tuning.
- Bare Metal IaC: Mastery of declarative infrastructure tools (Terraform, Ansible), specifically focused on provisioning and configuring physical hardware, not just public cloud VMs.
- Elite Problem Solving: A demonstrable track record of debugging complex, distributed systems where the root cause might be code, networking, or silicon.
Perks & Compensation
- Competitive Package: Indefinite contract, guaranteed equal pay, and a variable performance bonus.
- Immediate Incentives: Signing bonus and relocation package (where applicable).
- Support & Flexibility: Private health insurance, generous educational budget, hybrid working opportunity, and flexible hours.
- The Environment: Join a high-performance, collaborative team operating at pace on the absolute cutting edge of AI and deep-tech infrastructure.
Interested? Apply directly through LinkedIn, or send your CV to george@eu-recruit.com
By applying to this role, you understand that we may collect your personal data and store and process it on our systems. For more information, please see our Privacy Notice: https://eu-recruit.com/wp-content/uploads/2024/07/European-Tech-Recruit-Privacy-Notice-2024.pdf