Senior AI Systems Engineer

Botify Tech, Germany
Relocation
AI Summary

Botify Tech is seeking a Senior AI Systems Engineer to design and develop software for bare-metal AI infrastructure. The ideal candidate will have 7+ years of experience in software engineering and proficiency in Golang, C++, or Rust.

Key Highlights
Designing and developing software for bare-metal AI infrastructure
Developing software layer (APIs, Controllers, Agents) that automates the lifecycle of AI infrastructure
Managing NVIDIA GPU clusters and familiarity with NVIDIA Container Toolkit
Technical Skills Required
Golang, C++, Rust, Kubernetes, NVIDIA GPU clusters, NVIDIA Container Toolkit, Terraform, Ansible
Benefits & Perks
€60,000 - €85,000 salary
20% Bonus
Relocation Package (if applicable)
Signing Bonus
Hybrid work arrangement (3 days/week in office)

Job Description


Senior AI Systems Engineer - Munich

€60,000 - €85,000 + 20% Bonus + Relocation Package (if applicable) + Signing Bonus

Location: Munich (hybrid; must attend the office 3 days a week)


Botify Tech has partnered with one of the top 10 AI businesses in Europe, which is looking for a Senior AI Systems Engineer.


Key Skills & Experience Required

  • 7+ years of software engineering experience with strong proficiency in Golang, C++, or Rust.
  • Designing and developing the software layer (APIs, Controllers, Agents) that automates the lifecycle of bare-metal AI infrastructure.
  • Deep experience with Kubernetes (K8s) internals, beyond simple deployment.
  • Hands-on experience managing NVIDIA GPU clusters and familiarity with the NVIDIA Container Toolkit.
  • Writing custom Kubernetes Operators and CRDs to abstract complex hardware realities (topology awareness, GPU partitioning) into usable interfaces for our AI engineers.
  • Deep experience with Terraform and Ansible, with a focus on provisioning physical hardware rather than just cloud VMs.
  • Architecting scheduling solutions for large-scale distributed training jobs across massive clusters of GPUs (NVIDIA H200/B200/B300), ensuring efficient bin-packing and gang scheduling.
  • Tuning the software-defined networking layer to support the low-latency interconnects (InfiniBand/RDMA/RoCEv2) that are essential for multi-node training.
  • Investigating and resolving deep system issues, ranging from PCIe bus errors and NCCL communication timeouts to kernel panics on bare-metal nodes.
  • Creating the "Golden Image" for AI workloads, managing drivers, firmware, and OS optimizations to squeeze maximum performance out of the hardware.


For more info

If you feel this is the role for you, or you know someone suitable for it, email me at pinto@botifytech.com

