Job Description
About the Company
Our client is a fast-scaling US-based Sovereign Neo Cloud provider, building a next-generation GPU-as-a-Service (GPUaaS) platform to support national-scale AI and HPC workloads. With a mission to deliver sovereign, high-performance compute infrastructure, the company is positioned at the forefront of the US AI and cloud ecosystem.
This is a remote role based in the US, working Eastern Time hours.
The Role
The Head of AI Infrastructure will lead the design, specification, and deployment of large-scale compute, storage, and networking resources to power the company’s GPUaaS platform. This individual will be accountable for the performance, scalability, and availability of GPU capacity supporting AI workloads at national scale.
This is a senior leadership position, requiring both deep technical expertise in HPC/AI infrastructure and the ability to shape strategy and execution for a critical area of the business.
Key Responsibilities
- Spearhead the design and deployment of compute, storage, and network infrastructure to support AI/ML and HPC workloads at scale.
- Build and optimise the company’s GPUaaS platform to deliver performance and availability for enterprise and national-scale use cases.
- Define infrastructure standards, roadmaps, and strategies aligned to the company’s sovereign cloud mission.
- Lead capacity planning, performance engineering, and infrastructure optimisation across GPU clusters.
- Partner with engineering, product, and operations teams to deliver reliable, scalable, and efficient AI infrastructure services.
- Evaluate emerging technologies, vendors, and approaches to strengthen the platform’s competitive edge.
- Represent the infrastructure function in customer, partner, and stakeholder engagements.
Candidate Profile
The ideal candidate will be a proven leader with a strong track record in designing and scaling HPC or AI infrastructure platforms. They will combine technical depth with strategic vision and be capable of building and leading world-class infrastructure teams.
Essential Experience
- Significant leadership experience in HPC, GPUaaS, or AI/ML infrastructure roles.
- Deep expertise in GPU cluster design, deployment, and optimisation (NVIDIA/AMD).
- Strong background in high-performance compute environments, including storage and high-speed networking (e.g., InfiniBand, NVIDIA Mellanox hardware).
- Hands-on knowledge of container orchestration and HPC workload scheduling platforms (e.g., Kubernetes, Slurm).
- Proven ability to deliver scalable infrastructure solutions with high availability and performance.
- Experience managing infrastructure for enterprise or national-scale AI/ML workloads.