Senior AI Infrastructure Engineer (GPU Clusters)

Doghouse Recruitment • United States
Remote
AI Summary

We are looking for a highly skilled Senior AI Infrastructure Engineer to build automation and lifecycle systems for a global GPU-cluster fleet. As a key member of the team, you will design and develop backend services and automation tooling in Python and work with cutting-edge NVIDIA hardware. The role calls for a strong ownership mindset and clear communication skills.

Key Highlights
Build automation and lifecycle systems for a global GPU-cluster fleet
Design and develop backend services and automation tooling in Python
Work with cutting-edge NVIDIA hardware
Key Responsibilities
Design and develop backend services and automation tooling in Python
Build and maintain provisioning, testing, and lifecycle management systems for physical hardware
Integrate with Linux systems using shell scripting and low-level tooling
Technical Skills Required
Python, Linux, shell scripting, CI/CD pipelines, NoSQL databases, ARM64 architectures
Benefits & Perks
Up to 250K base salary
Remote work within US/Canada
Nice to Have
Experience at large infrastructure scale or with ARM platforms in production
Background in infrastructure automation, internal platform tooling, or open-source systems software

Job Description


AI Infrastructure Engineer - GPU Clusters - Up to 250K Base - Remote


This position is open to candidates working remotely in the United States or Canada.


Our client is a cloud technology company driving the next generation of AI infrastructure. They empower organizations to build and scale AI and ML solutions without the need for large in-house teams or heavy upfront infrastructure costs. Their global team of engineers works at the forefront of GPU cloud computing, supporting businesses across industries to solve complex, real-world problems.


The company operates with a flat structure, minimal bureaucracy, and a strong focus on ownership, speed, and technical excellence. Engineers work closely with customers and internal teams to design scalable solutions and influence product direction, creating direct impact on how modern AI platforms are built and operated.


The Role

They are looking for someone to build the automation and lifecycle systems that power a global, large-scale GPU-cluster fleet. This is a hands-on engineering role at the intersection of software and physical infrastructure.


You will work with cutting-edge NVIDIA hardware that most engineers never get close to, and you will help design systems that are often redesigned within weeks, because that is the pace. If you thrive in environments where speed, autonomy, and real engineering ownership matter, this role is for you.


Responsibilities

  • Design and develop backend services and automation tooling in Python
  • Build and maintain provisioning, testing, and lifecycle management systems for physical hardware, including software that runs directly on bare-metal environments
  • Integrate with Linux systems using shell scripting and low-level tooling, and implement CI/CD pipelines for infrastructure-focused software
  • Work across networking layers (IPv4/IPv6, DHCP, DNS, network boot) and interface with hardware management controllers and their protocols
  • Design NoSQL data stores for system state and orchestration
  • Support ARM64 architectures and contribute clear documentation and operational excellence across large machine fleets


What You'll Bring

  • 10+ years of professional experience
  • Strong Python engineering experience with a solid Linux and shell scripting background
  • Hands-on familiarity with bare-metal servers, networking fundamentals, and hardware management interfaces and APIs
  • Experience with CI/CD pipelines and NoSQL databases
  • The ability to debug complex issues spanning software, hardware, and networks
  • A strong ownership mindset and clear communication skills in a distributed team


Nice to Have

  • Experience at large infrastructure scale, with ARM platforms in production, or in hardware testing and factory provisioning
  • A background in infrastructure automation, internal platform tooling, or open-source systems software


Interview Process

  1. Preliminary interview
  2. Technical coding interview
  3. Final technical deep dive


The Offer

  • Base salary up to 250K USD plus bonus and RSUs
  • Remote role within the US/Canada
  • No take-home assignment at any stage of the process

