Design and optimize ML systems, build repeatable execution patterns, and troubleshoot scalability and reliability bottlenecks. Requires expertise in distributed systems and infrastructure as code, along with systems fluency. Collaborate with cross-functional teams to bridge hardware capabilities and developer experience.
Job Description
*** This role is not a remote job; it is on-site. You must be willing to relocate for this role. ***
The Mission
We are representing a pioneering AI Infrastructure & Cloud Services firm dedicated to dismantling the barriers of large-scale AI innovation. Our client is creating seamless, resilient, and secure environments for the world’s builders.
The Role
As a Senior Machine Learning Engineer, you’ll be architecting and operating the core systems that power massive-scale distributed training and inference. You will sit at the intersection of workload orchestration, cluster operations, and performance engineering.
Core Responsibilities
- Architect Infrastructure: Design and optimize ML systems that support massive distributed training and high-concurrency inference workloads.
- Orchestration & Reliability: Build repeatable execution patterns across shared, high-density compute environments.
- Performance Engineering: Troubleshoot and resolve complex scalability and reliability bottlenecks.
- Cross-Functional Partnership: Collaborate with Systems and Platform teams to bridge the gap between hardware capabilities and developer experience.
Technical Profile
- Expertise in Distributed Systems: Proven experience managing ML workloads across large-scale clusters.
- Infrastructure as Code: Proficiency in orchestrating GPU-accelerated environments (Kubernetes, SLURM).
- Systems Fluency: Deep understanding of the Linux networking stack, drivers, and low-level performance tuning.
- Scale Mindset: Experience solving problems that only emerge when moving from "handfuls of devices" to massive, warehouse-scale compute.
Benefits
- Competitive industry salary
- RSUs
- 100% paid insurance plans (medical, dental, vision)
- PTO
- Paid Holidays
- 401(k)
- Paternity/Maternity Leave
- FSA
- STDI
- Life Insurance
- Mental Health Support
Why This Role?
This isn't a "maintenance" job. Our client is solving problems that haven't been solved yet. You will be pushed daily to innovate and build the infrastructure that will define the next decade of AI development.
Agency Note: This position is exclusive and requires relocation, because the company's rapid growth demands an on-site presence. We prioritize candidates who demonstrate a "builder" mindset and a deep commitment to scaling a promising company.
*** This role is not a remote job. You must be willing to relocate for this role. ***