Research Engineer for Large-Scale RL Training Infrastructure

Prime Intellect • United States

Visa Sponsorship • Relocation

AI Summary

We are seeking a Research Engineer to work on the systems layer behind large-scale RL training. This role involves building and optimizing the systems infrastructure, improving end-to-end training efficiency, and designing low-level performance optimizations. The ideal candidate has strong systems engineering experience in AI/ML infrastructure and deep familiarity with PyTorch and distributed training frameworks.

Key Highlights

Build and optimize systems infrastructure for large-scale RL training

Improve end-to-end training efficiency

Design low-level performance optimizations

Key Responsibilities

Build and optimize the systems infrastructure behind large-scale RL and distributed training workloads

Improve end-to-end training efficiency across compute, memory, networking, and scheduling layers

Design and implement low-level performance optimizations

Technical Skills Required

PyTorch

Distributed training frameworks (PyTorch Distributed, DeepSpeed, FSDP, Megatron, vLLM, Ray)

CUDA / Triton kernels

Compiler or runtime optimization for ML systems

Benefits & Perks

Competitive compensation, including equity

Flexible work arrangements

Visa sponsorship and relocation support for international candidates

Nice to Have

Experience writing or optimizing CUDA / Triton kernels

Experience with compiler or runtime optimization for ML systems

Experience working on RL training infrastructure, rollout systems, or asynchronous training pipelines

Job Description

Building Open Superintelligence Infrastructure

Prime Intellect is building the open superintelligence stack: from frontier agentic models to the infrastructure that enables anyone to train, adapt, and deploy them.

We unify globally distributed compute into a single control plane and pair it with the full reinforcement learning post-training stack: environments, secure sandboxes, verifiable evaluations, and our async RL trainer. We enable researchers, startups, and enterprises to run end-to-end RL at frontier scale, adapting models to real tools, workflows, and deployment environments.

We are looking for a Research Engineer to work on the systems layer behind large-scale RL training. This role is for someone who enjoys going deep on performance: optimizing kernels, improving memory and communication efficiency, scaling distributed workloads, and pushing the throughput and reliability of training systems closer to hardware limits.

If you care about making large-scale model training faster, cheaper, and more robust, we’d love to talk.

What You’ll Work On

Build and optimize the systems infrastructure behind large-scale RL and distributed training workloads.
Improve end-to-end training efficiency across compute, memory, networking, and scheduling layers.
Design and implement low-level performance optimizations, including kernels, communication paths, and runtime improvements.
Work on distributed training systems spanning data, tensor, and pipeline parallel workloads.
Help shape the architecture of our RL training stack, including async rollout and post-training systems.
Contribute to open-source libraries and internal infrastructure used for frontier-scale model training.
Collaborate closely with researchers and infrastructure engineers to translate bottlenecks into concrete systems improvements.
Stay at the frontier of training systems, inference systems, compiler/runtime tooling, and hardware-aware optimization techniques.

You May Be a Fit If You Have

Strong systems engineering experience in AI/ML infrastructure, especially around large-scale model training or inference.
Deep familiarity with PyTorch and distributed training frameworks such as PyTorch Distributed, DeepSpeed, FSDP, Megatron, vLLM, Ray, or related tooling.
Experience optimizing training performance across kernels, memory movement, communication overhead, or parallelization strategy.
Hands-on experience with large-scale training techniques including data parallelism, tensor parallelism, and pipeline parallelism.
Strong understanding of GPU architecture, profiling, and performance debugging.
Ability to identify bottlenecks across the stack and drive improvements from first principles.
Comfort working in a fast-moving environment with ambiguous problems and high ownership.

Especially Exciting

Experience writing or optimizing CUDA / Triton kernels.
Experience with compiler or runtime optimization for ML systems.
Experience working on RL training infrastructure, rollout systems, or asynchronous training pipelines.
Experience with multi-node GPU clusters and high-performance networking.
Contributions to open-source ML systems or infrastructure projects.
Interest in publishing technical work or sharing insights through engineering blogs and technical writing.

Why This Role Matters

The next frontier in AI will not be unlocked by models alone. It will be unlocked by systems that let those models train faster, adapt continuously, and operate across real environments at scale.

That infrastructure does not exist yet in the form the world needs.

We’re building it.

Benefits & Perks

Competitive compensation, including equity.
Flexible work arrangements, with the option to work remotely or in person from our San Francisco office.
Visa sponsorship and relocation support for international candidates.
Quarterly team offsites, hackathons, conferences, and learning opportunities.
A deeply technical, high-agency team working on infrastructure for open superintelligence.

If you’re excited about building the systems foundation for frontier-scale RL and open superintelligence, we’d love to hear from you.

Job Overview

Posted Date Mar 28, 2026

Employment Type Full-time

Experience Level Mid-Senior level

Location United States

Category Machine Learning

Company Prime Intellect
