Job Description
Distributed Systems Engineer
Introduction: Join our mission-driven team as a Distributed Systems Engineer. This role focuses on building data and coordination systems that enable ultra-long context inference and training on our GPU clusters. We aim to create safe AGI to accelerate progress on critical global issues by automating research and code generation.
About Us: We are dedicated to developing safe AGI by leveraging frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and test-time compute. Our approach aims to automate the improvement of models and to solve alignment challenges more reliably than human effort alone. We value integrity, hands-on collaboration, teamwork, focus, and delivering high-quality results.
About the Role: As a Distributed Systems Engineer, you will be responsible for developing high-performance storage and caching systems to support long-context inference and training. You will work on the internals of deep learning frameworks in distributed settings, automate fault detection and recovery systems, and troubleshoot complex issues across GPUs, networks, storage, OS, and cloud environments.
What We Can Offer You:
- Competitive annual salary range
- Significant equity as part of total compensation
- 401(k) plan with 6% salary matching
- Comprehensive health, dental, and vision insurance for you and your dependents
- Unlimited paid time off
- Option to work in-person in SF or remotely
- Visa sponsorship and relocation stipend to SF
Key Responsibilities:
- Build and maintain high-performance storage and caching systems
- Develop and optimize the internals of deep learning frameworks in a distributed environment
- Automate fault detection and recovery to maintain high availability
- Troubleshoot and resolve complex issues across GPUs, networks, storage, OS, and cloud environments
- Collaborate with a small, focused team to achieve our mission
Relevant Keywords: AGI, GPU Clusters, Python, TypeScript, Go, Rust, C++, LLM, Large Language Model, Scalable Software Design, TensorFlow, Artificial Intelligence, Deep Learning, Statistical Modeling, Algorithms