Collaborate with a top academic research lab to evaluate and improve terminal-based AI agents. Analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration. Contribute high-quality reference solutions and diagnostic insights to improve agent performance metrics.
Job Description
Role Overview
We are collaborating with a top academic research lab focused on advancing AI agents in real-world system environments. We are seeking high-performing software engineers based in Five Eyes countries to rigorously evaluate and improve terminal-based agents through the Terminal-Bench 2.0 benchmark suite. This is a short-term, high-intensity contract, ideal for engineers with deep systems-level expertise and a passion for hands-on problem-solving. Because the tasks are complex, high engagement and consistent weekly availability are critical.
Key Responsibilities
- Systematically analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration
- Evaluate agent outputs for correctness, reproducibility, and reliability across complex multi-step CLI workflows
- Provide detailed, evidence-based reasoning grounded in code structure and terminal behavior
- Synthesize information across files and configurations to assess end-to-end architecture
- Contribute high-quality reference solutions and diagnostic insights to improve agent performance metrics
Technical Skills Required
- 2+ years of hands-on experience at top-tier tech companies, quant firms, or elite startups
- Bachelor’s or Master’s in Computer Science or related field from a top 50–100 global university
- Deep familiarity with terminal workflows, Linux environments, and shell scripting
- Strong knowledge of Docker, Git, Python, and distributed systems concepts
- Demonstrated ability to trace, debug, and explain complex system behaviors across multiple files
- Commitment to intellectual honesty, clarity, and rigorous methodology
- Submit your resume and brief experience summary
- Qualified applicants will be invited to complete a short-form technical assessment
- We typically follow up within 3–5 business days with next steps
- You will be engaged as an independent contractor.
- This is a fully remote role that can be completed on your own schedule.
- Projects can be extended, shortened, or concluded early depending on needs and performance.