Senior Software Engineer - AI Terminal Agent Evaluator

yo hr consultancy • United States
Remote
AI Summary

Collaborate with a top academic research lab to evaluate and improve terminal-based AI agents. Analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration. Contribute high-quality reference solutions and diagnostic insights to improve agent performance metrics.

Key Highlights
Collaborate with a top academic research lab
Evaluate and improve terminal-based AI agents
Contribute high-quality reference solutions and diagnostic insights
Technical Skills Required
Docker, Shell scripting, Linux system administration, Python, Distributed systems
Benefits & Perks
Fully remote
Short-term, high-intensity contract
Independent contractor

Job Description


Role Overview

We are collaborating with a top academic research lab focused on advancing AI agents in real-world system environments. We're seeking high-performing software engineers based in Five Eyes countries to rigorously evaluate and improve terminal-based agents through the Terminal-Bench 2.0 benchmark suite. This is a short-term, high-intensity contract ideal for engineers with deep systems-level expertise and a passion for hands-on problem-solving. Due to the complexity of the tasks, high engagement and consistent weekly availability are critical.

Key Responsibilities

  • Systematically analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration
  • Evaluate agent outputs for correctness, reproducibility, and reliability across complex multi-step CLI workflows
  • Provide detailed, evidence-based reasoning grounded in code structure and terminal behavior
  • Synthesize information across files and configurations to assess end-to-end architecture
  • Contribute high-quality reference solutions and diagnostic insights to improve agent performance metrics

Ideal Qualifications

  • 2+ years of hands-on experience at top-tier tech companies, quant firms, or elite startups
  • Bachelor’s or Master’s in Computer Science or related field from a top 50–100 global university
  • Deep familiarity with terminal workflows, Linux environments, and shell scripting
  • Strong knowledge of Docker, Git, Python, and distributed systems concepts
  • Demonstrated ability to trace, debug, and explain complex system behaviors across multiple files
  • Commitment to intellectual honesty, clarity, and rigorous methodology

Application Process

  • Submit your resume and a brief summary of your experience
  • Qualified applicants will be invited to complete a short-form technical assessment
  • We typically follow up within 3–5 business days with next steps

Contract and Payment Terms

  • You will be engaged as an independent contractor.
  • This is a fully remote role that can be completed on your own schedule.
  • Projects can be extended, shortened, or concluded early depending on needs and performance.

Skills: Docker, scripting, Linux, Bash, shell scripting
