Senior Software Engineer - AI Terminal Agent Evaluator

yo hr consultancy • United States
Remote
This job is no longer active. This position is no longer accepting applications.
AI Summary

Collaborate with a top academic research lab to evaluate and improve terminal-based AI agents. Analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration. Contribute high-quality reference solutions and diagnostic insights to improve agent performance metrics.

Key Highlights
Collaborate with a top academic research lab
Evaluate and improve terminal-based AI agents
Contribute high-quality reference solutions and diagnostic insights
Key Responsibilities
Systematically analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration
Evaluate agent outputs for correctness, reproducibility, and reliability across complex multi-step CLI workflows
Synthesize information across files and configurations to assess end-to-end architecture
Technical Skills Required
Docker
Shell scripting
Linux system administration
Python
Distributed systems
Benefits & Perks
Fully remote
Short-term, high-intensity contract
Independent contractor

Job Description


Role Overview

We are collaborating with a top academic research lab focused on advancing AI agents in real-world system environments. We're seeking high-performing software engineers based in Five Eyes countries to rigorously evaluate and improve terminal-based agents through the Terminal-Bench 2.0 benchmark suite. This is a short-term, high-intensity contract, ideal for engineers with deep systems-level expertise and a passion for hands-on problem-solving. Because the tasks are complex, high engagement and consistent weekly availability are critical.

Key Responsibilities

  • Systematically analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration
  • Evaluate agent outputs for correctness, reproducibility, and reliability across complex multi-step CLI workflows
  • Provide detailed, evidence-based reasoning grounded in code structure and terminal behavior
  • Synthesize information across files and configurations to assess end-to-end architecture
  • Contribute high-quality reference solutions and diagnostic insights to improve agent performance metrics

Ideal Qualifications

  • 2+ years of hands-on experience at top-tier tech companies, quant firms, or elite startups
  • Bachelor’s or Master’s in Computer Science or related field from a top 50–100 global university
  • Deep familiarity with terminal workflows, Linux environments, and shell scripting
  • Strong knowledge of Docker, Git, Python, and distributed systems concepts
  • Demonstrated ability to trace, debug, and explain complex system behaviors across multiple files
  • Commitment to intellectual honesty, clarity, and rigorous methodology

Application Process

  • Submit your resume and brief experience summary
  • Qualified applicants will be invited to complete a short-form technical assessment
  • We typically follow up within 3–5 business days with next steps

Contract and Payment Terms

  • You will be engaged as an independent contractor.
  • This is a fully remote role that can be completed on your own schedule.
  • Projects can be extended, shortened, or concluded early depending on needs and performance.

Skills: Docker, scripting, Linux, Bash, shell scripting
