Design and implement systems to measure and improve Computer Use Agents' performance. Define model quality, evaluation metrics, and system prompts. Work with the engineering team to translate product requirements into technical specifications.
Job Description
Think Different. Build the Future. 🚀
Our Mission
Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions. Software shouldn’t wait for commands; it should partner with you, amplifying what you can do every single day.
Why AGI, Inc.
We’re a stealth team of elite founders and AI researchers, with backgrounds spanning Stanford, OpenAI, and DeepMind. We’re industry leaders in mobile and computer-use agents, bringing these capabilities to consumer scale.
Grounded in years of agent research, our AI is designed with trustworthiness and reliability as core pillars, not afterthoughts.
We are supported by tier-1 investors who funded the first generation of AI giants; now they’re backing us to build the next: everyday AGI.
If you see possibility where others see limits, read on.
About The Role
As an Applied Scientist focused on Evaluation & Model Behavior, you will design and implement the systems used to measure and improve the performance of Computer Use Agents.
This is not a support role. You will be responsible for the technical definition of model quality, including the design of evaluation metrics, the curation of training datasets, and the engineering of system prompts. You'll work directly with the engineering team to translate product requirements into technical specifications and quantifiable benchmarks.
You'll focus on rigor, clarity, and impact, ensuring every metric, dataset, and prompt moves us toward more reliable, trustworthy agents.
What You'll Do
Evaluation Design: Define metrics for reasoning, tool usage, and safety, and validate these metrics against human judgment to ensure statistical rigor.
Data Strategy: Design algorithms to filter, score, and select training data. Write Python scripts to sanitize inputs and manage the training data lifecycle from raw logs to high-quality datasets.
Failure Analysis: Investigate regressions in model benchmarks. Diagnose root causes, distinguishing among data quality issues, prompt instruction failures, and underlying model capability gaps, and implement fixes.
Ground Truth Management: Define rubrics and guidelines for human annotation. Maintain reference datasets ("Golden Sets") to establish a consistent baseline for model performance evaluation.
Minimum Qualifications
- Master's degree or PhD in Computer Science, Data Science, Statistics, or a related technical field, or equivalent practical experience
- 3+ years of experience in Data Science, Machine Learning, or Applied Science
- Proficiency in Python, with experience writing production-quality code for data pipelines or evaluation harnesses
- Experience with experimental design, A/B testing, or statistical analysis
- Experience with Large Language Models (LLMs), including prompt engineering, fine-tuning, or RLHF workflows
- Experience building automated evaluation systems or implementing model-based evaluation frameworks
- Ability to translate product requirements into measurable technical metrics
- Experience managing human-in-the-loop data pipelines or annotation quality control
You can't improve what you can't measure. You can't ship what you can't trust.
You will set the technical definition of quality for our agents: the metrics that predict real-world success, the datasets that encode user intent, and the prompts that shape model behavior. Your work will directly determine how quickly we can iterate and how confidently we can ship.
Our Culture
🏢 All in, in person — work moves faster face-to-face
🤝 One band, one sound — radical candor, zero politics
Perks
🏥 Competitive company-sponsored medical, dental, and vision insurance
✈️ Top-tier relocation and immigration support
How To Apply
Send us:
- A link — or 60-second video — of something you built and why it matters
- Your resume or LinkedIn
- Two sentences on the hardest problem you've cracked
If you see possibility where others see limits, we'd love to meet you.