Senior AI Agent Engineer - Coding Agent Runtime

Aurora · United States
Remote · Visa Sponsorship
AI Summary

Own the core agent runtime that turns user intent into interactive mini-apps. Design and implement end-to-end orchestration, evaluation, and reliability systems for production AI workflows. Requires strong technical ownership, debugging skills, and experience shipping coding agents at scale.

Key Highlights
Owns agent runtime and orchestration for multi-step workflows
Builds evaluation frameworks and observability for quality measurement
Makes architecture decisions balancing quality, latency, and cost
Requires production experience with coding agents and long-horizon systems
Key Responsibilities
Own the agent runtime and orchestration layer coordinating planning, tool use, generation, validation, repair, and publishing
Design control flow for multi-step tasks with model failure recovery
Build evaluation harnesses, regression suites, failure taxonomies, and release gates
Develop model strategy for routing, retrying, benchmarking, and swapping models
Implement observability and debugging tools with logs, metrics, alerts, and replayable paths
Improve reliability by reducing failure rates and building graceful degradation systems
Improve product quality at scale for a user base of 1M+ monthly active users
Technical Skills Required
Agentic workflows, evaluation systems, Python, observability
Benefits & Perks
Full-time employment
Remote work
Competitive equity
H-1B visa sponsorship

Job Description




Remote (US) · Pacific or Central Time Zone · Full-time

$150K–$250K base + competitive equity



The company


The company is building a consumer social platform around interactive mini-apps: users browse a feed of playable experiences and can create their own by describing what they want.


The creation flow is AI-native. It turns natural language into shareable, interactive content that can be used immediately and published.


The product already has real scale, with 1M+ monthly active users.


The company is backed by a16z, Mayfield, and Khosla, and has raised $30M.


The team is small, around 30 people, with a founder/operator background in consumer social at ByteDance.



The role


This is the engineer who owns the core engine behind creation: the agent runtime that turns intent into working mini-apps.


You will set the technical bar for the system end to end: architecture, orchestration, evaluation, reliability, and model strategy.


This is a greenfield-scope role. The team built its own agent framework from scratch, so you are shaping the platform rather than inheriting a large legacy system.


The output is user-facing product quality, not a demo. The bar is whether the system can repeatedly produce correct, shareable experiences under real consumer usage.



The technical problem


The hard part is not getting a model to generate code once.


The hard part is building a system that can reliably handle ambiguous user intent, long-horizon reasoning, execution failures, validation loops, and publication, while staying debuggable, measurable, and cost-aware.


It also has to know when to trust the model and when to fall back to deterministic validation or rerouting.


The core workflow looks like this: prompt → plan → generate → run/validate → repair → publish.


That workflow has to work across messy edge cases: incomplete specs, failing builds, bad intermediate outputs, model regressions, and user-level quality problems that are hard to diagnose without strong tracing and evaluation infrastructure.
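To make the shape of that loop concrete, here is a minimal sketch in Python. It is an illustration only, not the company's actual framework: the stage functions (plan, generate, validate, repair), the RunResult and Outcome types, and the max_repairs bound are all hypothetical names chosen for this example. The point it shows is a bounded validate/repair cycle that records a trace at every step, so a failure can be inspected afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RunResult:
    """Result of running/validating generated code (hypothetical type)."""
    ok: bool
    error: Optional[str] = None


@dataclass
class Outcome:
    """What the pipeline returns: publish decision plus a step-by-step trace."""
    published: bool
    code: Optional[str]
    attempts: int
    trace: List[str] = field(default_factory=list)


def run_pipeline(
    prompt: str,
    plan: Callable[[str], str],
    generate: Callable[[str], str],
    validate: Callable[[str], RunResult],
    repair: Callable[[str, str], str],
    max_repairs: int = 2,  # assumption: a small fixed repair budget
) -> Outcome:
    trace = [f"prompt: {prompt}"]
    spec = plan(prompt)
    trace.append("plan")
    code = generate(spec)
    trace.append("generate")

    # Validate, and repair on failure, at most max_repairs times.
    for attempt in range(max_repairs + 1):
        result = validate(code)
        status = "ok" if result.ok else result.error
        trace.append(f"validate[{attempt}]: {status}")
        if result.ok:
            trace.append("publish")
            return Outcome(True, code, attempt + 1, trace)
        if attempt < max_repairs:
            code = repair(code, result.error or "")
            trace.append(f"repair[{attempt}]")

    # Repair budget exhausted: do not publish, but keep the trace for debugging.
    return Outcome(False, None, max_repairs + 1, trace)


def demo() -> Outcome:
    # Stub stages: the first build "fails" and a single repair fixes it.
    return run_pipeline(
        "a tap counter",
        plan=lambda p: f"spec for {p}",
        generate=lambda s: "broken",
        validate=lambda c: RunResult(c == "fixed", None if c == "fixed" else "build failed"),
        repair=lambda c, e: "fixed",
    )
```

In this sketch the deterministic check (validate) decides whether anything is published; the model-backed stages only propose. A real runtime would likely add timeouts, rerouting, and richer tracing on top of this skeleton.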



What you'll own


• Agent runtime and orchestration: own the execution layer that coordinates planning, tool use, generation, validation, repair, and publishing.

• Long-horizon workflows: design the control flow for multi-step tasks where the model has to recover from failure and still produce a usable result.

• Evaluation and quality loops: build eval harnesses, regression suites, failure taxonomies, and release gates that make quality measurable.

• Model strategy: choose when to route, retry, benchmark, or swap models based on reliability, task type, latency, and cost.

• Observability and debugging: make agent behavior traceable with logs, metrics, alerts, and replayable debugging paths.

• Reliability improvements: reduce failure rates, tighten feedback loops, and build systems that degrade gracefully when models or dependencies fail.

• Product quality at scale: improve the creation experience for a user base already in the millions, where small quality regressions become visible quickly.
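As one illustration of the model-strategy and graceful-degradation items above, the sketch below tries models from cheapest to most expensive, treats an exception as a failed attempt, and stops when a deterministic check passes or a cost budget runs out. Everything here is assumed for the example: ModelOption, route, the per-call cost figures, and the check callback are hypothetical and do not describe the company's actual routing logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ModelOption:
    """A candidate model (hypothetical): a callable plus a flat per-call cost."""
    name: str
    call: Callable[[str], str]
    cost_per_call: float


def route(
    task: str,
    options: List[ModelOption],
    check: Callable[[str], bool],
    budget: float,
) -> Tuple[Optional[str], List[str]]:
    """Try options cheapest-first until `check` passes or the budget is spent.

    Returns (output or None, names of the models that were tried).
    """
    tried: List[str] = []
    spent = 0.0
    for opt in sorted(options, key=lambda o: o.cost_per_call):
        if spent + opt.cost_per_call > budget:
            break  # degrade gracefully: stop instead of overspending
        tried.append(opt.name)
        spent += opt.cost_per_call
        try:
            output = opt.call(task)
        except Exception:
            continue  # a failing dependency counts as a failed attempt
        if check(output):
            return output, tried
    return None, tried


def _raise_timeout(task: str) -> str:
    raise TimeoutError("model unavailable")


# Stub models for illustration only.
OPTIONS = [
    ModelOption("large", lambda t: "good", 5.0),
    ModelOption("small", lambda t: "bad", 1.0),
    ModelOption("flaky", _raise_timeout, 2.0),
]
```

Cheapest-first is only one possible policy; routing by task type or by measured reliability per model would follow the same structure with a different sort key.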



Who this is for


You are likely a strong fit if you have:


• Owned production AI or ML systems end to end, not just a component of one.

• Built or operated coding agents, agentic workflows, or similar long-horizon systems in production.

• Strong judgment about when to rely on the model and when to enforce deterministic checks, validation, or fallback paths.

• Designed evaluation systems where quality is measured continuously, not debated qualitatively.

• Experience debugging failures from traces, logs, and metrics rather than from user anecdotes alone.

• Comfort making architecture calls that trade off quality, latency, and cost.

• Enough product instinct to reason about the user experience of creation, not just the internal machinery.

• The ability to move from vague product intent to a concrete technical plan without waiting for someone else to define the system.


Research experience is fine, but shipping production systems matters more here.



This role is not for you if


• You want narrowly scoped tasks with clear upstream specs.

• You prefer working on models in isolation rather than owning the runtime around them.

• You are uncomfortable being accountable for quality, latency, and cost at the same time.

• You do not want to build evals, instrumentation, and debugging tools as part of the job.

• You want a role where the core architecture is already settled.



Compensation and logistics


• Base salary: $150K–$250K

• Equity: competitive

• Employment: full-time

• Workplace: remote, US only

• Timezone: Pacific or Central Time Zone

• Cadence: daily 6pm standups, plus a Sunday evening standup timed to coincide with Monday morning in China

• Visa support: H-1B transfers; some new H-1B petitions considered case by case



What you should be able to explain


• Have you built a coding agent in production? Walk through the architecture.

• Do you have an eval framework for agentic systems? Show how it catches regressions.

• How do you debug a long-horizon workflow when the failure appears three steps after the root cause?

• How do you decide between model quality, latency, and cost when routing traffic or changing models?



About Aurora


Aurora helps exceptional engineers find the right role at some of the most ambitious startups worldwide.


We work with teams that care about high ownership, technical rigor, and clear scope.

