AI Engineer responsible for owning model behavior in production and for evaluating and refining LLM-powered systems. Requires 3+ years of shipping AI in production at scale, direct ownership of model quality, and experience building evaluation sets.
Job Description
AI Engineer — Production LLM Systems & Evaluation
New York, NY or San Francisco, CA · Hybrid or Remote · Full-time
$210K–$350K base + competitive equity
The company
The company is building software that captures expert judgment for regulated industries, starting with financial services.
The first product is an AI-powered third-party risk management platform for financial institutions. It captures the compliance reasoning that normally lives in the heads of senior experts and turns it into software that can be deployed, measured, and improved over time.
The product already serves FDIC-insured banks. The business has gone from 0 to $10M ARR in less than a year, closed a $25M Series A, and has $40M in total contract value. It went from 0 to 5 live deployments in 45 days and is on track to hit 15 live deployments next month.
The team is 8 people, founded in 2023, and includes former regulators, heads of compliance and legal at fintechs, and experienced engineers. The company is backed by leading institutional investors.
The role
This is a hands-on AI engineering role for someone who wants to own model behavior in production.
The scope includes evaluation, workflow design, fine-tuning, release discipline, and turning customer feedback into product improvements.
You will be customer-facing. A big part of the job is working directly with subject matter experts and enterprise customers to understand what they mean by a correct answer, where the system fails, and whether the fix belongs in the model, the prompt, the retrieval layer, or the workflow itself.
The output of your work should not stop at a single deployment. Customer-specific solutions should be generalized back into the core learning library so the platform gets stronger with each new customer.
The technical problem
Model quality here is a systems problem, not a prompt problem.
The product has to produce grounded, defensible output from messy input, keep its behavior stable across edge cases, and make it easy for humans to review and trust what it returns.
The hard part is building a production system with measurable quality, controlled regressions, and clear feedback loops from real users.
Why now
The hard part has shifted from proving demand to making every deployment better than the last.
With live banks, rapid deployment velocity, and recurring enterprise feedback, the next constraint is model quality at scale: evaluation, reliability, and reuse across customers.
This is the point where engineering decisions become durable platform infrastructure.
What you'll own
- LLM-powered systems and agentic workflows: ship end-user experiences that are accurate, usable, and production-ready.
- Evaluation frameworks: build gold sets, scoring rubrics, regression tests, and release gates that catch quality issues before customers do.
- Model refinement: use fine-tuning, prompt iteration, and data-driven feedback to improve accuracy and consistency.
- Customer-facing iteration: work with SMEs and enterprise users to prototype, validate, and ship improvements quickly.
- Core learning library: generalize one-off wins into reusable platform primitives instead of leaving them as bespoke deployments.
- Production quality: keep the system observable, measurable, and stable as the product and customer base grow.
Who this is for
You are likely a strong fit if you have:
- 3+ years shipping AI in production at scale, with direct ownership of model quality.
- Built systems where offline evaluation, production behavior, and customer feedback all mattered.
- Owned more than integration work; you have been responsible for the model behavior itself.
- Experience building evaluation sets and using them to make release decisions.
- Comfort working directly with technical and non-technical stakeholders, including domain experts.
- Judgment about when to use rules, prompts, retrieval, fine-tuning, or workflow changes.
- Experience in environments where both false positives and false negatives have real cost.
- The ability to explain technical tradeoffs clearly and without hand-waving.
This role is not for you if
- You want to stay in prototype mode and avoid production ownership.
- You want a research-only role with no customer contact.
- You prefer narrow tickets and heavy specification before you start.
- You are not interested in evaluation rigor, reliability, or reuse.
- You optimize for novelty over repeatable quality.
Compensation and logistics
- Base salary: $210K–$350K
- Equity: competitive
- Location: New York, NY or San Francisco, CA
- Work model: hybrid or remote
- Employment: full-time
- Visa sponsorship: available
About Aurora
Aurora helps exceptional engineers find the right role at some of the most ambitious startups worldwide.
We work with teams that value high ownership, strong technical standards, and clear scope.