Build and optimize production infrastructure for diffusion-based language models, focusing on serving, compilation, and deployment reliability. Own end-to-end systems from model execution to cloud rollout across AWS and Azure. Requires deep technical ownership and cross-functional collaboration with researchers and engineering leadership.
Machine Learning Systems Engineer
Palo Alto, CA · On-site · Full-time
$200K–$300K base + competitive equity
The company
The company is building diffusion-based language models that generate tokens in parallel instead of one at a time.
That architecture is designed to reduce latency and cost while preserving quality, and it has already moved beyond research.
The team launched Mercury, the first commercially available diffusion LLM (dLLM), in early 2025 and is now deploying large-scale diffusion LLMs at Fortune 500 companies.
The company has raised $56M and operates as a small, deeply technical team of about 20 people in Palo Alto.
This is not a lab demo. The product is already being used in enterprise settings, which makes production performance, deployment reliability, and systems quality first-order problems.
The role
This is a machine learning systems role for someone who wants ownership of the infrastructure around model performance: serving, compilation, optimization, benchmarking, deployment, and reliability.
You will work directly with researchers and engineering leadership to move models from implementation to production systems that are measurable, reproducible, and fast enough for real customers.
The scope is broad enough to feel staff-level in practice. The hard problems are the ones that decide whether the model is usable in the real world: throughput, latency, memory use, hardware efficiency, rollout safety, and operational stability.
The technical problem
Diffusion LLMs change the inference problem.
You are not just serving a model. You are making a new architecture run efficiently across GPUs, runtimes, and cloud environments while preserving output quality and deployment reliability.
The challenge is to connect research code to production systems without losing the performance characteristics that make the architecture valuable in the first place.
That means the work spans model execution, runtime optimization, infrastructure, and evaluation rather than only training or only serving.
What you'll own
• Model serving infrastructure: build and improve the systems that serve diffusion LLMs in production, with attention to latency, throughput, and stability.
• Performance optimization: work across CUDA, TensorRT, ONNX Runtime, vLLM, and SGLang to reduce bottlenecks and improve hardware utilization.
• Deployment pipelines: make model rollout reproducible across Kubernetes and cloud environments, with safe release and rollback mechanisms.
• Benchmarking and evaluation: build measurement systems that separate real model gains from runtime noise and infrastructure effects.
• Systems debugging: trace failures across Python, containerized services, GPU execution, and orchestration layers.
• Scaling infrastructure: help adapt the stack as customer load grows and model requirements evolve across AWS and Azure.
• Cross-functional execution: work closely with researchers to turn model changes into production-ready performance improvements.
Who this is for
You are likely a strong fit if you have:
• Built production ML infrastructure or inference systems where latency, throughput, and cost are explicit design constraints.
• Strong judgment around GPU utilization, memory pressure, batching, and runtime tradeoffs.
• Experience with PyTorch, CUDA, serving runtimes, or deployment stacks that sit between model code and production traffic.
• Comfort reading profiles, tracing bottlenecks, and turning ambiguous performance issues into concrete fixes.
• Shipped systems where correctness, reproducibility, and operational reliability mattered as much as raw speed.
• The ability to work directly with researchers and translate model behavior into systems decisions.
• Experience operating in environments where requirements change as quickly as the model stack.
• Enough range to move from code-level debugging to infrastructure design without handoff overhead.
Tech stack
• Serving and optimization: vLLM, TensorRT, ONNX Runtime, SGLang
• Modeling and training: PyTorch, TensorFlow
• GPU and systems: CUDA, Docker, Kubernetes
• Infrastructure: Python, AWS, Azure, Kubeflow
The stack is broad because the work sits across research, inference, deployment, and cloud infrastructure. The best candidates will understand where each layer creates leverage and where it becomes a bottleneck.
Why now
The company has already proven the core idea with a commercial product and enterprise deployments.
The next problem is not whether the model works in principle. It is whether the system can serve real demand with predictable performance, stable rollouts, and a runtime stack that keeps up with model progress.
This is the point where systems engineering matters most: the architecture decisions made now will shape how efficiently the product can scale across customers and hardware generations.
This role is not for you if
• You want a narrowly scoped feature role with clean handoffs.
• You prefer working only on model research and do not want systems ownership.
• You are uncomfortable debugging across GPU, runtime, container, and orchestration layers.
• You do not want to work on-site 5 days per week in Palo Alto.
• You need strict process separation between research, infrastructure, and product execution.
Compensation and logistics
• Base salary: $200K–$300K
• Equity: competitive
• Location: Palo Alto, CA
• Work model: on-site, 5 days per week in Palo Alto
• Visa support: available
• Employment: full-time
Interview process
Typical process:
• Intro call — 20 min: background, scope, and fit.
• Technical coding rounds: engineering depth and problem-solving.
• Panel with founders: onsite-style format, usually held remotely.
• References: final stage.
About Aurora
Aurora helps exceptional engineers find the right role at some of the most ambitious startups worldwide.
We work with teams that expect high ownership, technical depth, and direct accountability.