Senior ML Inference Optimization Engineer

coffeespace • San Francisco Bay Area
Visa Sponsorship • Relocation
AI Summary

Optimize large multimodal models for real-time video conversation. Collaborate with founders and researchers to ship the company's first major product. Work on frontier AI involving video, speech, and language.

Key Highlights
Optimize models for inference latency, throughput, and cost efficiency
Collaborate with founders and researchers to ship the first major product
Work on frontier multimodal AI involving video, speech, and language
Technical Skills Required
PyTorch, CUDA, GPU systems, inference optimization, quantization, pruning, distillation, TensorRT, ONNX, Triton, vLLM
Benefits & Perks
$250K-$450K base + equity
Visa sponsorship available
Relocation support available

Job Description


Job Title: ML Inference Optimization Engineer

Salary: $250K-$450K base + equity

Location: Seattle, WA (5 days in-office)

Visa sponsorship available

Relocation support available


Company Description

Venture-backed AI startup building real-time visual conversational AI.


The team is tackling one of the hardest problems in modern AI: creating systems that can understand and respond to human emotion, behavior, and conversation in real time through video.


Backed by over $60M in funding, the company is moving from research to production and assembling a small team of engineers who can bridge cutting-edge models with real-world deployment.


Job Description

Join a team of researchers and engineers building multimodal AI systems capable of real-time video conversation.


This role sits at the intersection of machine learning, systems engineering, and performance optimization. You'll own the infrastructure and tooling that allows large multimodal models to run efficiently in production, working across video diffusion models, LLMs, speech systems, and future foundation models.


This is not a model training role.


The focus is on making state-of-the-art models faster, cheaper, and production-ready through deep profiling, inference optimization, and GPU systems work.


You'll work directly with founders, researchers, and infrastructure engineers to help ship the company's first major product release.


Why this role is remarkable

  • Work on frontier multimodal AI involving video, speech, and language
  • Optimize systems where milliseconds directly impact user experience
  • Join a highly technical 14-person team backed by over $60M in funding
  • Own performance across the full stack from model architecture to deployment
  • Influence core technical decisions in a flat, founder-led organization


What you will do

  • Profile and optimize large multimodal models for inference latency, throughput, and cost efficiency
  • Identify bottlenecks using tools such as Nsight, PyTorch Profiler, CUDA debugging tools, and production observability systems
  • Apply acceleration techniques such as quantization, pruning, and distillation, using tooling such as TensorRT, ONNX, Triton, and vLLM
  • Build and maintain infrastructure that supports researchers from experimentation through deployment
  • Develop evaluation frameworks that measure performance, quality, and operational reliability
  • Collaborate with research teams on model architecture decisions that impact production performance


The ideal candidate

  • 2+ years of hands-on ML engineering experience working on production systems
  • Strong experience profiling and optimizing LLMs, diffusion models, or other large neural networks
  • Deep familiarity with PyTorch, CUDA, GPU systems, and modern inference tooling
  • Experience deploying and operating ML systems at scale rather than exclusively training models
  • Startup experience preferred, though exceptional candidates from larger companies are welcome
  • Strong ownership mentality and ability to move between systems, infrastructure, and modeling challenges


Less likely to be a fit

  • Candidates whose primary ML experience is fine-tuning models
  • Research-focused profiles without production ownership
  • Recent management or director-level candidates no longer working hands-on
  • Generalists without demonstrated depth in inference optimization or GPU systems


Next steps

  1. Apply through this LinkedIn posting.
  2. If there is a strong fit, we'll reach out directly with additional information and introductions to the hiring team.
  3. If this specific role isn't the best match, we may also suggest other high-signal startup opportunities aligned with your background and interests.


A quick note on authenticity

This is a real, active role that we are recruiting for in close partnership with the hiring team. We work directly with founders on their hiring needs and only represent active opportunities.

