Job Description
Job Title: AI & ML Systems Engineer (also referred to as Inference Engineer)
Location:
Remote – Work From Home (fully flexible)
Job Timing:
Part-Time, flexible hours
About the Role:
We are building a next-generation cloud platform for serving LLMs, vision, audio, and other multimodal machine learning models at large scale. As a Machine Learning Engineer focused on Inference & Systems, you will help design, optimize, and scale runtime systems, OpenAI-style APIs, and distributed GPU pipelines to enable fast and cost-efficient inference and fine-tuning.
You'll work with frameworks such as vLLM, TensorRT-LLM, and TGI to build and optimize distributed inference engines that serve text, vision, and multimodal models with high throughput and low latency. This includes deploying models such as LLaMA 3 and Mistral, as well as diffusion, ASR, TTS, and embedding models, while working on GPU optimization, accelerator utilization, and software–hardware co-design for large-scale, fault-tolerant systems.
This position sits at the intersection of machine learning, systems engineering, and cloud infrastructure. You’ll focus on low-latency inference, high-throughput deployments, and cost-optimized model serving pipelines. It’s an excellent opportunity to help shape the future of AI inference infrastructure and production-grade deployment systems.
If pushing the limits of AI inference excites you, we’d love to hear from you.
Key Responsibilities:
- Deploy and maintain LLMs (e.g., LLaMA 3, Mistral) and ML models using engines like vLLM, TGI, TensorRT-LLM, or FasterTransformer.
- Design and implement large-scale distributed inference systems for text, image, and multimodal workloads, including LLMs.
- Implement and optimize distributed inference strategies such as expert parallelism (MoE), tensor parallelism, and pipeline parallelism.
- Integrate frameworks such as vLLM, TGI, SGLang, and FasterTransformer.
- Build and scale OpenAI-compatible API layers for customer-facing access (see the serving sketch after this list).
- Experiment with caching, quantization, and parallelism to lower inference costs.
- Optimize GPU memory usage, batching, and latency for high-throughput serving.
- Apply CUDA graph optimizations, TensorRT-LLM, Triton kernels, torch.compile, quantization, speculative decoding, and related techniques.
- Work with GPU cloud providers (RunPod, Vast.ai, AWS, GCP, Azure) to manage cost and availability.
- Develop runtime inference services and APIs for LLMs, multimodal models, and fine-tuning workflows.
- Build monitoring and observability using metrics like latency, throughput, and GPU utilization (Grafana, Prometheus, Loki, OpenTelemetry).
- Collaborate with backend and DevOps teams to ensure secure and reliable APIs.
- Document deployment processes and support engineers using the platform.
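To give a concrete flavor of this kind of work, here is a minimal sketch of an OpenAI-style completions endpoint backed by vLLM. The model name is only a placeholder and the endpoint is heavily simplified; a production deployment would use vLLM's async engine (or its bundled OpenAI-compatible server) together with request batching, streaming, and authentication.

```python
# Minimal sketch of an OpenAI-style completions endpoint backed by vLLM.
# Assumptions: vLLM and FastAPI are installed, a single GPU is available,
# and the model id below is a placeholder. Batching, streaming, auth, and
# the async engine are intentionally omitted for brevity.
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model


class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128
    temperature: float = 0.7


@app.post("/v1/completions")
def completions(req: CompletionRequest):
    params = SamplingParams(temperature=req.temperature, max_tokens=req.max_tokens)
    # vLLM handles batching internally; here we pass a single prompt through.
    outputs = llm.generate([req.prompt], params)
    return {
        "object": "text_completion",
        "choices": [{"index": 0, "text": outputs[0].outputs[0].text}],
    }
```

In practice you would run a service like this behind uvicorn and a load balancer; the sketch only illustrates the request/response shape of an OpenAI-compatible layer.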
Requirements:
- 3+ years of experience in deep learning inference, distributed systems, or HPC.
- Proven experience deploying ML/LLM models in production.
- Hands-on work with vLLM, TGI, SGLang, TensorRT-LLM, FasterTransformer, or Triton.
- Experience designing large-scale inference or serving pipelines.
- Strong understanding of GPU memory, batching, distributed inference, CUDA/Triton/TensorRT, quantization, and GPU scheduling.
- Experience with PyTorch, HF Transformers, and GPU-accelerated inference workflows.
- Deep understanding of Transformer models, KV cache systems (Mooncake, PagedAttention, etc.), and inference optimizations for long-context serving.
- Comfortable with GPU cloud platforms (AWS/GCP/Azure) or marketplaces (RunPod, Vast.ai, TensorDock).
- Skilled at benchmarking and tuning multi-GPU clusters (see the benchmarking sketch after this list).
- Experience building REST or gRPC services (FastAPI, Flask, etc.).
- Strong in Python, Go, Rust, C++, or CUDA.
- Solid systems engineering knowledge (multi-threading, networking, performance tuning).
- Familiarity with Docker and Kubernetes.
- Strong debugging/problem-solving across ML + infra stack.
- Understanding of distributed storage systems (Ceph, HDFS, 3FS).
- Knowledge of datacenter networking concepts (RDMA, RoCE).
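As a taste of the benchmarking side of the role, below is a minimal single-GPU latency/throughput measurement using PyTorch and Hugging Face Transformers. The model id is a placeholder; a real benchmark would sweep batch sizes, sequence lengths, quantization settings, and multi-GPU configurations.

```python
# Minimal sketch of a latency/throughput measurement for GPU inference with
# Hugging Face Transformers. Assumptions: a CUDA GPU, torch and transformers
# installed, and a placeholder model id.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"  # decoder-only models pad on the left
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = ["Explain KV caching in one sentence."] * 8  # toy batch
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Count only newly generated tokens across the batch.
new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
print(f"latency: {elapsed:.2f}s  throughput: {new_tokens / elapsed:.1f} tok/s")
```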
Nice to Have:
- Experience with billing systems (Stripe or similar).
- Experience with Redis or Envoy for rate limiting (see the rate-limiting sketch after this list).
- Exposure to monitoring stacks (Grafana, Prometheus, Loki).
- Experience with MLOps pipelines or CI/CD (GitHub Actions, Azure DevOps).
- Experience with model fine-tuning pipelines and GPU scheduling.
- Prior experience at an AI infra company (Modal, Together.ai, Anyscale, Replicate, etc.).
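For the rate-limiting item above, a minimal fixed-window limiter with Redis (via redis-py) might look like the sketch below. The key scheme and limits are purely illustrative; production systems often prefer a token bucket or Envoy's rate-limit service instead.

```python
# Minimal sketch of fixed-window API rate limiting with Redis (redis-py).
# Assumptions: a local Redis instance and a hypothetical quota of 60
# requests per 60-second window per API key.
import time

import redis

r = redis.Redis(host="localhost", port=6379)


def allow_request(api_key: str, limit: int = 60, window_s: int = 60) -> bool:
    """Return True if this request is within its per-window quota."""
    window = int(time.time()) // window_s
    key = f"ratelimit:{api_key}:{window}"
    # INCR is atomic; the first increment creates the counter for this window.
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)  # old windows expire on their own
    return count <= limit


if __name__ == "__main__":
    print(allow_request("demo-key"))
```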
Why Join Us?
- Fully Remote – work from anywhere.
- Flexible Hours – complete freedom to choose your schedule.
- Fast-Track Growth – clear and rapid promotion pathways.
- Professional Development – mentoring, training resources, and exposure to advanced AI/cloud technologies.
- Global Team – work with an international and diverse engineering group.
- Innovative Environment – freedom to experiment with new tools and ideas.
- Competitive Growth & Incentives – rapid salary progression and strong performance bonuses.