AI Infrastructure/Platform Engineer

LanceSoft, Inc. • United States
Remote
AI Summary

Join LanceSoft, Inc. as an AI Infrastructure/Platform Engineer to build and operate large-scale GPU compute infrastructure. Key responsibilities include designing and delivering platform features, partnering with development teams, and applying expertise in storage and networking. The ideal candidate has experience in Platform, Infrastructure, or DevOps Engineering and hands-on experience with Kubernetes and container orchestration at scale.

Key Highlights
Build and extend platform capabilities
Design and operate scalable orchestration systems
Partner with development teams to extend the GPU developer platform
Key Responsibilities
Build and extend platform capabilities to enable different classes of workloads
Design and operate scalable orchestration systems using Kubernetes across both on-prem and multi-cloud environments
Develop platform features such as pre-flight health checks, job status monitoring and post-mortem analysis
Technical Skills Required
Kubernetes
Container orchestration
Platform Engineering
DevOps Engineering
GPU compute infrastructure
Storage and networking
Benefits & Perks
$100.00-$107.00/hr
6-month contract with possible extension
Hybrid work arrangement (onsite in San Jose, CA or 100% remote)
Nice to Have
Hands-on experience in storage or network engineering within Kubernetes environments
Experience with Infrastructure as Code tools like Terraform
Background in HPC, Slurm, or GPU-based compute systems for ML/AI workloads

Job Description


Pay Rate: $100.00/hr to $107.00/hr

Duration: 6 Months - possible extension

Location: San Jose, CA, onsite/hybrid (100% remote is also acceptable for a strong candidate)


THE ROLE:

We are seeking an AI Infrastructure / Platform Engineer to join our team building and operating large-scale GPU compute infrastructure that powers AI and ML workloads. The ideal candidate is passionate about software engineering and has the leadership skills to deliver multiple projects independently. They should communicate effectively and collaborate well with their peers within our larger organization.


THE PERSON:

Experience in Platform, Infrastructure, or DevOps Engineering.

Deep hands-on experience with Kubernetes and container orchestration at scale.

Proven ability to design and deliver platform features that serve internal customers or developer teams.

Experience building developer-facing platforms or internal developer portals (e.g., custom workflow tooling).


KEY RESPONSIBILITIES:

Build and extend platform capabilities to enable different classes of workloads (e.g., large-scale AI training and inference).

Design and operate scalable orchestration systems using Kubernetes across both on-prem and multi-cloud environments.

Develop platform features such as pre-flight health checks, job status monitoring and post-mortem analysis.

Partner with development teams to extend the GPU developer platform with features, APIs, templates, and self-service workflows that streamline job orchestration and environment management.

Apply expertise in storage and networking to design and integrate CSI drivers, persistent volumes, and network policies that enable high-performance GPU workloads.

Provide production support for large-scale GPU clusters.


PREFERRED EXPERIENCE:

Hands-on experience in storage or network engineering within Kubernetes environments (e.g., CSI drivers, dynamic provisioning, CNI plugins, or network policy).

Experience with Infrastructure as Code tools like Terraform.

Background in HPC, Slurm, or GPU-based compute systems for ML/AI workloads.

Practical experience with monitoring and observability tools (Prometheus, Grafana, Loki, etc.).

Understanding of machine learning frameworks (PyTorch, vLLM, SGLang, etc.).

High-performance networking and InfiniBand (IB)/RDMA tuning.


ACADEMIC CREDENTIALS:

Bachelor's or master's degree in computer science, computer engineering, electrical engineering, or equivalent.

