Senior Platform Engineer, HPC Scheduling

gtn technical staffing • United States

Relocation

AI Summary

Design, build, and scale high-performance compute platforms for large-scale research, machine learning, and batch workloads. Develop and operate distributed systems, infrastructure automation, and HPC scheduling environments. Work with Kubernetes, Go, Rust, C++, and Python.

Key Responsibilities

Design, develop, and maintain high-quality platform software using Go, Rust, C++, Python, or similar systems-level programming languages

Build scalable, reliable, and globally distributed systems that support large-scale research and ML workloads

Contribute to the development and enhancement of Kubernetes-based scheduling platforms, including Armada

Technical Skills Required

Kubernetes, Go, Rust, C++, Python, PostgreSQL, Apache Kafka, Prometheus, Grafana, Slurm, Armada, Volcano, Kueue

Benefits & Perks

Base salary: $170,000 – $250,000

100% company-paid benefits

Performance bonus

Nice to Have

Production experience with Go, especially in Kubernetes, infrastructure, or distributed systems environments

Experience with Armada, Slurm, Volcano, Kueue, or similar scheduling technologies

Job Description

Senior Platform Engineer, HPC Scheduling

Location: Dallas, TX | Relocation available for non-local candidates

Type: Direct Hire

• Base salary: $170,000 – $250,000 + performance bonus

• 100% company-paid benefits

Overview

We are seeking a Senior Platform Engineer, HPC Scheduling to help design, build, and scale a high-performance compute platform supporting large-scale research, machine learning, and batch workload execution.

This role sits on an HPC Scheduling team responsible for developing and operating distributed compute systems that enable complex research workloads to run efficiently across Kubernetes-based environments. The team is focused on advancing batch scheduling, multi-cluster orchestration, and scalable infrastructure for advanced ML and compute-intensive workloads.

A major focus of this role will be working on an open-source CNCF project used to support multi-cluster Kubernetes batch job scheduling at scale. This is a hands-on platform engineering role for someone who enjoys building production-grade software, working deeply with Kubernetes, and solving complex infrastructure challenges in high-scale environments.

The ideal candidate brings strong software engineering experience, deep Kubernetes platform knowledge, and the ability to operate across distributed systems, infrastructure automation, and HPC scheduling environments. While Go/Golang is preferred, candidates with strong production engineering experience in Rust, C++, or Python will also be considered.

Key Responsibilities

Platform Engineering & Software Development

• Design, develop, and maintain high-quality platform software using Go/Golang, Rust, C++, Python, or similar systems-level programming languages

• Build scalable, reliable, and globally distributed systems that support large-scale research and ML workloads

• Contribute to the development and enhancement of Kubernetes-based scheduling platforms, including Armada

• Develop and maintain Kubernetes components such as controllers, operators, custom resources, and internal platform services

• Apply strong software architecture, computer science fundamentals, and data structure knowledge to guide technical design and code quality

Kubernetes, Scheduling & Distributed Systems

• Build and operate containerized applications within Kubernetes environments

• Support advanced workload orchestration, scheduling, and batch processing across multi-cluster environments

• Work with HPC, Kubernetes, DAG-based workflows, and job scheduling systems such as Slurm

• Improve scheduling efficiency, workload placement, resource utilization, and platform reliability

• Partner with engineering and research teams to support complex compute and ML workload requirements

Infrastructure, Data & Operations

• Manage and optimize data interactions across relational and non-relational systems, particularly PostgreSQL

• Support Linux-based systems as part of the core compute and scheduling platform

• Apply networking fundamentals to troubleshoot, optimize, and improve platform connectivity and performance

• Diagnose and resolve complex issues across software, infrastructure, Kubernetes, and distributed systems layers

• Operate systems at scale in cloud environments, ideally AWS

Observability, Automation & Best Practices

• Build and improve CI/CD pipelines, release processes, and platform engineering workflows

• Implement and support observability practices using tools such as Prometheus, Grafana, and logging platforms

• Work with event-driven systems and message queues such as Apache Kafka, Pulsar, or similar technologies

• Drive continuous improvement across reliability, scalability, automation, and engineering standards

• Stay current with emerging technologies in Kubernetes, HPC, batch scheduling, and distributed systems

Required Qualifications

• Strong software engineering background with hands-on experience developing production systems in Go/Golang, Rust, C++, Python, or similar programming languages

• Experience developing Kubernetes components such as controllers, operators, or custom resources

• Experience building, operating, or supporting distributed systems at scale

• Strong working knowledge of Kubernetes, containers, Linux, and cloud infrastructure

• Experience with batch computing, workload scheduling, HPC, or DAG-based workflow systems

• Experience with PostgreSQL or similar relational database technologies

• Familiarity with message queues or event-driven platforms such as Kafka, Pulsar, or similar tools

• Experience with observability tools such as Prometheus, Grafana, logging systems, and operational dashboards

• Ability to independently troubleshoot complex technical issues across infrastructure and application layers

• Strong understanding of software design principles, data structures, and computer science fundamentals

Preferred Qualifications

• Production experience with Go/Golang, especially in Kubernetes, infrastructure, or distributed systems environments

• Experience with Armada, Slurm, Volcano, Kueue, or similar scheduling technologies

• Experience supporting ML, AI, research, or high-throughput compute workloads

• Experience operating large-scale Kubernetes environments across multiple clusters

• Experience with AWS or another major cloud provider

• Background contributing to open-source infrastructure, platform, or CNCF projects

• Experience with performance tuning, reliability engineering, and large-scale systems optimization

Ideal Profile

The ideal candidate is a hands-on platform engineer with strong software development skills, deep Kubernetes experience, and a strong interest in batch scheduling, HPC, and distributed systems. This person should be comfortable building production-grade software in Go, Rust, C++, or Python, operating Linux and Kubernetes environments at scale, and solving complex scheduling and infrastructure challenges for high-performance research and ML workloads.

Job Overview

Posted Date: May 09, 2026

Employment Type: Full-time

Experience Level: Mid-Senior level

Location: United States

Annual Salary: 170,000 - 250,000 USD

Category: Programming

Company: gtn technical staffing
