DevOps Engineer for GPU-Accelerated Computing Platform

Drexel University • United States
Remote
AI Summary

Join Drexel University's University Research Computing Facility to build and operate a new shared computing platform focused on GPU-accelerated workloads. As a DevOps Engineer, you will develop automation, contribute to the Kubernetes platform layer, and troubleshoot issues across the stack. This is a grant-funded position with a fully remote work arrangement.

Key Highlights
GPU-accelerated computing platform
Kubernetes platform work
Troubleshooting and automation
Key Responsibilities
Develop and maintain automation for provisioning, configuring, and managing the cluster
Contribute to the Kubernetes platform layer
Troubleshoot issues across the stack
Technical Skills Required
Ansible, Warewulf, Kubernetes, Linux systems administration, Git, SSH, Python, Bash
Benefits & Perks
Grant-funded position
Fully remote work arrangement
Salary range $90,430.00 - $135,640.00 per year
Nice to Have
Experience with bare-metal provisioning or HPC cluster management
Familiarity with any of: Ansible, Warewulf, RKE2, Cilium, Kubeflow, Weka, iRODS, Globus, or infrastructure-as-code tools generally

Job Description


Job Summary

The University Research Computing Facility (URCF) at Drexel University is building a new shared computing platform focused on GPU-accelerated workloads, particularly AI model training. The system includes GPU and CPU compute nodes with NVIDIA H200, A100, and Grace Hopper hardware, orchestrated by Kubernetes on bare metal, as well as a 1 PB high-performance Weka storage cluster and a 3 PB S3-compatible archival storage system with iRODS as the metadata layer. The DevOps Engineer will help build and operate this platform, working alongside the URCF’s Research Computing Specialist and collaborators in Drexel IT.

The platform is under active development, and URCF, which comes from a more traditional HPC background, is itself in the process of adopting container-native tools and workflows. This means the role involves building new things, improving what exists, and navigating some institutional learning curves alongside us.

We Currently Use The Following Technologies

  • Ansible
  • Warewulf
  • Proxmox
  • Kubernetes (RKE2)
  • Cilium
  • Kyverno
  • Envoy
  • Kubeflow
  • Weka
  • iRODS
  • STORJ
  • Globus
  • Rocky Linux
  • Python
  • Bash

PLEASE NOTE: You don’t need experience with all of these. We include the list so you can get a sense of the environment.

This is a grant-funded position through September 1, 2027; employment is contingent upon the continued availability of those funds. It is fully remote. If you’re not sure whether you’re qualified, we’d encourage you to apply anyway.

Essential Functions

  • Develop and maintain automation for provisioning, configuring, and managing the cluster (Ansible, Warewulf, Kubernetes manifests, shell scripts).
  • Contribute to the Kubernetes platform layer, including networking, storage integration, security policies, and workload orchestration.
  • Help build out storage infrastructure, including iRODS and Globus/Globus Connect Server for data transfer, as well as the integrations between these systems and the compute cluster.
  • Troubleshoot issues across the stack, from bare-metal boot problems to container orchestration bugs.
  • Write and maintain operational and user-facing documentation.
  • Coordinate with Drexel’s IT teams on shared infrastructure concerns (networking, DNS, firewall rules, etc.).
  • Contribute to web application development for a user-facing portal for project management, permissions, and usage tracking.

Required Qualifications

  • Minimum of a Bachelor's Degree in Computer Science, Engineering, or a related field, or the equivalent combination of education and work experience (please review the Equivalency Chart for additional information).
  • Minimum of 1–3 years of experience.
  • Experience with infrastructure tooling such as Linux systems administration, configuration management, containers, or container orchestration.
  • Comfortable working in a terminal with tools like Git, SSH, and a text editor.
  • Working proficiency with at least one scripting language (Python, Bash, etc.).
  • Strong written communication skills.
  • Ability to work independently and manage your own time in a fully remote setting.

Preferred Qualifications

  • Experience with Kubernetes.
  • Experience with bare-metal provisioning or HPC cluster management.
  • Familiarity with any of: Ansible, Warewulf, RKE2, Cilium, Kubeflow, Weka, iRODS, Globus, infrastructure-as-code tools generally.
  • Web application development experience (any stack).
  • Experience in an academic or research computing environment.

Physical Demands

  • Typically sitting at a desk/table
  • Lifting demands ≤ 25 lbs

Location

  • Remote

Additional Information

This position is classified as Exempt, grade M. Compensation for this grade ranges from $90,430.00 to $135,640.00 per year. Please note that the offered rate for this position typically aligns with the minimum to midrange of this grade, but it can vary based on the successful candidate’s qualifications and experience, department budget, and an internal equity review.

Applicants are encouraged to explore the Professional Staff salary structure and Compensation Guidelines & Policies for more details on Drexel’s compensation framework. For information about benefits, please review Drexel’s Benefits Brochure.

Special Instructions to the Applicant

Please make sure you upload your CV/resume and cover letter when submitting your application.

A review of applicants will begin once a suitable candidate pool is identified.

