MLOps/AI Infrastructure Engineer

Centific • United States

Remote

AI Summary

Design and implement AI infrastructure for on-premises, edge, and cloud environments. Manage GPU clusters, Kubernetes, and networking. Collaborate with AI/ML engineers and client teams.

Key Highlights

Manage GPU clusters, Kubernetes, and networking

Collaborate with AI/ML engineers and client teams

Implement AI infrastructure for on-premises, edge, and cloud environments

Key Responsibilities

Deploy, configure, and maintain on-premises GPU servers

Implement and tune NVIDIA-specific tooling

Manage bare-metal provisioning workflows

Monitor hardware health, capacity utilization, and thermal/power envelopes

Build, upgrade, and maintain production-grade Kubernetes clusters

Design and operate cluster networking

Configure and manage MetalLB or equivalent bare-metal load balancing

Implement MLOps pipelines and AI workload management

Deploy and operate MLOps platforms

Configure and manage NVIDIA Triton Inference Server

Optimize GPU utilization for batch training jobs and latency-sensitive inference services

Manage model artifact storage and versioning

Design and implement the high-bandwidth network fabric required for GPU cluster interconnects

Deploy and operate software-defined storage solutions

Configure network segmentation, VLANs, and firewall policies

Establish and maintain VPN or secure tunneling solutions

Implement infrastructure controls mapped to NIST SP 800-171 and CMMC requirements

Maintain hardened OS baselines

Produce and maintain infrastructure documentation

Technical Skills Required

NVIDIA GPU infrastructure

Kubernetes administration

Networking fundamentals

Software-defined storage

MLOps experience

NIST SP 800-171 controls

Infrastructure-as-code tooling

Linux systems administration

Benefits & Perks

Competitive, globally benchmarked compensation

Fully remote with async-first culture

Periodic travel to client facilities and team on-sites

Access to cutting-edge NVIDIA hardware

Comprehensive healthcare, dental, and vision coverage

401(k) plan

Paid time off (PTO)

Job Description

MLOps/AI Infrastructure Engineer

Remote — On-site availability at client facilities may be required

Full-time with Centific

About the Role

Our Vision AI platform runs where the data is generated — on-premises, inside government facilities, and at the network edge — not in a hyperscaler cloud. That means the infrastructure has to be bulletproof: GPU clusters provisioned correctly, Kubernetes workloads scheduled efficiently across heterogeneous compute, storage performing at the throughput AI training and inference demands, and the network capable of handling high-bandwidth, low-latency sensor data at scale.

As our MLOps / AI Infrastructure Engineer, you will own all of it. You will rack, configure, and operate the on-premises compute and GPU infrastructure that powers the platform, build and maintain the Kubernetes clusters that orchestrate AI workloads, design the networking fabric that ties edge nodes to core compute, and implement the MLOps pipelines that take models from development to production. You will work directly with our AI/ML engineers, the Lead Architect, and on-site client technical teams to ensure the platform runs reliably in environments that are often air-gapped, physically secured, and subject to strict government compliance requirements.

Key Responsibilities

GPU Compute & Hardware Infrastructure

Deploy, configure, and maintain on-premises GPU servers — primarily NVIDIA H200 and A100 nodes — including driver management, CUDA toolkit versioning, NVLink/NVSwitch topology, and firmware updates.
Implement and tune NVIDIA-specific tooling: DCGM (Data Center GPU Manager) for health monitoring and telemetry, MIG (Multi-Instance GPU) partitioning for multi-tenant workloads, and NVIDIA Container Toolkit for GPU-aware containerization.
Manage bare-metal provisioning workflows (iPXE, PXE, or tools such as MAAS/Foreman) to enable repeatable, auditable server builds at client sites.
Monitor hardware health, capacity utilization, and thermal/power envelopes; define alerting thresholds and respond to hardware failures with minimal service disruption.

Kubernetes & Container Orchestration

Build, upgrade, and maintain production-grade Kubernetes clusters (kubeadm or Rancher RKE2) on bare-metal infrastructure, with GPU node pools configured via the NVIDIA GPU Operator.
Design and operate cluster networking using CNI plugins appropriate for high-throughput AI workloads — Calico, Cilium, or SR-IOV for RDMA-capable networking where required.
Configure and manage MetalLB or equivalent bare-metal load balancing, ingress controllers, and service mesh components (Istio or Linkerd) for secure intra-cluster communication.
Implement resource quotas, LimitRanges, PriorityClasses, and node affinity/taints to ensure AI training jobs, inference services, and platform workloads coexist without resource contention.
Maintain cluster security posture: RBAC policies, Pod Security Admission, network policies, secrets management (HashiCorp Vault or Sealed Secrets), and CIS Kubernetes Benchmark compliance.

MLOps Pipelines & AI Workload Management

Deploy and operate MLOps platforms (MLflow, Kubeflow, or equivalent) for experiment tracking, model versioning, and pipeline orchestration across training and inference workloads.

Configure and manage NVIDIA Triton Inference Server for multi-model serving, dynamic batching, and model ensemble execution on GPU nodes.
Build CI/CD pipelines for model deployment (GitOps with ArgoCD or Flux), including automated model validation, canary rollouts, and rollback mechanisms.
Optimize GPU utilization for both batch training jobs (Volcano or Kueue scheduler) and latency-sensitive inference services, tracking efficiency metrics via DCGM and Prometheus.
Manage model artifact storage and versioning using software-defined storage backends (Ceph RBD/CephFS or MinIO) integrated with the MLOps toolchain.

Networking & Storage Architecture

Design and implement the high-bandwidth network fabric required for GPU cluster interconnects — InfiniBand, RoCE v2, or high-speed Ethernet — and ensure RDMA is correctly configured for distributed training workloads.
Deploy and operate software-defined storage solutions (Ceph or equivalent) providing block, object, and file storage tiers for training datasets, model checkpoints, and platform telemetry.
Configure network segmentation, VLANs, and firewall policies to meet NIST SP 800-171 requirements in on-premises and air-gapped environments; document network topology for client system security plans.
Establish and maintain VPN or secure tunneling solutions for hybrid connectivity between edge nodes, on-premises clusters, and any permitted cloud services.

Security, Compliance & Documentation

Implement infrastructure controls mapped to NIST SP 800-171 and CMMC requirements: access control, audit logging, configuration management, incident response readiness, and media protection.
Maintain hardened OS baselines (RHEL/Rocky STIG or Ubuntu CIS benchmarks) across all infrastructure nodes; automate compliance scanning with OpenSCAP or equivalent.
Produce and maintain infrastructure documentation required for government procurement: network diagrams, hardware inventories, system security plan (SSP) contributions, and disaster recovery runbooks.
Support penetration testing engagements by providing accurate infrastructure context and remediating findings within agreed timelines.

Required Qualifications

6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production.
Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes.
Production-level Kubernetes administration experience on bare-metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations.
Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management.
Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts — performance tuning, capacity planning, and failure recovery.

Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines.
Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence.
Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds.
Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management.
Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment.

What We Offer

Hands-on ownership of some of the most demanding AI infrastructure in the public sector — H200 GPU clusters, high-bandwidth interconnects, and purpose-built on-premises deployments.
A technically rigorous environment where your infrastructure decisions directly affect the reliability of mission-critical government operations.
Competitive, globally benchmarked compensation including base salary, equity, and performance bonus.
Fully remote with async-first culture; periodic travel to client facilities and team on-sites for cluster deployments and planning.
Access to cutting-edge NVIDIA hardware, early access to new GPU generations, and budget for relevant certifications (NVIDIA, CKA/CKS, RHCSA, etc.).
Collaboration with a Lead Architect and engineering team who understand infrastructure as a product — not just a cost center.

Benefits:

Comprehensive healthcare, dental, and vision coverage
401(k) plan
Paid time off (PTO)
And more!

Learn more about us at centific.com.

Centific is an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, citizenship status, age, mental or physical disability, medical condition, sex (including pregnancy), gender identity or expression, sexual orientation, marital status, familial status, veteran status, or any other characteristic protected by applicable law. We consider qualified applicants regardless of criminal histories, consistent with legal requirements.

Job Overview

Posted Date Mar 04, 2026

Employment Type Full-time

Experience Level Mid-Senior level

Location United States

Category Devops

Company Centific
