Head of Data Center Operations

Blue Signal Search, United States
Remote
This job is no longer active and is not accepting applications.

Job Description


Location: Remote (United States) – Preference for candidates based in the greater Bay Area


Our client is a well-funded, high-growth innovator delivering large-scale GPU compute for cutting-edge AI workloads. As demand accelerates, they are scaling multiple next-generation data-center clusters across the country. They are seeking a strategic, hands-on Head of Data Center Operations to safeguard the uptime, performance, and growth of this mission-critical infrastructure. If you thrive in hyperscale environments and enjoy building world-class operational teams, this role offers the opportunity to define the gold standard for GPU data-center reliability.


This Role Offers

  • Executive-level influence over a rapidly expanding GPU cloud platform.
  • Remote-first culture with high ownership, technical depth, and autonomy.
  • Direct impact on reliability engineering strategy during multi-megawatt capacity growth.
  • Competitive base salary, performance-based equity, and comprehensive benefits.
  • Chance to lead real-time operations at the forefront of AI infrastructure innovation.


Key Responsibilities

  • Direct the 24×7 operations of geographically distributed, high-density GPU data centers totaling tens of megawatts of compute capacity.
  • Establish and continuously improve monitoring, incident response, and change-management processes to ensure industry-leading uptime and performance.
  • Drive adoption of reliability-engineering best practices, creating playbooks, automation, and tooling that scale with rapid capacity growth.
  • Partner with hardware, facilities, and platform-engineering teams to optimize resource utilization, thermal efficiency, and service quality.
  • Manage vendor and colocation relationships, negotiating SLAs for power, cooling, and network connectivity.
  • Lead and mentor a global team of site-reliability engineers, NOC staff, and systems operators.
  • Oversee compliance programs covering security, disaster recovery, business continuity, and environmental regulations.
  • Analyze incidents and performance trends to identify systemic risks and implement preventive solutions.


Skill Set & Qualifications

  • 10+ years in data-center or large-scale infrastructure operations, including hyperscale, GPU, or HPC environments.
  • Proven track record operating live production workloads at 20 MW or greater total capacity.
  • Expert knowledge of observability, telemetry, and alerting systems for distributed infrastructure.
  • Familiarity with GPU workloads, thermal dynamics, and high-density rack design.
  • Exceptional incident-management and root-cause-analysis skills.
  • Demonstrated success building and scaling remote, globally distributed operations teams.
  • Startup or high-growth environment experience strongly preferred.


Ready to lead the next leap in AI infrastructure reliability? Apply today to explore how your experience can power the future of large-scale GPU compute.


About Blue Signal:

Blue Signal is an award-winning executive search firm serving a range of specialties. Our recruiters have a proven track record of placing top-tier talent across industry verticals, with deep expertise in numerous professional services. Learn more at bit.ly/46Gs4yS

