Site Reliability Engineer (SRE) for GPU Infrastructure Data Centres

tgs international group • United Kingdom
Remote
AI Summary

We are seeking a Site Reliability Engineer (SRE) to ensure the end-to-end validation, testing, and readiness of GPU compute clusters prior to production release. The role requires strong hands-on experience administering and troubleshooting Linux systems, as well as proficiency in Python for automation and test execution. The SRE will work closely with global infrastructure and engineering teams to maintain the quality, stability, and integrity of high-performance compute environments.

Key Responsibilities
Cluster validation and testing
Orchestration and benchmarking
Test framework and automation
Remediation and system integrity
Documentation and handover
Team collaboration and training
Technical Skills Required
Linux systems administration
Python for automation and test execution
Ansible playbooks
Benefits & Perks
Fully remote work
Company hardware provided
Nice to Have
Experience working with GPU-based or high-performance compute environments
Familiarity with workload schedulers
Understanding of data centre hardware lifecycle and server validation processes

Job Description


Site Reliability Engineer (SRE) – GPU Infrastructure Data Centres

Fully Remote Role - Work from home


The Site Reliability Engineer (SRE) is responsible for the end-to-end validation, testing, and readiness of GPU compute clusters prior to production release. The role ensures that all hardware, networking, and system components meet operational and reliability standards before customer workloads are deployed.

Working closely with global infrastructure and engineering teams, the SRE plays a critical role in maintaining the quality, stability, and integrity of high-performance compute environments.


Key Responsibilities


Cluster Validation & Testing

  • Validate GPU clusters of varying sizes to ensure hardware and system integrity prior to production release
  • Perform functional and reliability testing of GPUs, servers, and associated components
  • Verify network connectivity and performance, including high-speed interconnects where applicable


Orchestration & Benchmarking

  • Provision and configure GPU clusters using automated workflows
  • Execute and analyse performance and stability benchmarks orchestrated via a workload scheduler
  • Validate results against expected performance and reliability thresholds


Test Framework & Automation

  • Maintain and extend the automated validation framework built using Python and Ansible
  • Integrate new test cases to support additional hardware platforms and GPU generations
  • Improve test reliability, coverage, and execution efficiency


Remediation & System Integrity

  • Diagnose and remediate unhealthy nodes through configuration changes or software fixes
  • Coordinate with on-site support teams for hardware replacements when required
  • Ensure all issues are resolved and documented prior to handover to production operations


Documentation & Handover

  • Produce clear, accurate documentation of test results, hardware states, and remediation actions
  • Ensure smooth handovers to operations and engineering teams
  • Maintain up-to-date runbooks and validation procedures


Team Collaboration & Training

  • Work as part of a distributed, international infrastructure and engineering team
  • Participate in knowledge sharing, process improvement, and technical reviews
  • The working language is English; additional language skills are beneficial


Shift & Availability Requirements

  • Ability to work independently within a remote environment
  • Reliable internet connection and suitable home working setup
  • Role is fully remote; company hardware will be provided


Skills & Experience

Essential

  • Strong hands-on experience administering and troubleshooting Linux systems
  • Confident use of CLI tools for diagnostics, including analysis of kernel logs, drivers, and system services
  • Proven experience writing and maintaining Ansible playbooks
  • Proficiency in Python for automation, test execution, and parsing results
  • Strong analytical and problem-solving skills with attention to detail
  • Excellent written and verbal English communication skills
  • High standards for system reliability, consistency, and documentation


Preferred / Desirable

  • Experience working with GPU-based or high-performance compute environments
  • Familiarity with workload schedulers (e.g. Slurm or similar tools)
  • Understanding of data centre hardware lifecycle and server validation processes
  • Exposure to high-speed networking technologies
  • Experience working with distributed or remote infrastructure teams


Performance & Success Metrics

  • Accuracy and completeness of cluster validation prior to production release
  • Reduction in post-deployment hardware or configuration issues
  • Quality and clarity of validation documentation and handover materials
  • Effectiveness of remediation and coordination with on-site teams
  • Reliability and maintainability of automated test frameworks
  • Collaboration and communication quality with engineering and operations teams

