Site Reliability Engineer (SRE) - Observability and Reliability

Optomi • United States
Remote
This job is no longer active. This position is no longer accepting applications.
AI Summary

Design, scale, and optimize observability solutions using Prometheus, Grafana, and Dynatrace. Apply SRE principles to improve application reliability and drive reliability maturity across multi-team environments.

Key Highlights
  • Design and manage Prometheus and Grafana environments.
  • Apply SRE principles to improve application reliability.
  • Drive reliability maturity across multi-team environments.
Technical Skills Required
Prometheus, Grafana, PromQL, Dynatrace, Kubernetes, cloud platforms (AWS/GCP/Azure), CI/CD pipelines

Job Description


Site Reliability Engineer

*6-12 month contract*

Fully Remote


Optomi, in partnership with one of our premier clients in the telecommunications industry, is seeking a highly skilled Site Reliability Engineer (SRE) with strong observability expertise, proven communication skills, and the ability to drive reliability maturity across multi-team environments. This role is ideal for someone who can blend deep technical proficiency with strategic thinking and collaborative influence.


Key Responsibilities

Observability Engineering

  • Design, scale, optimize, and manage Prometheus and Grafana environments.
  • Write advanced PromQL queries, dashboards, visualizations, and metric-based calculations.
  • Build out and maintain Grafana instances, supporting multi-team use cases.
  • Leverage Dynatrace with strong proficiency in metrics and analytics to deliver efficient, actionable observability solutions for engineering and operations teams (e.g., dashboards, insights, reports).
  • Analyze telemetry data to identify the metrics that matter (MTM), drive actionable insights, and influence engineering decisions.


Site Reliability Engineering

  • Apply and evolve an SRE Maturity Model to help teams mature across observability, resilience, automation, and reliability.
  • Establish, implement, and maintain Service Level Objectives (SLOs) and error budgets across applications and services.
  • Partner effectively with engineering, product, operations, and leadership teams; translate complex technical insights into clear, actionable communication.
  • Identify and reduce toil through automation, tooling improvements, and process refinement.
  • Support incident analysis, reliability reviews, and continuous improvement initiatives.


Required Skills & Experience

  • Familiarity with SRE principles, maturity models, and reliability roadmaps.
  • Demonstrated experience improving application reliability via data-driven decisions.
  • Hands-on experience with Prometheus, Grafana, and PromQL.
  • Strong understanding of Dynatrace, metric analysis, and observability practices.
  • Excellent communication skills and ability to collaborate across diverse technical and non-technical teams.
  • Strong analytical and problem-solving skills with a bias for action.


Nice to Have

  • Experience with Kubernetes, cloud platforms (AWS/GCP/Azure), or CI/CD pipelines.
  • Experience with automation.
  • Experience with large-scale distributed systems or high-availability architectures.
