Site Reliability Engineer

Jobgether • United States
Remote • Visa Sponsorship
AI Summary

Jobgether is seeking an experienced Site Reliability Engineer to ensure the stability, scalability, and performance of modern cloud-native systems. The ideal candidate will thrive in highly scalable, distributed environments and have a passion for operational excellence. This is a fully remote opportunity with long-term career growth and exposure to complex, enterprise-grade platforms.

Key Responsibilities
Define, implement, and continuously improve service reliability standards
Lead incident response efforts
Design and maintain observability frameworks
Develop automation tools and operational workflows
Architect, manage, and optimize Kubernetes-based infrastructure
Build and improve CI/CD pipelines
Technical Skills Required
Python, Go, Bash, Linux systems administration, Kubernetes, Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog
Benefits & Perks
100% remote work opportunity
Full-time direct W2 employment
Competitive base salary
Comprehensive employee benefits package

Job Description


This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer (SRE) in the United States.

This opportunity is ideal for an experienced reliability engineer who thrives in highly scalable, distributed environments and is passionate about operational excellence. In this role, you will play a critical part in ensuring the stability, scalability, and performance of modern cloud-native systems while collaborating closely with engineering and infrastructure teams. The position offers the chance to work on long-term, high-impact initiatives focused on automation, observability, resilience, and continuous delivery. You will help shape reliability standards, optimize production systems, and reduce operational overhead through engineering-driven solutions. The environment encourages innovation, technical leadership, and proactive problem-solving, making it a strong fit for professionals who enjoy balancing software engineering with systems operations. This is a fully remote opportunity within the United States, offering long-term career growth and exposure to complex, enterprise-grade platforms.

Accountabilities

  • Define, implement, and continuously improve service reliability standards through SLOs, SLIs, and error budget management for critical production services.
  • Lead incident response efforts, coordinate production issue resolution, and conduct detailed post-incident reviews to strengthen system resilience and operational maturity.
  • Design and maintain observability frameworks using monitoring, logging, and tracing tools such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or Datadog.
  • Develop automation tools and operational workflows using Python, Go, Bash, or similar technologies to eliminate repetitive manual tasks and improve system efficiency.
  • Architect, manage, and optimize Kubernetes-based infrastructure, including autoscaling, networking, capacity planning, and container orchestration.
  • Build and improve CI/CD pipelines that support safe deployments, automated testing, canary releases, and progressive rollout strategies.
  • Partner with development teams to embed reliability, fault tolerance, and graceful degradation practices early in the software design lifecycle.
  • Drive initiatives related to chaos engineering, performance testing, security hardening, failover readiness, and platform resiliency improvements.
  • Mentor engineers on SRE best practices while contributing to a collaborative, blameless culture focused on continuous improvement.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of professional experience in Site Reliability Engineering, DevOps, production engineering, or infrastructure-focused roles supporting distributed systems.
  • Strong programming and scripting experience with Python, Go, Java, Bash, or similar languages used for automation and tooling development.
  • Deep expertise in Linux systems administration, networking concepts, systems troubleshooting, and performance optimization.
  • Hands-on experience managing Kubernetes clusters and containerized production workloads at scale.
  • Strong understanding of observability practices and modern monitoring ecosystems including Prometheus, Grafana, OpenTelemetry, ELK/EFK, or equivalent platforms.
  • Experience designing and maintaining CI/CD pipelines and deployment automation processes.
  • Solid knowledge of distributed systems concepts, including reliability engineering, failure handling, partitioning, and scalability principles.
  • Proven experience leading incident management processes and conducting actionable post-mortem reviews.
  • Excellent communication, collaboration, and technical documentation skills.
  • Additional exposure to cloud platforms such as AWS, Azure, or GCP, along with service mesh technologies or chaos engineering practices, is highly valued.

Benefits

  • 100% remote work opportunity across the continental United States.
  • Full-time direct W2 employment with long-term project stability.
  • Competitive base salary aligned with experience and technical expertise.
  • Comprehensive employee benefits package, including healthcare coverage and additional employee perks.
  • Opportunity to work on multi-year engineering initiatives involving modern cloud-native technologies and enterprise-scale infrastructure.
  • Supportive environment focused on technical growth, mentorship, and career advancement.
  • Exposure to cutting-edge reliability, automation, and observability practices in a collaborative engineering culture.
  • H1B transfer support available for qualified candidates currently holding valid H1B status.
  • Flexible, remote-first work environment designed to support productivity and work-life balance.

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

