Senior Platform Reliability Engineer

Jobgether • United States
Remote
AI Summary

Ensure stability, scalability, and performance of large-scale distributed systems. Define SLOs, lead incident response, and improve operational workflows. Collaborate with engineering teams to design fault-tolerant systems.

Key Highlights
Define and monitor SLOs
Lead incident response efforts
Design and implement observability solutions
Technical Skills Required
Python, Kubernetes, observability tools (Prometheus, Grafana, OpenTelemetry, ELK/EFK)

Job Description


This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Platform Reliability Engineer based in the United States.

This role is focused on ensuring the stability, scalability, and performance of large-scale distributed systems that power critical business services.

You will operate at the intersection of software engineering and infrastructure, building automation and observability solutions that reduce operational friction and improve system resilience.

The position plays a key role in defining and maintaining service reliability standards, including SLOs and incident response practices.

You will work closely with engineering teams to design systems that are fault-tolerant, highly available, and production-ready from day one.

The environment is fast-paced and highly technical, emphasizing automation, continuous improvement, and strong engineering discipline.

You will also be responsible for improving deployment practices, monitoring systems, and operational workflows across production platforms.

This is a hands-on role where your work directly impacts uptime, customer experience, and engineering efficiency.

Accountabilities

  • Define, monitor, and continuously improve service-level objectives (SLOs), SLIs, and error budgets to guide reliability priorities.
  • Lead incident response efforts, including acting as incident commander, coordinating resolution, and driving post-incident reviews.
  • Design and implement observability solutions using modern tooling for monitoring, logging, tracing, and alerting.
  • Build and maintain automation tools to eliminate operational toil and improve system efficiency and repeatability.
  • Architect and operate Kubernetes-based infrastructure, including scaling, networking, and workload optimization.
  • Develop and improve CI/CD pipelines to support safe, frequent, and reliable software delivery.
  • Conduct capacity planning, performance engineering, and reliability testing, including load and chaos testing initiatives.
  • Partner with engineering teams to embed reliability, security, and fault tolerance into system design from the outset.
  • Improve system resilience through redundancy, failover strategies, and proactive dependency management.
  • Mentor engineers and contribute to a strong operational excellence culture across the organization.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of experience in Site Reliability Engineering, DevOps, or production infrastructure roles.
  • Strong programming skills in Python, Go, or Java for automation and tooling development.
  • Hands-on experience operating Linux-based production systems at scale, including networking and performance tuning.
  • Proven experience managing Kubernetes environments and containerized workloads in production.
  • Strong knowledge of observability stacks such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or similar tools.
  • Experience building CI/CD pipelines and supporting production deployment workflows.
  • Solid understanding of distributed systems concepts, including fault tolerance and system consistency.
  • Demonstrated experience in incident management and production troubleshooting.
  • Strong communication skills and the ability to collaborate across engineering and operations teams.
  • Experience with cloud platforms (AWS, Azure, or GCP) and familiarity with reliability engineering practices such as SLOs or chaos engineering are a plus.

Benefits

  • Competitive salary range of $135,000 - $185,000 annually.
  • 100% remote position within the United States.
  • Full-time W2 employment with long-term stability.
  • Opportunity to work on large-scale distributed systems and mission-critical infrastructure.
  • Exposure to modern cloud, Kubernetes, and observability ecosystems.
  • Career growth through technical leadership, mentoring, and ownership of reliability strategy.
  • Inclusive and equal opportunity workplace culture.

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials based on available information. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.

