Senior Platform Reliability Engineer

Jobgether • United States

Remote

AI Summary

Ensure stability, scalability, and performance of large-scale distributed systems. Define SLOs, lead incident response, and improve operational workflows. Collaborate with engineering teams to design fault-tolerant systems.

Key Highlights

Define and monitor SLOs

Lead incident response efforts

Design and implement observability solutions

Key Responsibilities

Define, monitor, and continuously improve service-level objectives (SLOs), SLIs, and error budgets

Lead incident response efforts, including acting as incident commander, coordinating resolution, and driving post-incident reviews

Design and implement observability solutions using modern tooling for monitoring, logging, tracing, and alerting

Build and maintain automation tools to eliminate operational toil and improve system efficiency and repeatability

Architect and operate Kubernetes-based infrastructure, including scaling, networking, and workload optimization

Develop and improve CI/CD pipelines to support safe, frequent, and reliable software delivery

Conduct capacity planning, performance engineering, and reliability testing, including load and chaos testing initiatives

Partner with engineering teams to embed reliability, security, and fault tolerance into system design from the outset

Improve system resilience through redundancy, failover strategies, and proactive dependency management

Mentor engineers and contribute to a strong operational excellence culture across the organization

Technical Skills Required

Python, Kubernetes, Observability tools (Prometheus, Grafana, OpenTelemetry, ELK/EFK)

Benefits & Perks

Competitive salary range of $135,000 - $185,000 annually

100% remote position within the United States

Full-time W2 employment with long-term stability

Job Description

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Platform Reliability Engineer based in the United States.

This role is focused on ensuring the stability, scalability, and performance of large-scale distributed systems that power critical business services.

You will operate at the intersection of software engineering and infrastructure, building automation and observability solutions that reduce operational friction and improve system resilience.

The position plays a key role in defining and maintaining service reliability standards, including SLOs and incident response practices.

You will work closely with engineering teams to design systems that are fault-tolerant, highly available, and production-ready from day one.

The environment is fast-paced and highly technical, emphasizing automation, continuous improvement, and strong engineering discipline.

You will also be responsible for improving deployment practices, monitoring systems, and operational workflows across production platforms.

This is a hands-on role where your work directly impacts uptime, customer experience, and engineering efficiency.

Accountabilities

Define, monitor, and continuously improve service-level objectives (SLOs), SLIs, and error budgets to guide reliability priorities.
Lead incident response efforts, including acting as incident commander, coordinating resolution, and driving post-incident reviews.
Design and implement observability solutions using modern tooling for monitoring, logging, tracing, and alerting.
Build and maintain automation tools to eliminate operational toil and improve system efficiency and repeatability.
Architect and operate Kubernetes-based infrastructure, including scaling, networking, and workload optimization.
Develop and improve CI/CD pipelines to support safe, frequent, and reliable software delivery.
Conduct capacity planning, performance engineering, and reliability testing, including load and chaos testing initiatives.
Partner with engineering teams to embed reliability, security, and fault tolerance into system design from the outset.
Improve system resilience through redundancy, failover strategies, and proactive dependency management.
Mentor engineers and contribute to a strong operational excellence culture across the organization.

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field.
5+ years of experience in Site Reliability Engineering, DevOps, or production infrastructure roles.
Strong programming skills in Python, Go, or Java for automation and tooling development.
Hands-on experience operating Linux-based production systems at scale, including networking and performance tuning.
Proven experience managing Kubernetes environments and containerized workloads in production.
Strong knowledge of observability stacks such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or similar tools.
Experience building CI/CD pipelines and supporting production deployment workflows.
Solid understanding of distributed systems concepts, including fault tolerance and system consistency.
Demonstrated experience in incident management and production troubleshooting.
Strong communication skills with the ability to collaborate across engineering and operations teams.
Experience with cloud platforms (AWS, Azure, or GCP) and familiarity with reliability engineering practices such as SLOs or chaos engineering are a plus.

Benefits

Competitive salary range of $135,000 - $185,000 annually.
100% remote position within the United States.
Full-time W2 employment with long-term stability.
Opportunity to work on large-scale distributed systems and mission-critical infrastructure.
Exposure to modern cloud, Kubernetes, and observability ecosystems.
Career growth through technical leadership, mentoring, and ownership of reliability strategy.
Inclusive and equal opportunity workplace culture.

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials based on available information. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

Job Overview

Posted Date: Jul 04, 2026

Employment Type: Full-time

Experience Level: Not Applicable

Location: United States

Category: Programming

Company: Jobgether
