Jobgether is seeking a Product Reliability Engineer to ensure production systems remain stable, observable, and continuously improvable. The role involves partnering with customers and internal teams to investigate and resolve complex production issues. Strong software engineering fundamentals and hands-on Kubernetes expertise are required.
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Product Reliability Engineer based in Canada.
This role sits at the critical intersection of software engineering, customer reliability, and production operations for infrastructure software deployed in complex, real-world environments. You will ensure that production systems running in customer-owned Kubernetes environments remain stable, observable, and continuously improvable. The work goes beyond incident response, focusing on eliminating entire categories of failures through better tooling, automation, and product design. You will partner closely with customers, engineers, and solution teams to investigate complex issues, drive root-cause analysis, and translate findings into long-term system improvements. This is a highly hands-on role where debugging, automation, and product thinking come together to define reliability as a core product capability. Your work will directly shape how enterprise customers experience stability, performance, and trust in the platform.
Accountabilities
- Partner with customers and internal teams to investigate and resolve complex production issues across Kubernetes-based on-prem and hybrid deployments.
- Lead deep root-cause analysis for escalations, reproduce issues, and collaborate with engineering teams to implement durable fixes.
- Build and maintain reliability tooling such as diagnostics systems, health checks, support bundles, and environment validation utilities.
- Own and improve test automation frameworks, focusing on CI stability, reducing flaky tests, and strengthening integration and end-to-end coverage.
- Define and maintain performance baselines, regression testing frameworks, and reliability gates to prevent production regressions.
- Improve installation, upgrade, and deployment reliability by identifying recurring failure patterns and building preventive solutions.
- Develop production-grade internal tools and product enhancements using Python, Go, or Rust to strengthen observability and system resilience.
- Establish a closed feedback loop from customer issues to engineering improvements in testing, observability, documentation, and defaults.
Requirements
- 4-7 years of experience in production engineering, SRE, platform engineering, or similar roles focused on reliability and distributed systems.
- Strong software engineering fundamentals, including debugging, testing, system design, and production-grade coding practices.
- Hands-on Kubernetes expertise, including troubleshooting workloads, networking, storage, RBAC, and multi-environment deployments.
- Strong experience with observability tools and techniques, including logs, metrics, and tracing for distributed system debugging.
- Proficiency in at least one programming language such as Python, Go, or Rust, with experience building internal tools or production systems.
- Strong analytical and communication skills, with the ability to break down complex incidents into clear root causes and actionable recommendations.
- Experience working in cross-functional environments with engineering, product, and customer-facing teams in fast-moving contexts.
- Self-directed and comfortable working in remote-first environments with shifting priorities driven by customer needs and escalations.
Benefits
- Competitive compensation package aligned with experience and seniority
- Fully remote work environment across Canada and the United States
- Opportunity to work on real-world production infrastructure used in complex enterprise environments
- Strong technical ownership with high impact on product reliability and customer experience
- Collaboration with experienced engineers in infrastructure, automation, and platform engineering
- Learning and growth opportunities in Kubernetes, observability, and large-scale distributed systems
- Inclusive and diverse team culture focused on collaboration and continuous improvement
- Exposure to open-source-driven infrastructure innovation
Why Apply Through Jobgether?
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.