Jobgether, on behalf of a partner company, is seeking a Site Reliability Engineer to ensure the reliability, scalability, and performance of complex systems across cloud and on-premises environments. The ideal candidate will have experience in infrastructure engineering and operational best practices. This role involves hands-on management of large-scale data centers, automation of deployment workflows, and integration of observability tools.
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer in the United States.
This role is responsible for ensuring the reliability, scalability, and performance of complex systems across cloud and on-premises environments. The Site Reliability Engineer will work closely with development, operations, and product teams to design and maintain resilient infrastructure, implement CI/CD pipelines, and manage containerized applications and Kubernetes clusters. You will proactively monitor system performance, troubleshoot critical issues, and optimize operational processes to maintain high service availability. This position involves hands-on management of large-scale data centers, automation of deployment workflows, and integration of observability tools. The ideal candidate is highly analytical, detail-oriented, and experienced in both infrastructure engineering and operational best practices. Success in this role directly impacts system uptime, operational efficiency, and overall customer satisfaction.
Accountabilities
- Design, implement, and maintain scalable, highly available infrastructure using containers, microservices, and Kubernetes.
- Monitor system performance, troubleshoot reliability issues, and ensure optimal operation of both cloud-based and on-premises systems.
- Manage CI/CD pipelines and GitOps workflows, including ArgoCD, Helm charts, and Kustomize configurations for efficient software deployment.
- Implement configuration management processes using tools like Ansible to ensure consistent environments across data centers.
- Operate and optimize high-throughput Kafka clusters for event streaming, including replication, partitioning, and disaster recovery strategies.
- Collaborate with development teams to influence system design, operational policies, and best practices.
- Maintain comprehensive technical documentation, runbooks, architectural diagrams, and incident response procedures.
- Participate in on-call rotations and conduct blameless post-mortems for critical incidents.
- Continuously evaluate emerging technologies to enhance operational efficiency and reliability.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field; advanced degree preferred.
- 5+ years of experience in site reliability engineering or a related field focused on production systems and service delivery.
- Strong Linux systems expertise, including configuration, tuning, and troubleshooting.
- Hands-on experience with containers, Kubernetes, and microservices architecture.
- Proficient in CI/CD pipeline management and GitOps workflows, including ArgoCD, Helm charts, and automation tools.
- Experience with observability tools such as Prometheus, Grafana, and ELK Stack.
- Proven ability to manage large on-premises data centers with hundreds of bare metal servers and VMs.
- Familiarity with networking concepts, protocols, and configuration management tools.
- Strong analytical and troubleshooting skills with the ability to resolve complex system issues.
- Excellent communication skills and experience collaborating across cross-functional teams.
Benefits
- Competitive salary: $118,000–$158,000 USD, depending on experience and location.
- Comprehensive medical, dental, and vision coverage for employees and dependents.
- Employer-paid income protection benefits including life, AD&D, short- and long-term disability.
- Flexible spending accounts for healthcare and dependent care.
- Retirement plan with 401(k) and employer match, plus Roth options.
- Employee Stock Purchase Plan (ESPP) and potential bonuses.
- Paid time off, sick leave, and company-observed holidays.
- Employee Assistance Program and additional perks such as commuter benefits, discount programs, and identity theft protection.
- Fully remote work opportunity within the U.S.
Why Apply Through Jobgether?
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.