Senior Platform & Site Reliability Engineer - Cloud Platform Leadership
Lead platform architecture and engineering standards for a rapidly growing enterprise software organization. Own and operate the shared cloud platform, CI/CD, observability, and event streaming infrastructure. Collaborate with a U.S.-based team, with availability until at least 3:00 PM EST.
Job Description
Our client is a rapidly growing enterprise software organization that acquires and scales B2B SaaS products. They are building a shared cloud platform that serves as the engineering foundation for a growing portfolio of enterprise applications. This platform provides standardized infrastructure, deployment, observability, automation, and reliability capabilities across multiple products while enabling future growth without proportionally increasing operational complexity.
The organization is investing in modern platform engineering practices, cloud-native technologies, Infrastructure as Code, AI-assisted engineering, and operational automation to build a scalable, highly reliable engineering ecosystem.
They are looking for an experienced Senior Platform & Site Reliability Engineer to take ownership of the shared platform, establish engineering standards, and design the infrastructure that supports multiple enterprise SaaS products. This is a hands-on technical leadership role where you will influence platform architecture, developer experience, operational reliability, and engineering best practices.
Working Hours: This role requires daily collaboration with a U.S.-based engineering team. Candidates must be available to work until at least 3:00 PM EST (U.S. Eastern Time), with flexibility to work beyond these hours when business needs require.
Responsibilities
Platform Engineering
- Own the architecture and operation of the shared platform, including CI/CD, observability, deployment automation, secrets management, and developer tooling.
- Define, implement, and enforce platform engineering standards across multiple products.
- Build and maintain Infrastructure as Code using Terraform or OpenTofu, ensuring all infrastructure is version-controlled, reviewed, and provisioned through automation.
- Develop self-service platform capabilities that enable engineering teams to deploy independently.
Event Streaming & Data Processing
- Design and maintain event streaming infrastructure supporting real-time processing workloads.
- Build and support batch processing infrastructure alongside live transactional systems.
- Ensure reliability, scalability, performance, and cost efficiency of platform services.
CI/CD & Deployment
- Design, build, and maintain CI/CD pipelines using GitHub Actions.
- Automate recovery for common pipeline failures and improve deployment reliability.
- Implement release management strategies, rollback mechanisms, and deployment patterns such as canary or blue-green deployments where appropriate.
Observability & Site Reliability
- Own and maintain the observability platform using Grafana, Prometheus, Loki, CloudWatch, and related monitoring tools.
- Define Service Level Objectives (SLOs), error budgets, and reliability metrics across multiple products.
- Build intelligent alerting and monitoring solutions that provide actionable diagnostic information.
- Design incident response processes, escalation procedures, and post-incident review practices.
- Implement safe automated remediation for well-understood operational scenarios while ensuring human oversight for complex incidents.
Platform Expansion & Integration
- Assess newly onboarded products for infrastructure maturity, Infrastructure as Code coverage, observability, and security.
- Plan and execute platform integration and modernization initiatives while minimizing operational disruption.
- Support the adoption of standardized platform capabilities across multiple engineering teams.
Engineering Automation
- Leverage AI-assisted engineering tools and automation where appropriate to reduce operational overhead.
- Automate infrastructure provisioning, CI/CD workflows, monitoring, secrets management, and operational tasks while maintaining engineering oversight for high-impact decisions.
Preferred Technology Stack
- AWS
- Terraform / OpenTofu
- GitHub Actions
- Grafana
- Prometheus
- Loki
- AWS CloudWatch
- AWS Secrets Manager or HashiCorp Vault
- Amazon ECS and EKS
- Event streaming technologies
- Cost monitoring and cloud optimization tools
Requirements
- 8–12 years of experience in Platform Engineering, Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure Engineering.
- Proven experience designing and operating production platform infrastructure across multiple environments or products.
- Strong hands-on experience with Terraform (or OpenTofu) and Infrastructure as Code.
- Extensive experience designing and maintaining CI/CD pipelines using GitHub Actions.
- Experience operating event streaming infrastructure in production environments.
- Strong AWS expertise, including ECS, EKS, IAM, VPC, RDS, CloudWatch, networking, and cloud infrastructure.
- Hands-on experience with Grafana, Prometheus, Loki, and enterprise observability platforms.
- Strong understanding of SRE principles, including SLOs, error budgets, incident response, and operational excellence.
- Experience designing scalable, secure, highly available cloud infrastructure.
- Strong troubleshooting, automation, and problem-solving skills.
- Excellent communication skills with the ability to establish engineering standards across multiple teams.
Nice to Have
- Experience building shared platform engineering capabilities supporting multiple products or business units.
- Experience integrating newly acquired products or modernizing legacy platforms.
- Experience designing developer self-service platforms.
- Familiarity with AI-assisted engineering workflows and infrastructure automation.
- Experience supporting high-volume enterprise SaaS products and distributed systems.
- Strong focus on cloud cost optimization and operational efficiency.
What We Offer
- Competitive market salary.
- Fully remote work.
- Opportunity to build and shape the engineering platform supporting a growing portfolio of enterprise SaaS products.
- Work alongside experienced international engineering teams.
- Exposure to modern cloud technologies, AI-assisted engineering, automation, and large-scale platform initiatives.
- Professional growth through ownership of platform architecture, operational reliability, and engineering standards.
- Daily collaboration with a U.S.-based engineering team, with availability required until at least 3:00 PM EST and flexibility to work longer when needed.