Design, build, and maintain highly available, scalable, and secure infrastructure for DigitalXC AI's GenAI and automation platform. Monitor system performance, manage incident response, perform root-cause analysis, and implement reliability improvements. Collaborate with engineering teams on CI/CD pipelines, observability, capacity planning, and disaster recovery.
Job Description
Company Description
DigitalXC AI is a GenAI-powered hyper-automation and employee experience platform focused on transforming enterprise IT operations and support. The platform enables self-service, self-heal, self-help, and operations automation across major IT domains, backed by an app store of 650+ prebuilt automated services that can drive 50–60% automation within 12–18 months. DigitalXC AI delivers a consumer-grade, omnichannel experience through web and mobile apps, chat and voice bots, and integrations with tools like ServiceNow. Its intelligent virtual assistants and AI agents enhance productivity by supporting user queries, content creation, enterprise search, technical support, and more. The platform integrates with a wide range of enterprise technologies, including cloud, digital workplace, service desk, DevOps, networks, security, and leading business applications.
Role Description
This is a full-time, remote role for a Site Reliability Engineer at DigitalXC AI. The Site Reliability Engineer will design, build, and maintain highly available, scalable, and secure infrastructure that powers the company’s GenAI and automation platform. Day-to-day responsibilities include monitoring system performance, managing incident response, performing root-cause analysis, and implementing reliability and performance improvements. The role involves collaborating with software engineering teams to design resilient services, automate deployments, improve observability, and implement best practices for capacity planning and disaster recovery. The Site Reliability Engineer will also help define and refine SLOs/SLIs, manage CI/CD pipelines, and contribute to tooling that reduces operational toil.
Qualifications
- Strong site reliability engineering skills, including observability, incident management, capacity planning, and reliability best practices.
- Deep system administration and infrastructure skills, such as managing Linux-based systems, cloud platforms (e.g., AWS, Azure, GCP), networking basics, and infrastructure-as-code tooling.
- Solid software development skills, including proficiency in at least one programming or scripting language (e.g., Python, Go, Java, or Bash) and experience building automation and internal tools.
- Advanced troubleshooting skills for diagnosing complex production issues across applications, infrastructure, and third-party integrations.
- Experience with CI/CD pipelines, containers and orchestration (e.g., Docker, Kubernetes), and monitoring/logging stacks (e.g., Prometheus, Grafana, ELK, or similar) is highly beneficial.
- Understanding of security best practices for cloud-native environments, including access control, secrets management, and patching, is preferred.
- Effective communication skills, a collaborative mindset, and the ability to work independently in a remote, distributed team are essential.
- Bachelor’s degree in Computer