Senior IT Analyst Technical Infrastructure (Lead Site Reliability Engineer)
Join Caterpillar's IT team as a Lead Technology Specialist supporting the Autonomy & Automation Business Unit. Provide end-to-end operational ownership of Kubernetes-based platform environments, and ensure reliable provisioning, configuration, monitoring, and continuous improvement of clusters and workloads.
Job Description
Career Area:
Technology, Digital and Data
Job Description:
Your Work Shapes the World at Caterpillar Inc.
When you join Caterpillar, you're joining a global team who cares not just about the work we do – but also about each other. We are the makers, problem solvers, and future world builders who are creating stronger, more sustainable communities. We don't just talk about progress and innovation here – we make it happen, with our customers, where we work and live. Together, we are building a better world, so we can all enjoy living in it.
Job Summary
We are seeking a skilled Senior IT Analyst Technical Infrastructure (Lead Site Reliability Engineer) to join the PLM NPI - CAT IT Division.
Come work on the Caterpillar IT team as a Lead Technology Specialist supporting Caterpillar's Autonomy & Automation (A&A) Business Unit. The A&A team is focused on scaling technology solutions in mining, construction, quarry and aggregates, and beyond to support customer safety and productivity goals. A&A is responsible for technology solutions including autonomy, semi-autonomy, remote control, and other technologies. The goal is to address key customer problems, including safety, productivity, labor shortages, energy transition, and process optimization.
In this role as Lead Site Reliability Engineer, you will provide end-to-end operational ownership of Kubernetes-based platform environments deployed on on-premises hardware and in AWS. You will ensure reliable provisioning, configuration, monitoring, and continuous improvement of clusters and workloads; perform bug triage and incident response; drive observability and automation; and partner with platform, networking, and application teams to meet reliability objectives and business needs.
The preferred location for this role is the Whitefield PSN Office, Bangalore, KA, India.
What You Will Do
- Provision, configure, and maintain Kubernetes clusters on on‑premises infrastructure (bare metal or virtualized) and in AWS (e.g., EKS).
- Implement and manage Infrastructure as Code (IaC) and automated workflows for cluster creation, upgrades, and application deployments (e.g., Terraform, Ansible, Helm, Git‑based pipelines).
- Establish and operate comprehensive observability (metrics, logs, traces), including SLI/SLO definitions, alerting, dashboards, and runbooks for platform and key services.
- Monitor environment health (control plane and node components), capacity, performance, and cost; perform tuning and right‑sizing across on‑prem and cloud.
- Execute bug triage: reproduce issues, collect diagnostics, perform root‑cause analysis, and coordinate fixes with platform/application teams and vendors.
- Lead incident response for reliability events (degradations, outages), post‑incident reviews, and preventive actions.
- Administer Kubernetes security controls (RBAC, network policies, secrets management, image signing/scanning), certificate management, and compliance control implementation.
- Manage platform services (container registry, ingress/controllers, CNI, storage classes/CSI, service mesh where applicable).
- Implement backup/restore and disaster recovery strategies for clusters and stateful workloads (e.g., Velero), and validate them regularly.
- Maintain and improve CI/CD workflows integrating testing, policy checks, and progressive delivery for platform and shared services.
- Create and maintain operational documentation: standards, diagrams, runbooks, automation playbooks, and knowledge base articles.
- Collaborate with networking, security, and application teams to ensure reliability, performance, and secure connectivity across data centers and AWS.
- Drive continuous improvement: reliability engineering practices, toil reduction, automation, and change management processes.
Technical Skills Required
- Kubernetes administration and operations on on‑premises and AWS environments (cluster lifecycle, upgrades, node management, workload scheduling).
- Infrastructure as Code, automation, and Git‑based CI/CD.
- Observability stacks and tooling (e.g., Prometheus, Grafana, Alertmanager, OpenTelemetry; ELK/Loki‑class logging).
- Linux systems administration (container runtime, networking, storage).
- Networking fundamentals applied to Kubernetes (CNI, DNS, Ingress/Load Balancing, TLS/cert management, basic L3/L4 concepts).
- Security best practices (RBAC, pod security standards, network policies, image scanning, secrets management).
- Experience with incident response, on‑call participation, and root‑cause analysis in production environments.
- Strong documentation and communication skills; ability to work effectively with geographically distributed teams.
Nice to Have
- Experience with service mesh (e.g., Istio/Linkerd) and advanced container networking (e.g., eBPF‑based data paths, network policy engines).
- Familiarity with backup/DR tooling for Kubernetes (e.g., Velero) and stateful workload recovery.
- Exposure to Operational Technology (OT) or edge/remote site constraints and ruggedized deployments.
- Experience with configuration compliance, policy‑as‑code (e.g., Open Policy Agent), and supply‑chain security.
- Knowledge of platform registry operations, image lifecycle, and vulnerability management.
- This position requires the candidate to work a 5-day-a-week schedule in the office.
Technical Excellence: Knowledge of a given technology and various application methods; ability to develop and provide solutions to significant technical challenges.
Level Extensive Experience:
- Advises others on the assessment and provision of all technical solutions.
- Engages appropriate subject matter resources to effectively resolve technical issues.
- Mentors others to enhance their technical competence and its application to achieve more effective technical solutions.
- Coaches others in promoting, defining, analyzing, and providing superior technical solutions to business problems.
- Provides effective solutions to moderate technical challenges through strong technical competence, effectively examining implications of events and issues.
- Assumes accountability for personal technical performance and holds others responsible for theirs.
Level Working Knowledge:
- Assesses the current technology environment, expressed needs and initiatives of client organizations.
- Uses an effective consulting method to present technology solutions that resolve stated client business issues.
- Advises clients regarding a family of specific products, technologies or services in a technology domain.
- Demonstrates basic competence and sound business knowledge regarding specific products, technologies or services within a domain of technology expertise.
- Achieves consulting relationship rating of 'professional' by delivering timely, meaningful advice meeting client needs in a narrow set of specific technologies.
Level Extensive Experience:
- Evaluates IT hardware vendors in the market and selects the most suitable products for the organization.
- Guides employees on the integration of IT hardware throughout other organization-wide platforms.
- Supervises the implementation process of IT hardware ensuring consistency in productivity and overall effectiveness.
- Advises others on business standards and practices for IT hardware in order to meet designer requirements.
- Evaluates the advantages and disadvantages of an organization's hardware components.
- Diagnoses IT hardware problems and recommends dynamic solutions.
Level Working Knowledge:
- Follows policies, practices and standards for determining functional and informational requirements.
- Confirms deliverables associated with requirements analysis.
- Communicates with customers and users to elicit and gather client requirements.
- Participates in the preparation of detailed documentation and requirements.
- Utilizes specific organizational methods, tools and techniques for requirements analysis.
Level Extensive Experience:
- Verifies the proper flow of transactions across all input, output and storage channels or devices.
- Evaluates interoperability of new systems with existing systems during the beta testing phase.
- Supervises the testing of complex, multi-platform and distributed applications.
- Designs processes to ensure that the system meets and maintains requirements and expectations.
- Coaches end users on the development of test data and test scenarios for system validation.
- Manages the execution of test plans, including resources, strategies, schedules, processes and tools.
Level Working Knowledge:
- Reports software connectivity and integration issues.
- Demonstrates planned software changes on the local environment.
- Administers software migration and contingency plans related to own function.
- Analyzes the local software architecture components and products.
- Tests key features for the entire software infrastructure environment.
Level Extensive Experience:
- Emphasizes the business impact of failure and the criticality and timing of needed resolution so that problems can be avoided in the future.
- Creates trouble reports for all issues found and reviews solutions for completeness and correctness.
- Directs the resolution of communications problems in multi-vendor environments.
- Resolves a variety of hardware, software, and communications malfunctions.
- Coaches others on advanced diagnostic techniques and tools for unusual or performance-related problems.
- Facilitates the distribution of release reports and correction packages to departments or clients.
Level Extensive Experience:
- Conducts training on alternative documentation delivery mechanisms, tools and techniques.
- Manages cost items in producing and maintaining documentation.
- Designs and implements formal methodologies for producing documentation.
- Collaborates with support function managers, the product management team, and design engineers with writing projects.
- Supervises the analysis, design and data collation on large documentation initiatives.
- Establishes and references best practices for existing and planned tools and delivery vehicles for proper documentation.
Benefits & Perks
- Work Life Harmony
- Earned and medical leave.
- Relocation assistance
- Personal and professional development through Caterpillar's employee resource groups across the globe
- Career development opportunities with global prospects
- Medical coverage: medical, life, and personal accident coverage
- Employee mental wellness assistance program
- Employee investment plan
- Pay for performance: annual incentive bonus plan
Caterpillar is not currently hiring individuals for this position who now or in the future require sponsorship for employment visa status; however, as a global company, Caterpillar offers many job opportunities outside of the U.S. which can be found through our employment website at www.caterpillar.com/careers
Posting Dates:
February 18, 2026 - March 3, 2026
Caterpillar is an Equal Opportunity Employer. Qualified applicants of any age are encouraged to apply.
Not ready to apply? Join our Talent Community.