Mid-Level

Site Reliability Engineer - Vice President

Citigroup

Pune, Maharashtra, India
On-site
Posted April 1, 2026

Job Description

The Site Reliability Engineer (SRE) is a strategic professional accountable for the daily operations, architectural resilience, and overall implementation of SRE principles in a complex, critical, large-scale, multi-disciplinary environment. This role requires a comprehensive understanding of multiple technology domains and how they interact to achieve business objectives. As a recognized technical authority, you will apply an in-depth understanding of the business impact of technical contributions and provide advice and counsel on strategic solutions.

We are seeking a passionate and experienced SRE to join our Production Management team. In this role, you will be instrumental in enhancing the reliability, performance, and efficiency of our applications and services. You will drive our strategy for end-to-end observability and resiliency, collaborating across the organization to ensure our services are stable, scalable, and fault-tolerant. This is a key role that will influence strategic decisions and foster a culture of technical excellence and accountability.

Key Responsibilities

Culture & Strategy

  • Foster a culture of transparency, innovation, and accountability that encourages continuous improvement.
  • Communicate the progress and impact of SRE initiatives to stakeholders at all levels.
  • Operate effectively within a highly regulated environment, ensuring compliance with all relevant requirements.

Resiliency & Recovery

  • Ensure critical business applications meet stringent operational resilience requirements, including adherence to defined impact tolerances.
  • Oversee advanced recovery testing, including Production Swing Tests, Data Recovery Tests, and chaos engineering practices.
  • Drive the adoption and development of automation, such as One Touch Recovery solutions, to minimize recovery time.
  • Partner with development teams to leverage cloud-native services and established resiliency patterns to enhance application reliability.

Observability & Performance

  • Collaborate across the organization to develop and scale observability solutions using modern tools for metrics, logging, and tracing.
  • Partner with development teams to effectively instrument applications, providing deep insights into system health and performance.

Qualifications:

  • At least 7 years of experience is required.
  • Deep understanding of SRE concepts, including SLOs, SLIs, error budgets, and toil reduction.
  • Demonstrable experience with disaster recovery planning, resiliency testing, and fault-tolerant distributed system design.
  • Proficiency in deploying, managing, and troubleshooting applications on OpenShift/Kubernetes.
  • Hands-on experience with modern observability tools (e.g., Prometheus, Grafana, Loki, Mimir, Tempo, AppDynamics).
  • Experience with Infrastructure as Code (IaC), configuration management, and automation tools (e.g., Ansible, Terraform).
  • Experience creating, modifying, and managing Helm charts for application deployment.
  • Significant professional experience in production management, software development, or an equivalent field, with a strong focus on Site Reliability Engineering.
  • Expertise in analyzing complex application, database, network, and OS issues within large-scale, customer-facing systems.
  • A service-oriented attitude combined with excellent problem-solving and strategic thinking skills.
  • Strong communication and diplomacy skills, with a proven ability to work effectively across multiple business and technical teams.

Desired Skills:

  • Experience with major public cloud providers (e.g., Google Cloud, AWS, Azure).
  • Proven experience delivering software and infrastructure using Agile frameworks.
  • Experience presenting technical strategy to senior and executive-level audiences.
  • Experience writing or maintaining code in Java, Python, Go, or similar languages.

Education:

  • Bachelor’s/University degree; Master’s degree preferred

------------------------------------------------------

Job Family Group:

Technology

------------------------------------------------------

Job Family:

Applications Support

------------------------------------------------------

Time Type:

Full time

------------------------------------------------------


Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.

View Citi’s EEO Policy Statement and the Know Your Rights poster.
