Senior Manager, Site Reliability Engineering (SRE)

SolarWinds

Bangalore, India

On-site

Posted April 5, 2026

Job Description

At SolarWinds, we’re a people-first company. Our purpose is to enrich the lives of the people we serve—including our employees, customers, shareholders, partners, and communities. Join us in our mission to help customers accelerate business transformation with simple, powerful, and secure solutions.

The ideal candidate thrives in an innovative, fast-paced environment and is collaborative, accountable, ready, and empathetic. We’re looking for individuals who believe they can accomplish more as a team and create lasting growth for themselves and others. We hire based on attitude, competency, and commitment. Solarians are ready to advance our world-class solutions and accept the challenge to lead with purpose. If you’re looking to build your career with an exceptional team, you’ve come to the right place. Join SolarWinds and grow with us!

Role Overview:

SolarWinds is looking for a Senior Manager, Site Reliability Engineering (SRE) to lead reliability, scalability, and operational excellence for large-scale, cloud-native, data-intensive SaaS platforms.

This role combines people leadership, technical depth, and operational ownership. You will manage and grow SRE teams responsible for production systems while remaining close to platform architecture, reliability engineering, incident response, and automation strategy.

The ideal candidate has operated distributed systems in production environments and is comfortable guiding teams through complex troubleshooting, reliability improvements, and architectural decisions. This role requires balancing availability, performance, operational efficiency, and engineering velocity across large-scale SaaS services.

Responsibilities:

Lead and mentor SRE teams responsible for the reliability, availability, and performance of production SaaS platforms
Own and drive production reliability outcomes, including uptime, latency, scalability, capacity planning, and operational readiness
Oversee data-intensive distributed systems, including technologies such as ClickHouse, Kafka, ZooKeeper, MySQL, Redis, and Flink
Guide and review Kubernetes platform operations at scale, including cluster lifecycle management, upgrades, troubleshooting, and capacity planning
Establish and evolve SRE practices, including SLIs/SLOs, alerting strategies, incident management, and post-incident reviews
Lead and participate in production incident response, guiding teams through debugging, root cause analysis, and long-term remediation