Team Lead, Site Reliability Engineering
Veeam Software
Job Description
Veeam is the Data and AI Trust Company, specializing in helping organizations ensure their data and AI are fully understood, secured, and resilient to enable the acceleration of safe AI at scale. As the market leader in both data resilience and data security posture management, Veeam is built for the convergence of identity, data, security, and AI risk. Headquartered in Seattle with offices in more than 30 countries, Veeam protects over 550,000 customers worldwide, who trust Veeam to keep their businesses running. Join us as we go fearlessly forward together, growing, learning, and making a real impact for some of the world’s biggest brands.
About the Role
Veeam is expanding its Site Reliability Engineering (SRE) organization to support Veeam services. As an SRE Team Leader, you will build and lead a high-performing team that partners with product, platform, and security engineering to make our systems reliable, scalable, and observable from the ground up. You’ll collaborate with peer engineering leaders to embed reliability into service roadmaps.
You’ll drive adoption of SRE principles (SLIs/SLOs/error budgets) and operate a healthy, daytime follow-the-sun on-call model in partnership with other regions. You will lead your team to make improvements in the overall operability, reliability, resilience, and security of the services we support.
What You’ll Do
People & Team Leadership
- Hire, onboard, and develop your SRE team
- Encourage a culture that prioritizes learning and engineering over fault-finding and firefighting
- Ensure sustainable operational coverage; monitor on-call health and workload
Reliability Strategy & Governance
- Establish and operationalize SLIs/SLOs and error budgets with service owners
- Run reliability reviews and hold teams accountable to outcomes
- Define reliability standards, runbooks, readiness checklists, and alerting patterns (including SLO-based alerting)
Operations & Incident Excellence
- Ensure incident response readiness
- Lead and coordinate major incidents
- Measure MTTR, change failure rate, SLO posture, and repeat-incident reduction
Engineering & Automation
- Lead software-first reliability investments: observability, resilience testing/chaos, and self-service guardrails
- Drive platform improvements and internal tools
What You’ll Bring
- 3+ years of experience managing Software, Platform, and/or Reliability Engineering
- Experience in IT Platform Engineering or Software Development
- Demonstrable experience leading engineering teams to predictably deliver outcomes
- Demonstrated success leading SLO/error-budget adoption and reliability programs for services
- Experience leading cross-functional initiatives collaboratively with peers through influence
- Experience with public clouds, Kubernetes, IaC, CI/CD, and observability