Platform Observability - Senior Manager, Software Engineering
Elastic
Compensation
$189,800 - $274,500/year
Job Description
Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale — unleashing the potential of businesses and people. The Elastic Search AI Platform, used by more than 50% of the Fortune 500, brings together the precision of search and the intelligence of AI to enable everyone to accelerate the results that matter. By taking advantage of all structured and unstructured data — securing and protecting private information more effectively — Elastic’s complete, cloud-based solutions for search, security, and observability help organizations deliver on the promise of AI.
What Is The Role:
The Platform Observability Team provides critical, scalable, and efficient observability processes and tooling for Elastic's internal developers and engineering teams. We operate 200+ deployments for logs, metrics, and traces, including regional logging, metrics, and monitoring clusters. We build self-service tooling that empowers engineers to instrument, monitor, and troubleshoot their services independently. We're part of Platform Infrastructure, the organization responsible for providing the runtime for all Elastic Cloud-powered products. We are builders at heart!
Are you a leader who puts people first but still loves diving into the technical details of observability and distributed systems? We're looking for a Senior Engineering Manager to lead this globally distributed team of 7 SREs across EMEA, Americas, and APJ.
We need someone who can truly understand the "how" and "why" of observability platforms, ensuring our internal customers have the tools they need to build reliable services at scale. You'll own SLA monitoring infrastructure for Elastic Cloud (ESS and Serverless) and drive adoption of Elastic's own observability stack across the organization.
We value leaders who embrace our SRE culture: go slow to go fast, own problems end-to-end, make sound and timely decisions, and create amazing experiences for both internal and external customers.
What You Will Be Doing:
- People & Talent Management: You will mentor and lead a globally distributed team of SREs, fostering a culture of ownership, psychological safety, and continuous improvement. You'll be responsible for the full employee lifecycle, from hiring top SRE talent to helping team members reach their next promotion and creating clear career paths.
- Strategy & Execution: How do we turn observability needs into platform capabilities? You'll partner with engineering teams and Platform SRE leadership to understand requirements, build roadmaps, and translate them into clear deliverables. You'll drive adoption of observability best practices and ensure our platforms meet the needs of internal customers.
- Technical & Operational Leadership: We take full accountability for our platforms. You'll partner with your team to facilitate technical discussions, navigate trade-offs, and drive delivery of high-quality observability solutions. You'll champion reliability improvements, incident management processes, and blameless postmortems, keeping our platforms production-ready.
What You Bring:
- Management Experience: 3+ years leading technical teams, with a focus on mentoring and talent development.
- Technical Foundation: 5+ years in SRE, DevOps, or infrastructure engineering, with enough depth to understand the work and guide senior engineers through complex problems.
- Observability Expertise: Strong understanding of observability principles: metrics, logs, traces, and APM. You know what good looks like. Experience defining and tracking SLIs, SLOs, and error budgets.
- SaaS at Scale: Previous success supporting high-scale, multi-tenant global platforms.
- Distributed Systems Background: Experience operating and scaling systems in cloud environments (AWS, GCP, or Azure) and improving reliability.
- Distributed Leadership: Experience managing geographically distributed teams across multiple time zones and cultures.
- Operational Rigor: Experience with incident management, on-call processes, and postmortem practices.
- Infrastructure as Code: Familiarity with Kubernetes, Terraform, and GitOps practices.
- Communication: A knack for translating technical strategy and progress for all audiences, technical or not. Strong written communication is important here.
Bonus Points:
- Elastic Stack Ex