Staff Site Reliability Engineer
Zocdoc
Job Description
Our Mission
Healthcare should work for patients, but it doesn’t. In their time of need, they call down outdated insurance directories. Then wait on hold. Then wait weeks for the privilege of a visit. Then wait in a room solely designed for waiting. Then wait for a surprise bill. In any other consumer industry, the companies delivering such a poor customer experience would not survive. But in healthcare, patients lack market power. Which means they are expected to accept the unacceptable.
Zocdoc’s mission is to give power to the patient. To do that, we’ve built the leading healthcare marketplace that makes it easy to find and book in-person or virtual care in all 50 states, across 200+ specialties and 12k+ insurance plans. By giving patients the ability to see and choose, we give them power. In doing so, we can make healthcare work like every other consumer sector, where businesses compete for customers, not the other way around. In time, this will drive quality up and prices down.
We’re 18 years old and the leader in our space, but we are still just getting started. If you like solving important, complex problems alongside deeply thoughtful, driven, and collaborative teammates, read on.
Your Impact on Our Mission:
As a Staff Site Reliability Engineer (SRE) at Zocdoc, you will shape how we operate safe, observable, and scalable systems across the company. You’ll lead initiatives that improve incident response, define reliability patterns, and drive organization-wide operational excellence—helping us build systems that fail gracefully, recover quickly, and scale efficiently.
You won’t just respond to incidents; you’ll help design the systems, tools, and practices that teams rely on to avoid them. Your work will clarify ownership, improve on-call quality, and strengthen our observability posture. By embedding best practices into how we build and run services, you’ll enable every engineering team at Zocdoc to move faster, more safely, and with greater confidence.
You’ll thrive in this role if you…
- Stay composed and clear during incidents, and use them as catalysts for systemic improvement
- Treat observability as a strategic capability that enables better decisions, not just better dashboards
- Build scalable, default-safe patterns and tools that support resiliency and reliability
- Build strong cross-functional relationships and navigate complex systems to drive scalable, reliable outcomes
- Are endlessly curious—about how systems fail, how teams operate, and how to make both better
- Share knowledge generously and help others build with confidence and operational rigor
Your day to day is…
- Participate in and influence high-impact incident response efforts, contributing calm decision-making and retrospective-driven learning
- Define and evolve org-wide incident practices, retrospectives, and reliability tooling
- Architect and evolve observability platforms that offer actionable insight into system health, business-critical paths, and failure modes
- Lead the development of reliability and observability practices, including alerting hygiene, SLOs, and deployment safeguards
- Guide teams in building resilient, fault-tolerant services through consultative design, operational reviews, and safety-focused defaults
- Partner with Product, Platform, and Security teams to ensure new systems are operable and scalable from day one
- Design and implement internal tools that improve deployment safety, incident coordination, and production readiness
- Mentor engineers across teams in operational rigor, reliability principles, and system debugging
You’ll be successful in this role if you have…
- 8+ years of experience operating and scaling production infrastructure in cloud-native environments
- Deep expertise in incident response, debugging distributed systems, and driving reliability improvements
- Strong working knowledge of observability stacks (metrics, logs, traces), alerting strat