Staff Site Reliability Engineer

Zocdoc

Silicon Valley, CA

Hybrid

Posted April 1, 2026

Job Description

Our Mission

Healthcare should work for patients, but it doesn’t. In their time of need, they call down outdated insurance directories. Then wait on hold. Then wait weeks for the privilege of a visit. Then wait in a room solely designed for waiting. Then wait for a surprise bill. In any other consumer industry, the companies delivering such a poor customer experience would not survive. But in healthcare, patients lack market power. Which means they are expected to accept the unacceptable.

Zocdoc’s mission is to give power to the patient. To do that, we’ve built the leading healthcare marketplace that makes it easy to find and book in-person or virtual care in all 50 states, across 200+ specialties and 12k+ insurance plans. By giving patients the ability to see and choose, we give them power. In doing so, we can make healthcare work like every other consumer sector, where businesses compete for customers, not the other way around. In time, this will drive quality up and prices down.

We’re 18 years old and the leader in our space, but we are still just getting started. If you like solving important, complex problems alongside deeply thoughtful, driven, and collaborative teammates, read on.

Your Impact on Our Mission:

As a Staff Site Reliability Engineer (SRE) at Zocdoc, you will shape how we operate safe, observable, and scalable systems across the company. You’ll lead initiatives that improve incident response, define reliability patterns, and drive organization-wide operational excellence—helping us build systems that fail gracefully, recover quickly, and scale efficiently.

You won’t just respond to incidents—you’ll help design the systems, tools, and practices that teams rely on to avoid them. Your work will clarify ownership, improve on-call quality, and strengthen our observability posture. By embedding best practices into how we build and run services, you’ll enable every engineering team at Zocdoc to move faster, safer, and with greater confidence.

You’ll thrive in this role if you…

Stay composed and clear during incidents, and use them as catalysts for systemic improvement
Treat observability as a strategic capability that enables better decisions, not just better dashboards
Build scalable, default-safe patterns and tools that support resiliency and reliability
Build strong cross-functional relationships and navigate complex systems to drive scalable, reliable outcomes
Are endlessly curious—about how systems fail, how teams operate, and how to make both better
Share knowledge generously and help others build with confidence and operational rigor

Your day to day is…

Participate in and influence high-impact incident response efforts, contributing calm decision-making and retrospective-driven learning
Define and evolve org-wide incident practices, retrospectives, and reliability tooling
Architect and evolve observability platforms that offer actionable insight into system health, business-critical paths, and failure modes
Lead the development of reliability and observability practices, including alerting hygiene, SLOs, and deployment safeguards
Guide teams in building resilient, fault-tolerant services through consultative design, operational reviews, and safety-focused defaults
Partner with Product, Platform, and Security teams to ensure new systems are operable and scalable from day one
Design and implement internal tools that improve deployment safety, incident coordination, and production readiness
Mentor engineers across teams in operational rigor, reliability principles, and system debugging

You’ll be successful in this role if you have…

8+ years of experience operating and scaling production infrastructure in cloud-native environments
Deep expertise in incident response, debugging distributed systems, and driving reliability improvements
Strong working knowledge of observability stacks (metrics, logs, traces), alerting strat
