Staff Site Reliability Engineer
Thrive Market
Compensation
$180,000 - $225,000/year
Job Description
THE ROLE
We’re looking for a Staff Site Reliability Engineer to help define and build the reliability foundation for Thrive Market’s platform. You’ll be working with a first-class group of engineers to establish our SRE practice from the ground up: defining SLOs, SLIs, and error budgets, building observability into everything we do, and creating the frameworks that ensure our systems scale reliably during our company’s rapid growth.
This is a high-impact role at an exciting inflection point. We’ve recently containerized our entire platform on Kubernetes, and we’re evaluating a potential migration to a next-generation ecommerce platform. You’ll be balancing hands-on reliability work with the strategic thinking needed to build systems that self-heal and get better over time.
If you’ve read books like Google’s Site Reliability Engineering, The Phoenix Project, Accelerate, or The DevOps Handbook, this is the right place for you!
RESPONSIBILITIES
Reliability & Observability
- Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
- Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
- Establish error budgets and use them to balance feature velocity with reliability investments
- Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
- Design and implement chaos engineering practices to proactively identify failure modes before they impact members
Infrastructure & Platform
- Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
- Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
- Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
- Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
- Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
- Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices
Culture & Process
- Help establish SRE as a practice at Thrive Market, defining the team’s charter, processes, and engagement model with product engineering teams
- Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
- Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
- Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
- Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement
QUALIFICATIONS
Required
- B.S. in Computer Science or equivalent professional