Sr. Site Reliability Engineer I
DoubleVerify
Job Description
Hybrid (3 days per week in office)
Who We Are
DV is the leader in digital performance solutions, helping our advertiser and agency partners Verify the quality of their digital campaigns, Optimise to improve performance and Prove that they’re achieving their business outcomes, through unbiased 3rd party data and analytics. DV’s mission is to be the definitive source of transparency and data-driven insights into the quality and effectiveness of digital advertising for the world’s largest brands, agencies, publishers, and digital ad platforms. Since 2008, DV has helped hundreds of Fortune 500 companies gain the most from their media spend by delivering best-in-class solutions across the digital advertising ecosystem, helping to build a better industry. Learn more at www.doubleverify.com.
What You’ll Do
- Build and maintain the reliability, scalability, and performance of our digital media measurement platforms
- Implement observability best practices, including metrics collection, dashboarding, and alerting strategies that support proactive reliability improvements
- Reduce MTTR for critical incidents through automation, improved observability, and proactive monitoring
- Respond to incidents and drive them to resolution, managing Sev1/Sev2 situations
- Monitor and maintain high availability infrastructure and services across GCP, AWS, OCI, and on-premises environments
- Lead technical projects from planning through deployment, ensuring proper stakeholder communication and team enablement
- Build and deploy automations to eliminate operational toil and improve efficiency across deployment workflows, validation scripts, and self-service capabilities
- Leverage AI-assisted development tools to accelerate automation development and problem resolution
- Build custom integrations and MCP servers for monitoring platforms to enable programmatic access and AI-driven analysis
- Implement Infrastructure-as-Code using Terraform, Helm charts, Python and scripts, and configuration management tools to ensure repeatable, version-controlled infrastructure deployments
- Develop production automations for routine operational tasks, reducing manual intervention and accelerating task completion
- Create and maintain documentation, runbooks, and SOPs in Confluence to ensure consistent incident response across the team
- Participate in on-call rotations and post-incident reviews to minimize downtime and prevent recurrence
Required Experience & Skills
- 4+ years in Site Reliability Engineering, DevOps, or related operational roles with proven experience in Linux/Unix systems administration
- Proficiency in scripting and programming languages such as Python, Bash, or Go for automation and tool development
- Strong experience with cloud infrastructure and services across GCP, AWS, and OCI, as well as container orchestration tools like Kubernetes
- Expertise in monitoring and observability tools such as Prometheus, Grafana, Splunk, and Nagios
- Hands-on experience with Infrastructure-as-Code tools like Terraform, Ansible, or Helm
- Proven ability to develop and track SLIs, SLOs, and SLAs to drive reliability improvements
Technical Knowledge
- Deep understanding of networking, DNS, load balancing, and CDN technologies
- Familiarity with databases (SQL, NoSQL, Vertica, MongoDB, Snowflake) and data pipeline technologies
- Knowledge of CI/CD pipelines, GitLab, and deployment automation
- Experience with workflow automation platforms is a strong plus
Soft Skills & Mindset
- Exceptional communication skills with the ability to collaborate across teams and explain technical concepts clearly
- Proa