Manager, Site Reliability Engineering

Veracode

Burlington, MA

On-site

Posted March 9, 2026

Job Description

Looking for an innovative, high-growth, multi-award-winning company in one of the hottest segments of the security market? Look no further than Veracode!

Veracode is a global leader in Application Risk Management for the AI era. Powered by trillions of lines of code scans and a proprietary AI-generated remediation engine, the Veracode platform is trusted by organizations worldwide to build and maintain secure software from code creation to cloud deployment.

Learn more at www.veracode.com, on the Veracode blog, and on LinkedIn and Twitter.

We are seeking a skilled Manager, Site Reliability Engineering to lead the reliability, availability, and operational excellence of Veracode’s production systems. This role focuses on defining and enforcing reliability standards, managing risk in production, and ensuring services meet agreed-upon service levels under real-world load and failure conditions.

The ideal candidate has experience operating large-scale distributed systems in production, driving and implementing SLO-based reliability practices, and partnering with engineering, security, DevOps, and product teams to improve both system reliability and developer velocity.

Key Aspects of Role

Lead a 9-member global Site Reliability Engineering team.
Set objectives and key results (OKRs) and KPIs, and manage team performance.
Act as the primary point of accountability for reliability concerns that span multiple teams, including DevOps, Security, Database, and Product Engineering, driving alignment and resolution.
Manage the team’s on-call schedule and act as the point of escalation for alerts and production incidents.
Create tickets, groom the backlog, and prioritize work in sprints.
Utilize AWS services to design scalable cloud solutions that support critical systems.
Partner with software engineering teams to ensure monitoring and alerting are in place, enabling consistent, scalable, and automated service delivery.
Own the design and enforcement of the organization’s observability strategy, ensuring continuous improvements in reliability, standardization, and observability across the board.
Drive alert hygiene, standardization, and reduction of alert fatigue across the organization.
Lead efforts to automate infrastructure deployment and management using Terraform, Kubernetes, and other cloud-native tools.
Create automated incident response workflows to handle common infrastructure and application issues.
Collaborate with security teams to ensure systems adhere to industry-standard security practices and policies.
Document and train engineering teams on best practices in reliability, scalability, and operational excellence.
Design, operate, and continuously improve on-call and incident response processes to ensure sustainability, appropriate escalation, and reduction of operational toil.
Contribute to incident and process post-mortems.