Senior Site Reliability Engineer — Government & Sovereign Cloud
Veeam Software
Compensation
$109,800 - $252,500/year
Job Description
Veeam is the Data and AI Trust Company, specializing in helping organizations ensure their data and AI are fully understood, secured, and resilient to enable the acceleration of safe AI at scale. As the market leader in both data resilience and data security posture management, Veeam is built for the convergence of identity, data, security, and AI risk. Headquartered in Seattle with offices in more than 30 countries, Veeam protects over 550,000 customers worldwide, who trust Veeam to keep their businesses running. Join us as we go fearlessly forward together, growing, learning, and making a real impact for some of the world’s biggest brands.
About The Role
Veeam is building a global SRE function to support the Veeam Data Cloud, our new SaaS platform. This role focuses on our Government and Sovereign Cloud environment.
Due to clearance and access requirements, this team operates with restricted access to GOV infrastructure. That means you'll be part of a small team responsible for the full platform stack — including all VDC workloads. You won't always be able to hand off problems to other teams; you need to understand the entire architecture well enough to own it. You'll need to get up to speed on the platform quickly, often by reading code, docs, and architecture artifacts rather than getting direct access to environments from day one.
This is a ground-up role — you'll help define how reliability engineering works here by mapping systems, writing runbooks, setting baselines, and building the practices this team will run on going forward.
What You'll Do
Discovery & Documentation
- Get up to speed on the full platform: all VDC workloads, dependencies, and risk areas. Much of this will happen through code, docs, and conversations rather than direct environment access.
- Work with SMEs across the org to fill knowledge gaps and build onboarding material for the team.
- Write and maintain runbooks, architecture docs, and operational guides.
Reliability & Incident Response
- Design infrastructure for high availability and fault tolerance on Azure (including Azure Government).
- Define SLIs, SLOs, and error budgets where none exist today.
- Run incident response and blameless postmortems. Turn incidents into improvements.
- Identify reliability risks across modern and legacy workloads and build practical remediation plans that work within compliance constraints.
Observability
- Close observability gaps: define instrumentation requirements and drive implementation.
- Set alerting, telemetry, and monitoring standards with partner teams.
- Build automation to reduce toil and support fleet management.
- Participate in on-call rotations.
Infras