Senior Site Reliability Engineer - Observability

Okta

Compensation

$147,000 - $202,000/year

Bellevue, Washington

Hybrid

Posted March 23, 2026

Job Description

Secure Every Identity, from AI to Human

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.

This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

Position Overview:

We are seeking a highly technical Senior Observability Site Reliability Engineer with a specialty in Splunk to own and evolve our Splunk ecosystem. In this role, you will move beyond simple monitoring to deliver a world-class, comprehensive, scalable Observability Platform that enables our SRE teams and business partners. You will treat infrastructure as code, using Terraform and strong coding proficiency in Go, Python, or Ruby to automate the deployment of agents and collectors across complex distributed systems.

Key Responsibilities

Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
Splunk Engineering: Optimize the collection, processing, and storage of log data to ensure high reliability and low latency of our Splunk services.
Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and "observability-driven development."
Automation: Eliminate "toil" by automating the deployment and scaling of observability agents and collectors.

Required Skills & Experience (The Essentials)

Log Management: 5+ years of experience managing Splunk Cloud at scale (1000+ SVCs), including Workload Management (WLM) and HEC optimization.
Visualization: Expertise in creating intuitive, actionable Splunk dashboards that correlate data across multiple sources.
SRE Mindset: 3+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.

Programming Proficiency: Strong coding skills in SPL and Go for building internal tools and automating workflows.
Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/EKS).
Problem Solving: A data-driven approach to debugging complex, cross-service performance bottlenecks.

Bonus Skills (The "Nice-to-Haves")

Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.
Charge-back app: Experience implementing the Splunk charge-back app for usage reporting.

Cloud Platforms: Experience managing native observability tools within AWS or GCP.

Additional requirements:

This position requires the ability to access federal environments and/or protected federal data. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g., a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; see 22 CFR 120.15) upon hire.
The successful candidate must attend in-person onboarding in our San Francisco office during the first week of employment.


Below is the annual base salary range for candidates located in California (excluding San Francisco Bay Area), Colorado, Il
