Senior Software Engineer, Observability Insights
CoreWeave
Compensation
$165,000 - $242,000/year
Job Description
About the role:
We are seeking senior engineers to lead our Observability Insights effort, building the product experiences and agentic interfaces that sit on top of our foundational telemetry layer. You will play a pivotal role in enabling CoreWeave and its customers to understand, troubleshoot, and optimize complex AI systems by delivering core building blocks like multi-tenant APIs, managed Grafana experiences, and MCP-based tool servers. You’ll collaborate closely with PMs and engineering leadership to shape the end-to-end observability experience, providing an outsize opportunity to influence how the world interacts with the forefront of Artificial Intelligence.
Some of what you’ll work on:
- Design and build highly available, multi-tenant APIs that expose telemetry and derived insights in a developer-obsessed way.
- Modernize how users interact with data by building agentic experiences, including MCP servers, agentic tools and API gateways that safely expose foundational telemetry.
- Build observability capabilities that enable agentic workflows for guided debugging, workload optimization, and incident summarization, empowering CoreWeavers and customers alike.
- Develop and enforce best practices regarding the health of telemetry data pipelines, specifically focused on correlation primitives and aggregation services for RCA and performance detection.
- Improve the performance, security, reliability, and scalability of insights services, including SLO ownership and latency optimization, while participating in the team’s on-call rotation.
- Collaborate closely with internal engineering teams, applying a platform-as-a-product mindset to understand their needs and embed observability best practices and custom tooling into their systems.
- Contribute to the overall observability strategy, influencing the direction of our platform.
Who you are:
- Six or more years of experience in software or infrastructure engineering, with a focus on building production-grade backend systems and distributed APIs.
- You are customer-obsessed, excited to provide infrastructure as a service, and default to a product lens when building developer-facing surfaces like SDKs and CLIs.
- Versed in reliability engineering concepts, including evaluation datasets for LLMs, error budgets for platform services, and fault-tolerant design for multi-tenant systems.
- Familiar with observability systems such as ClickHouse, Loki, VictoriaMetrics, Prometheus, and Grafana.
- Experienced in building agentic applications or LLM features, with a pragmatic approach to grounding, tool calling, and operational safety.
- Comfortable using Go as your primary programming language, and able to work with Python components when required for agentic layers.
- Work with a passionate team of engineers in an iterative, high-trust agile environment to ensure the collection-to-insights pipeline works end-to-end.
Preferred:
- Operated Kubernetes clusters at scale, with experience debugging real-world AI workloads.
- Experience with logging, tracing, and metrics platforms in production and at scale, with a deep u