Senior Production Engineer, Tooling & Frameworks
CoreWeave
Compensation
$139,000 - $204,000/year
Job Description
About the Role
Production Engineering ensures CoreWeave’s cloud runs with world-class reliability, performance, and operational excellence. Herd is our newest innovation: an agentic AI platform that serves as CoreWeave’s intelligent SRE assistant, combining AI reasoning, data infrastructure, and observability into an autonomous operational intelligence layer for internal use.
As a Production Engineer on Herd, you’ll define and build the systems that power a scalable agentic ecosystem. You’ll design distributed services and data pipelines that process, embed, and retrieve operational knowledge at scale, enabling LLM-powered agents to work alongside human engineers in production. This is a hands-on role at the intersection of AI operations, distributed systems, and data infrastructure.
What You’ll Do
- Architect and build large-scale distributed systems that power AI SRE platforms.
- Design data infrastructure for AI reasoning (embedding generation, context retrieval, vector stores) optimized for real-time operational queries.
- Build agent orchestration and lifecycle components so agents can communicate, delegate, and reason collectively across CoreWeave systems.
- Integrate the AI SRE platform with many internal systems (Kubernetes, observability platforms, and others) to enable end-to-end automation and insights.
- Lead architectural design discussions and set technical direction for AI-driven reliability systems.
- Partner across Production Engineering, Data Engineering, ML Infrastructure, and Platform to operate AI SRE as a high-availability platform embedded in critical reliability workflows.
- Develop services that interpret telemetry, detect anomalies, and generate RCA (root cause analysis) and PIR (post-incident review) artifacts; trigger automated mitigations where appropriate.
- Codify operational best practices into services, APIs, and Kubernetes-native components.
- Participate in an on-call rotation supporting the systems you build.
What You’ve Worked On (Minimum Qualifications)
- 5+ years in software or infrastructure engineering building and operating distributed systems at scale.
Tags: python, go, rust, aws, kubernetes, ai, ios, data, product, design