GPU Performance Engineer
CoreWeave
Job Description
What You’ll Do:
The GPU Performance Team is responsible for the automated testing and verification of CoreWeave’s ever-expanding fleet of nodes and infrastructure. The team plays a critical role in ensuring that the hardware we deliver to customers is stable and performant throughout its lifecycle.
About the role:
We are seeking a Senior Software Engineer to join the GPU Performance team and help us build and refine our automation platforms and workflows for reliability, efficiency, scalability, and observability. This individual will join a team of Golang and Python developers building innovative Kubernetes-based solutions for hardware performance testing. As a member of the GPU Performance team, you would have the opportunity to:
- Design and implement solutions for testing infrastructure at scale through automation.
- Develop platforms, workflows, data structures, API endpoints, and other testing management tools and capabilities.
- Create test plans, deployment automation, dashboards, alerts, and insights into our performance testing platform.
- Grow, change, invest in your teammates, be invested-in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself.
Who You Are:
You are a software engineer with a strong bias toward reliability, automation, and scale. You’re comfortable owning services end‑to‑end and working close to hardware, but your primary tools are Go or Python, Kubernetes, and well‑designed APIs.
You:
- Have experience building and maintaining backend services and APIs (REST/gRPC) and are comfortable integrating with infrastructure systems (Kubernetes, CI/CD, telemetry, metrics backends).
- Are motivated by automation at scale: you’d rather design a controller, workflow, or platform than run one‑off tests by hand.
- Are familiar with system metrics, performance, and health (e.g., GPU/CPU, network, storage, or similar infra) and know how to turn raw metrics into dashboards, alerts, and actionable insights.
- Are comfortable designing and implementing performance tests and validation workflows that run across many nodes and clusters.
- Care about code quality, observability, and operational readiness: logs, metrics, tracing, alerting, and on‑call are part of how you think about building systems.
- Are willing to participate in an on‑call rotation and help own the stability of the platforms you build.
- Work well in a collaborative, curious, and feedback‑driven team: you invest in your teammates, share ideas openly, and are excited to help shape how CoreWeave validates hardware performance at fleet scale.
Preferred:
- Experience writing and maintaining Kubernetes custom controllers and operators to automate infrastructure testing.
- Contributions to OSS projects such as llm-d, vLLM, or PyTorch.
- Exposure to benchmarking large GPU fleets or multi-region clusters.
- Experience with CUDA kernels.