HPC Performance Engineer
Confirmed live in the last 24 hours
CoreWeave
Compensation
$165,000 - $242,000/year
Job Description
What you'll do:
CoreWeave is seeking a highly skilled and motivated HPC Performance Engineer to join our HAVOCK Team, reporting into the Manager of Systems Engineering. In this role, you will play a crucial part in the design, development, and optimization of our bare-metal systems from POST through joining a Kubernetes cluster. The team’s primary responsibilities include maintaining a custom Linux kernel, various OS images (Ubuntu-based), the virtualization stack (kubevirt/qemu/vfio), and the container/pod runtime stack (containerd/nydus/kubelet). You will collaborate closely with cross-functional teams, up stack engineering teams, and stakeholders to ensure our low-level software stack is performant in the context of hardware updates; and providing data, metrics, dashboards, and analysis to substantiate performance assertions.
Kernel Hardware - Acceleration - Virtualization - Operating Systems - Containerization - Kubelet
Our Team’s Stack:
- Python, Go, bash/sh, C
- Prometheus, Victoria Metrics, Grafana
- Linux Kernel (custom build), Ubuntu
- Intel/AMD/ARM CPUs, Nvidia GPUs, DPUs, Infiniband and Ethernet NICs
- Docker, kubernetes (k8s), KubeVirt, containerd, kubelet
About the role:
- Develop and maintain tools for establishing systems performance baselines
- Develop and maintain performance regression analysis testing automation
- Design and maintain performance regression test pipelines for HPC workloads
- Debug and Tune fabric-level performance to ensure low-latency high throughput configurations
- Development of telemetry for performance analysis across distributed clusters of servers
- Triage and fix performance issues in Linux
- Collect data, produce metrics and visualizations that communicate performance information compared to benchmarks; this data should lead to appropriate business decisions and toward greater automation that improves customer experience in relation to performance
- Define Linux and OS requirements, specifications, and system architecture in relation to systems performance, in collaboration with cross-functional teams. Along with these responsibilities there will also be cross team collaboration to triage and resolve bottlenecks
Who you are:
- 5+ years of professional experience in Systems/HPC Performance Engineering, Benchmarking, and/or Validation.
- Strong experience with MPI workloads and distributed system performance analysis
- Familiarity with RoCE, InfiniBand, and GPUDirect/Data Direct I/O, NUMA, etc in HPC workloads
- Hands-on use of public
Similar Jobs
NVIDIA
Senior HPC Performance Engineer
NVIDIA
Senior HPC and AI Networking Performance Research and Analysis Engineer
Dell Technologies
Senior GenAI & High Performance Computing (HPC) Delivery Engineer
SpaceX
High Performance Computing (HPC) Systems Engineer
Cadence Design Systems
High Performance Computing (HPC) Engineer
HPE