Systems Engineer, Kernel (Networking)
CoreWeave
Compensation
$153,000 - $242,000/year
Job Description
Senior Systems Engineer, Kernel Networking
CoreWeave is seeking a specialized Kernel Networking Engineer to join our HAVOCK Team. In this role, you will be the subject matter expert for the networking subsystem of CoreWeave’s Linux-based infrastructure. As we scale our massive AI/HPC clusters, you will focus on optimizing the datapath, tuning the TCP/IP and RDMA stacks, and ensuring the stability of high-throughput workloads across NVIDIA, Mellanox, and Broadcom hardware.
Hardware - Acceleration - Virtualization - Operating Systems - Containerization - Kubelet
Our Team’s Stack:
- Python, Go, bash/sh, C
- Custom Linux Kernel, Ubuntu
- Debug Tools: crash, kdump, drgn, gdb
- Prometheus, VictoriaMetrics, Grafana, Loki
- Docker, Kubernetes (k8s), KubeVirt
Focus Areas:
- Holistic Troubleshooting – Act as the first line of defense for complex system crashes, soft lockups, and kernel panics.
- Cross-Domain Debugging – Identify whether a root cause lies in memory management, storage, or the network layer.
- Incident Response – Reduce "Mean-Time-To-Resolution" by quickly analyzing crash dumps and stack traces.
- Reliability Engineering – Contribute to the "Smarter Triaging" initiative to automate crash analysis.
- Fleet Stability – Ensure kernel support across diverse hardware (CPUs, GPUs, DPUs).
Responsibilities:
- Analyze kernel crashes, oopses, and panics across the entire stack.
- Apply specific networking knowledge to troubleshoot issues with NVIDIA/Mellanox/Broadcom NICs.
- Utilize crash dump analysis (kdump, crash, drgn) to triage issues affecting customer workloads.
- Improve documentation and RCA processes for kernel failures.
- Assist in maintaining kernel builds and CI/CD pipelines to streamline testing.
Requirements:
- 5+ years of experience in systems-level development or kernel engineering.
- Broad Kernel Knowledge: Solid grasp of memory management, scheduling, and filesystems.
- Networking Fluency: Proven track record of troubleshooting RoCE, InfiniBand (IB), and RDMA issues.
- Debugging Mastery: Expert capability with standard utilities and a systematic approach to root-cause analysis.
- Excellent verbal and written communication skills (ability to explain complex kernel bugs to stakeholders).
Nice-to-haves:
- Experience with eBPF for troubleshooting.
- Knowledge of GPU/NVLink architectures.
- Experience working with automated monitoring/alerting systems (Grafana, Jira automation).
- Willingness to present at conferences (LPC, LSF/MM/BPF).
Our compensation reflec