VP of Product, Research and Training Infrastructure
CoreWeave
Job Description
The Mission
As CoreWeave continues to solidify its position as the Essential Cloud for AI, we are seeking a visionary VP of Product, Research and Training Infrastructure. This executive leader will own the product strategy and engineering execution for the services that power the most ambitious AI research labs in the world. You will bridge the gap between "the metal" and the researcher, delivering a seamless, high-performance environment where frontier models are born.
The Role: Architect of the AI Factory
You will lead the product strategy of our Research Training Stack, focusing on the specialized orchestration, evaluation, and iteration tools required for massive-scale pre-training and post-training. This is a mission-critical role at the intersection of high-performance computing (HPC) and cloud-native agility.
Core Responsibilities
- Frontier Orchestration: Oversee the evolution of SUNK (Slurm on Kubernetes) to provide researchers with deterministic, bare-metal performance through a cloud-native interface.
- Holistic Training Services: Beyond Slurm, drive the development of next-generation orchestrators and automated training-based evaluation frameworks that ensure model quality throughout the lifecycle.
- Post-Training Excellence: Build the infrastructure required for sophisticated Reinforcement Learning (RL) and RLHF pipelines, enabling labs to refine foundation models with maximum efficiency.
- Customer Advocacy: Act as the primary technical partner for lead researchers at global AI labs, translating their "future-state" requirements into actionable product roadmaps.
Requirements: Deep Research & Infrastructure Mastery
- Proven Leadership: 15+ years of experience in engineering leadership, with at least 5 years managing large-scale infrastructure at a top-tier research lab or an AI-native cloud provider.
- Domain Expertise: Deep, hands-on knowledge of Slurm, Kubernetes, and the specific networking requirements (InfiniBand/RDMA) for distributed training clusters.
- Research Mindset: You likely come from a background supporting frontier model research (pre-training and post-training) and understand the "pain points" of a research scientist.
- Scaling Experience: A track record of delivering mission-critical services on multi-thousand-GPU clusters (H100/Blackwell/Rubin architectures).
- Strategic Vision: Ability to define "what’s next" in the AI stack, from automated RL loops to specialized sandbox environments.
Why CoreWeave?
In 2026, CoreWeave is the foundation of the largest infrastructure buildout in human history. We are building AI Factories, not just data centers.
- Silicon-Up Innovation: Work directly with the latest NVID