Software Engineer, High Performance Computing

Eventual

San Francisco

Remote

Posted September 22, 2025

Job Description

About Eventual

Every breakthrough Physical AI system — humanoid robots, autonomous vehicles, video generation models — is trained on petabytes of video, lidar, radar, and sensor data. But today's data platforms (Databricks, Snowflake) were built for spreadsheet-like analytics, not the multimodal corpora that power AI. Robotics and video-AI teams now lose 20-40% of their training time to dataloading alone. GPU bandwidth has grown 2-3× per generation. Storage and pipelines haven't. The gap widens every year.

Eventual was founded in 2022 to close it. Our open-source engine, Daft, is the distributed data engine purpose-built for multimodal AI, already running 2 PB/day at Amazon, 60-100 PB at another FAANG company, and in production at Mobileye, TogetherAI, and CloudKitchens. We are building a video-native index on top of our engine for Physical AI that streams curated datasets to GPUs at line rate. It saturates B200s today and is aimed at NVL72 and Vera Rubin tomorrow.

We're building this in partnership with the top Physical AI labs and public AI infrastructure companies today. We have raised $30M from Felicis, CRV, Microsoft M12, Citi, Essence, Y Combinator, Caffeinated Capital, Array.vc, and angels including the co-founders of Databricks and Perplexity. We've assembled a world-class team from AWS, Render, Pinecone, and Tesla. We have spent our careers powering the last generation of Physical AI in self-driving, and are excited to now do this for the next.

Join our small (but powerful!) team working together 4 days/week in our SF Mission district office.

Your Role

As a Systems Engineer on the Dataloading team, you'll build the layer that turns multi-petabyte video corpora into dict[str, Tensor] already on the GPU at line rate. We work with the top labs training Physical AI on the newest-generation hardware (H100, B200, GB200, NVL72, with Vera Rubin on the horizon) and on billions of dollars' worth of compute, in collaboration with partners that are the largest public AI companies on Earth. Our job is to keep those GPUs fed: rank-aware sampling, NVMe caching, video and sensor co-loading, random access into clips, decode pipelining. Streaming alone can already saturate a B200; the hard part is enabling the complex sampling patterns researchers actually need without giving up a single percentage point of MFU.

This is a systems engineering role for someone who feels physical pain when a system is slow. You won't need GPU experience on day one; we'll uplevel you on NVL72, CUDA, and SLURM. We will need you to bring real expertise in what happens between NVMe, network, memory, and CPU, and a deep instinct for where bytes go.

Key Responsibilities

Design and build the video-native dataloader: rank-aware, NVMe-cached, with random access into clips and tensors returned directly to the GPU.
Profile and optimize the full data path from object store → NVMe → page cache → host RAM → device RAM. Eliminate every avoidable copy and stall.
Saturate the latest hardware (B200, GB200, NVL72) on real customer training jobs. Push toward Vera Rubin bandwidth requirements.
Own performance benchmarks against customer baselines (custom DataLoaders, DALI, decord, LeRobot) and against our own historical numbers — regressions get caught at PR time.
Partner with researchers at our partner labs to land the loader in their training stack and measure MFU end-to-end.
Work cross-team with Storage Infrastructure on the index/format boundary and with Visual Understanding on the model-output ingestion path.

What we look for

Obsession with systems-level performance. You can recite Jeff Dean's "numbers every programmer should know" in your sleep. You eat flamegraphs for breakfast.
Strong opinions on io_uring — love it or hate it, you've earned the opinion.
Live and breathe Rust, C++, or C. You reach for them when it matters and you know why.
Strong familiarity with operating systems — page cache, scheduling, syscalls, NUMA, memory hierarchies.
A sense for where bytes actually go: NVMe vs. memory vs. network vs. PCIe vs. NVLink, and the throughput and latency budgets of each.

Nice to have

Experience working with GPUs is a plus, but you don't need it on day one.
Experience working with SLURM, Kubernetes for GPU workloads, or other HPC schedulers.
Hands-on CUDA experience.
Deep expertise in memory and caching subsystems: page cache tuning, hugepages, NUMA pinning, GPUDirect Storage.
Experience with video decode pipelines (PyAV, decord, NVDEC) or PyTorch DataLoader internals.
Contributions to open-source systems projects in Rust/C++.

Perks & Benefits

In-person, tight-knit team — 4 days/week in our SF Mission office.
Competitive comp and meaningful startup equity.
Catered lunches and dinners for SF employees.
Commuter benefit.
Team-building events and poker nights.
Health, vision, and dental coverage.
Flexible PTO.
Latest Apple equipment.
401(k) plan with match.

If slow systems evoke emotional pain for you and you want to spend the next few years making the most expensive GPU clusters on the planet earn their keep, we'd love to talk.
