Mid-Level

Director, Data Engineering

CZ Biohub

Compensation

$323,000 - $444,400/year

New York, NY (Hybrid); Redwood City, CA (Hybrid)
Posted April 18, 2026

Job Description

Biohub is the first large-scale initiative bringing frontier AI models, massive compute, and frontier experimental capabilities under one roof. We're building a general-purpose system to accelerate scientific discovery, integrating frontier AI models, biological foundation models, and lab capabilities, with the ultimate goal of curing disease. Our technology powers scientists around the world, translating AI capabilities into tools that accelerate research everywhere.

The Team

Our AI research team sits at the heart of our mission to unlock new dimensions of biological understanding. You will leverage state-of-the-art AI to accelerate discovery and drive transformative insights in biology — developing novel AI models purpose-built for biological research, engineering robust systems that enable breakthrough science at unprecedented scale, and translating these advances into practical tools that empower researchers worldwide.

Our approach is comprehensive and integrated, bringing together world-class AI model development, exceptional engineering talent, high-quality biological data, powerful computing infrastructure, and strategic partnerships. Success requires excellence across five interconnected pillars: training frontier AI models specifically for biology; building engineering systems that maximize research velocity and efficiency; executing a sophisticated data strategy that fuels AI development; operating a world-class AI compute platform; and creating impactful products that transform AI capabilities into accessible scientific tools.

The Opportunity

This role will lead Data Engineering, which builds the infrastructure that makes our biological foundation models possible: ingesting data from public repositories, merging it with large-scale internal data generation projects, transforming heterogeneous biological formats into AI-ready datasets, and delivering petabytes of training data to researchers pushing the boundaries of what's possible in biological AI. You'll set technical direction, hire and develop engineers, and ensure we deliver reliable, scalable systems that keep pace with our ambitions. The software your team builds will directly shape what our models can learn.

This is a player-coach role. You'll spend meaningful time on technical leadership (architecture decisions, code review, unblocking hard problems) while also building and managing a high-performing team. We're a small organization with significant resources and long time horizons, and we value small, high-functioning teams. We use AI tools aggressively, care deeply about code quality and operational reliability, and want leaders who understand why biology matters.

If you want to build and lead a team at the intersection of large-scale infrastructure and frontier science, with real autonomy and the chance to shape something genuinely new, we'd like to talk.

What You'll Do

  • Lead a team of data engineers, setting technical direction and ensuring delivery of reliable, scalable data infrastructure
  • Drive architecture decisions for petabyte-scale pipelines deployed across multiple compute environments (cloud, on-prem) that ingest, transform, and deliver genomic and imaging data for model training
  • Build a culture of operational excellence—99%+ pipeline reliability, strong observability, and systems that scale without manual intervention
  • Recruit, develop, and retain exceptional engineers who combine large-scale infrastructure experience with biological intuition
  • Partner with AI Research, Data Science, and Scientific Data Strategy to translate model and data requirements into engineering priorities 

What You'll Bring

  • 10+ years of experience leading data engineering or infrastructure teams through periods of growth and technical complexity, including 5 or more years as a people manager
  • Track record of building AI training data pipelines at petabyte scale with high reliability requirements
  • Strong technical foundations—you can go deep on architecture, review code, and unblock hard problems
  • Experience hiring and developing engi