Principal, Customer Reliability Engineer
Crusoe
Compensation
$230,000 - $280,000/year
Job Description
Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.
We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.
We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.
If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.
About the Role:
As a Principal Customer Reliability Engineer, you define and elevate the technical reliability strategy of Crusoe Cloud at the company level.
You are an organization-wide authority in distributed systems, AI/ML infrastructure, networking, storage, compute, Kubernetes, and cloud operations. Your impact extends beyond CX: you shape how Crusoe designs, deploys, and scales high-performance GPU infrastructure.
This is not an escalation engineer role. This is a systems architect and reliability strategist role with direct impact on enterprise readiness and revenue protection.
What You'll Be Working On:
Org-Level Reliability Strategy
Define the technical vision for AI/ML workload reliability.
Architect guardrails across compute, storage, networking, and orchestration.
Partner with Product & Engineering to influence roadmap decisions impacting scalability and resilience.
Incident & Risk Governance
Lead post-incident structural reforms for major outages.
Define enterprise-grade incident management standards.
Establish reliability metrics that align with ARR protection and expansion.
Advanced Systems Architecture
Evaluate and improve:
Kubernetes multi-cluster design
Software-defined networking
InfiniBand (IB) fabric architecture
GPU lifecycle management
Observability frameworks
Drive automation-first operational maturity.
Executive & External Credibility
Serve as technical spokesperson during high-severity events.
Build enterprise confidence in Crusoe’s technical depth.
Contribute to technical thought leadership (blogs, architecture reviews, customer briefings).
Talent Multiplier
Mentor Sr. Staff engineers.
Raise the hiring bar for advanced infrastructure roles.
Create technical learning frameworks for HPC & AI operations.
Work on tooling and automation for the CX team.
Engage with customers during their onboarding phase.
Work on executive-level escalations and high-priority incidents.