Principal Software Engineer, Rack-Scale System Software — CSP Engagements

NVIDIA · Semiconductors

Compensation

$272,000 - $431,250 USD

Posted

Today

01

About the role

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for rack-scale system SW/FW, working with CSP engineering teams to ensure they can deploy, monitor, and operate these systems reliably at fleet scale. In this role, you will work alongside NVIDIA's cross-functional rack-scale system SW/FW engineering teams and provide dedicated CSP-facing technical leadership. Your focus is on the system-level software that manages, monitors, and recovers the rack as a whole: fabric management, GPU/NVSwitch error handling and recovery, health telemetry APIs, firmware update orchestration, and SW-driven serviceability. You will drive work streams with CSP engineering teams to build shared understanding of the architecture, incorporate their operational feedback, and ensure integration readiness.

What you'll be doing:

  • Drive rack-scale SW/FW architecture alignment across CSP engagements — including fabric management software, link health monitoring, GPU/NVSwitch error handling, SW/FW serviceability features (e.g., hot-plug support, component isolation, firmware-driven recovery), and multi-component firmware orchestration

  • Drive technical work streams with CSP engineering teams on rack-scale system software, ensuring they deeply understand fabric management, NVSwitch behavior, error handling and recovery policies, health telemetry APIs, and SW/FW-controlled recovery operations

  • Capture and synthesize CSP engineering feedback on rack-scale system software (health monitoring APIs, SW-driven serviceability workflows, firmware update orchestration, and error recovery behavior) and champion that feedback in NVIDIA's architecture decisions

  • Collaborate with multi-functional teams to ensure customer operational requirements are reflected in system software and firmware development

  • Identify cross-CSP patterns in rack-scale SW/FW issues, error handling behavior, and system configuration practices — drive documentation, tooling, and test strategy improvements as a result

  • Collaborate with execution teams on left-shift strategy — ensuring customer-side SW/FW integration work is identified early and completed ahead of hardware availability

  • Make critical technical decisions on rack-scale system SW/FW tradeoffs and mitigate execution risks through early engagement with CSP engineering teams

What we need to see:

  • 15+ years of experience in system software, platform firmware, or large-scale distributed systems engineering. BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience)

  • Deep understanding of rack-scale system software challenges: multi-component coordination, error propagation, health monitoring, and serviceability / reliability

  • Experience with fabric management software, cluster management, or system-level orchestration frameworks. Familiarity with firmware architectures and update lifecycle management (multi-component update sequencing, rollback, recovery)

  • Understanding of error handling and recovery design patterns in distributed systems — fault isolation, retry policies, graceful degradation

  • Experience with health monitoring and telemetry systems: health scoring, event correlation, API design for fleet-level observability

  • Understanding of GPU or accelerator system software (drivers, device management, power management) is a strong plus

  • Customer obsession — genuine passion for understanding how CSPs operate sophisticated systems at fleet scale and simplifying their experience

  • Proven success providing technical leadership across organizational boundaries and influencing system software design without direct authority. Strong communication skills, including the ability to translate complex system software architecture into actionable guidance for customer engineering teams

Ways to stand out from the crowd:

  • Experience with NVIDIA NVSwitch, NVOS, or GPU fabric management software

  • Background in system software for large-scale clusters at a hyperscaler (cluster management, fleet orchestration, health platforms)

  • Experience crafting error handling and recovery frameworks for multi-component systems (hundreds or thousands of coordinating devices)

  • Familiarity with GPU or accelerator fleet operations — driver lifecycle, firmware rollout strategies, health-based scheduling

  • Understanding of how system software decisions impact serviceability, availability, and operational cost at fleet scale

NVIDIA’s invention of the GPU in 1999 fueled the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company and establish teams with the most thoughtful people in the world. Are you ready to change the next generation of computing? Join us at the forefront of technological advancement.

NVIDIA data center systems, such as DGX and HGX, have become core to NVIDIA's rapidly growing enterprise and cloud provider businesses. These platforms bring together the full power of NVIDIA GPUs, NVIDIA NVLink, NVIDIA InfiniBand networking, NVIDIA Grace CPUs, and a fully optimized NVIDIA AI and HPC software stack.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 30, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

02

Aplyr's read

NVIDIA is a pioneering force in GPUs and AI, attracting top talent in engineering and innovation-driven roles across various tech domains.

Synthesized from recent postings & public sources

What's promising

  • NVIDIA leads the GPU market, crucial for gaming and AI applications.
  • The company invests heavily in AI and deep learning, driving technological advancements.
  • NVIDIA's strong market position offers stability and growth opportunities for employees.

What to watch

  • High competition in the semiconductor industry can impact market share.
  • Rapid technological changes require constant adaptation and learning.
  • Intense workload and high expectations may affect work-life balance.

Why NVIDIA

  • NVIDIA's GPUs are industry benchmarks in gaming and professional graphics.
  • The company's AI research is at the forefront of deep learning innovation.
  • NVIDIA's culture emphasizes cutting-edge technology and engineering excellence.

Aplyr’s read is generated by AI from public sources.

03

About NVIDIA

NVDA $195.74 (-1.64%)

NVIDIA is a leading technology company known for its graphics processing units (GPUs) for gaming and professional markets, as well as its advancements in artificial intelligence and deep learning.
