Senior Software Engineer, AIOps

NVIDIA · Semiconductors

Posted: Today

01

About the role

NVIDIA is powering the world's most advanced AI Factories. To ensure their seamless operation, we are building a mission-critical Observability and Prediction platform - delivered as both a high-scale SaaS solution and a robust on-premises deployment for our largest enterprise customers.

We are looking for a Senior Software Engineer to join the AIOps platform team and help build the core distributed systems that ingest massive telemetry streams from GPU clusters and operationalize predictive AI models at scale. You will work at the intersection of high-performance data engineering and production ML, turning research algorithms into reliable, mission-critical software.

What you'll be doing:

  • Architect and build an agentic AIOps system that autonomously monitors GPU fleet health, aggregates and correlates massive telemetry streams, surfaces intelligent alerts, and orchestrates multi-step diagnostic workflows and corrective actions. This system powers real-time dashboards, automated root-cause analysis, and proactive incident response.

  • Research, evaluate, and prototype data storage strategies and data representations across diverse database technologies and modalities, ensuring AI models are trained on high-quality, well-structured data that improves predictive accuracy and generalization.

  • High-Scale Engineering: Design distributed systems to handle the extreme telemetry density of large-scale AI clusters, ensuring efficient data ingestion, processing, and real-time analysis.

  • Instrument services with deep observability (metrics, logs, traces) to support rapid debugging and continuous performance improvement.

  • Build and own the model-serving infrastructure that operationalizes predictive algorithms at scale - packaging, versioning, deploying, and monitoring AI models in both SaaS and on-premises environments.

  • Contribute to the platform's core libraries and abstractions that accelerate development across the broader AIOps engineering team.

What we need to see:

  • B.Sc./M.Sc. in Computer Science, Computer Engineering, or a related technical field.

  • 8+ years of software engineering experience building production distributed systems.

  • Core Systems Programming: Expert-level proficiency in languages such as Go, C++, or Rust, with a focus on high-performance, concurrent architectures.

  • Solid understanding of Kubernetes and container-based deployments for production services.

  • Experience deploying, monitoring, and maintaining ML models or data-intensive services in a production environment.

  • Comfort working in ambiguous, fast-moving environments where the product is still being shaped.

Ways to stand out from the crowd:

  • Experience building ML model-serving platforms or MLOps tooling (model registries, A/B rollout frameworks, feature stores) at scale.

  • A track record of taking systems from prototype to stable, production-grade platform serving real enterprise customers.

  • A "Systems" Thinker: You don't just write software; you understand the full stack, from how data moves across the wire to how it’s processed in a distributed cluster.

  • Practical Innovation: The ability to simplify complex problems and build internal tools or frameworks that empower other engineering teams to move faster.

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you are passionate about building mission-critical systems at the frontier of AI infrastructure, we want to hear from you.

02

Aplyr's read

NVIDIA is a pioneering force in GPUs and AI, attracting top talent in engineering and innovation-driven roles across various tech domains.

Synthesized from recent postings & public sources

What's promising

  • NVIDIA leads the GPU market, which is crucial for gaming and AI applications.
  • The company invests heavily in AI and deep learning, driving technological advancements.
  • NVIDIA's strong market position offers stability and growth opportunities for employees.

What to watch

  • High competition in the semiconductor industry can impact market share.
  • Rapid technological changes require constant adaptation and learning.
  • Intense workload and high expectations may affect work-life balance.

Why NVIDIA

  • NVIDIA's GPUs are industry benchmarks in gaming and professional graphics.
  • The company's AI research is at the forefront of deep learning innovation.
  • NVIDIA's culture emphasizes cutting-edge technology and engineering excellence.

Aplyr’s read is generated by AI from public sources.

03

About NVIDIA

NVDA $212.45 (+3.54%)

NVIDIA is a leading technology company known for its graphics processing units (GPUs) for gaming and professional markets, as well as its advancements in artificial intelligence and deep learning.
