About the role

NVIDIA is at the forefront of the AI revolution, and the AIOps department is critical to ensuring our AI-driven data centers operate with unmatched efficiency. We are looking for a visionary, hands-on Software Engineering Manager to lead a team building the next generation of AI-based monitoring and operation platforms.

This role focuses on leveraging AI Agents to automate, predict, and optimize data center performance at internet scale. If you are a resilient leader who excels in fast-paced environments and has a passion for autonomous system operations, we want you on our team.

What You’ll Be Doing:

Strategic Roadmap Development: Define software design and implementation roadmaps for AI-driven operations, ensuring data center availability, resiliency, and performance through autonomous agent-based monitoring.
Innovative AIOps Engineering: Lead the development of tools and proof-of-concepts focused on software-defined operations, utilizing AI agents to automate root cause analysis and proactive remediation.
Scalable Architecture: Build and scale monitoring applications that handle massive telemetry data from AI infrastructure across public, private, and hybrid cloud environments.
Agentic Frameworks: Oversee the integration of LLM-based agents into CI/CD and operational workflows to shift from reactive monitoring to predictive orchestration.
Team Leadership: Actively hire, mentor, and grow a high-performing engineering team, fostering a culture of technical excellence and creative problem-solving.
Customer Engagement: Directly contribute to internal and external customer engagements to align AIOps solutions with real-world data center challenges.

What We Need to See:

BS/MS degree in Computer Science or a related technical field (or equivalent experience).
8+ years of overall software engineering experience, with at least 2 years in a management or technical lead role.
Domain Expertise: 3+ years of experience in system software engineering for large-scale production systems, with a strong background in Solution Design and Distributed Systems.
Cloud Native Mastery: Deep experience with Docker and Kubernetes orchestration, alongside PaaS or IaaS cloud platforms.
Programming Proficiency: Strong programming skills in Python (essential for AI/ML workflows) and Go.
Operational Intelligence: Extensive knowledge of CI/CD pipelines and automated software-defined operations.
Exceptional written and verbal communication skills to bridge the gap between complex AI logic and operational requirements.

Ways to Stand Out from the Crowd:

AI/ML Background: Experience building or deploying AI Agents (e.g., LangChain, AutoGPT) or using ML models for anomaly detection and predictive analytics.
Infrastructure Knowledge: Familiarity with Ethernet switching, networking protocols, or NVIDIA’s hardware stack (GPUs/DPUs).
Control Systems: Experience in developing autonomous systems or closed-loop feedback monitoring tools.
SaaS Background: Proven track record of managing and scaling cloud-based SaaS applications.

Aplyr's read

NVIDIA is a pioneering force in GPUs and AI, attracting top talent in engineering and innovation-driven roles across various tech domains.
Synthesized from recent postings & public sources

What's promising

• NVIDIA leads the GPU market, crucial for gaming and AI applications.
• The company invests heavily in AI and deep learning, driving technological advancements.
• NVIDIA's strong market position offers stability and growth opportunities for employees.

What to watch

• High competition in the semiconductor industry can impact market share.
• Rapid technological changes require constant adaptation and learning.
• Intense workload and high expectations may affect work-life balance.

Why NVIDIA

• NVIDIA's GPUs are industry benchmarks in gaming and professional graphics.
• The company's AI research is at the forefront of deep learning innovation.
• NVIDIA's culture emphasizes cutting-edge technology and engineering excellence.

Aplyr’s read is generated by AI from public sources.

About NVIDIA

NVIDIA

nvidia.com

NVIDIA is a leading technology company known for its graphics processing units (GPUs) for gaming and professional markets, as well as its advancements in artificial intelligence and deep learning.