About the role
NVIDIA is at the forefront of the AI revolution, and the AIOps department is critical to ensuring our AI-driven data centers operate with unmatched efficiency. We are looking for a visionary, hands-on Software Engineering Manager to lead a team building the next generation of AI-based monitoring and operation platforms.
This role focuses on leveraging AI Agents to automate, predict, and optimize data center performance at internet scale. If you are a resilient leader who excels in fast-paced environments and has a passion for autonomous system operations, we want you on our team.
What You’ll Be Doing:
Strategic Roadmap Development: Define software design and implementation roadmaps for AI-driven operations, ensuring data center availability, resiliency, and performance through autonomous agent-based monitoring.
Innovative AIOps Engineering: Lead the development of tools and proofs of concept focused on software-defined operations, utilizing AI agents to automate root cause analysis and proactive remediation.
Scalable Architecture: Build and scale monitoring applications that handle massive telemetry data from AI infrastructure across public, private, and hybrid cloud environments.
Agentic Frameworks: Oversee the integration of LLM-based agents into CI/CD and operational workflows to shift from reactive monitoring to predictive orchestration.
Team Leadership: Actively hire, mentor, and grow a high-performing engineering team, fostering a culture of technical excellence and creative problem-solving.
Customer Engagement: Directly contribute to internal and external customer engagements to align AIOps solutions with real-world data center challenges.
What We Need to See:
BS/MS degree in Computer Science or a related technical field (or equivalent experience).
8+ years of overall software engineering experience, with at least 2 years in a management or technical lead role.
Domain Expertise: 3+ years of experience in system software engineering for large-scale production systems, with a strong background in Solution Design and Distributed Systems.
Cloud Native Mastery: Deep experience with Docker and Kubernetes orchestration, alongside PaaS or IaaS cloud platforms.
Programming Proficiency: Strong programming skills in Python (essential for AI/ML workflows) and Go.
Operational Intelligence: Extensive knowledge of CI/CD pipelines and automated software-defined operations.
Exceptional written and verbal communication skills to bridge the gap between complex AI logic and operational requirements.
Ways to Stand Out from the Crowd:
AI/ML Background: Experience building or deploying AI Agents (LangChain, AutoGPT) or using ML models for anomaly detection and predictive analytics.
Infrastructure Knowledge: Familiarity with Ethernet switching, networking protocols, or NVIDIA’s hardware stack (GPUs/DPUs).
Control Systems: Experience in developing autonomous systems or closed-loop feedback monitoring tools.
SaaS Background: Proven track record of managing and scaling cloud-based SaaS applications.
Aplyr's read
NVIDIA is a pioneering force in GPUs and AI, attracting top talent in engineering and innovation-driven roles across various tech domains.
What's promising
- NVIDIA leads the GPU market, crucial for gaming and AI applications.
- The company invests heavily in AI and deep learning, driving technological advancements.
- NVIDIA's strong market position offers stability and growth opportunities for employees.
What to watch
- High competition in the semiconductor industry can impact market share.
- Rapid technological changes require constant adaptation and learning.
- Intense workload and high expectations may affect work-life balance.
Why NVIDIA
- NVIDIA's GPUs are industry benchmarks in gaming and professional graphics.
- The company's AI research is at the forefront of deep learning innovation.
- NVIDIA's culture emphasizes cutting-edge technology and engineering excellence.
Aplyr’s read is generated by AI from public sources.
About NVIDIA
NVIDIA is a leading technology company known for its graphics processing units (GPUs) for gaming and professional markets, as well as its advancements in artificial intelligence and deep learning.