About the role

NVIDIA's Infrastructure Specialists team is hiring a Senior Solutions Architect - AI Factory Observability & Visualization! This remote role develops full-spectrum visibility that supports the smooth functioning of HPC systems and AI factories, transforming intricate telemetry across network and compute into straightforward, actionable perspectives.

The role calls for a complete, end-to-end understanding of the HPC/AI system: running and interpreting microbenchmarks and workloads to confirm system readiness, then establishing the observability that keeps the system in that state. The work involves collaborating across NVIDIA teams to help partners see, understand, and respond to HPC system and AI factory performance, from hardware to workload.

What You Will Be Doing:

Run AI factory validation tools, microbenchmarks, and workloads provided by the team, and interpret results to assess system health and performance.
Gain a comprehensive understanding of the system from start to finish, including network topology, interconnects, and compute.
Establish what "healthy" means across the stack: the metrics, logs, and signals that confirm a system is functioning well, and the thresholds that show it isn't.
Build and extend the telemetry surface across hardware, fabric, and workload, crafting how data is collected, transformed, stored, and surfaced.
Serve as the observability expert, investigating gaps in visibility to ensure it reflects true system behavior.
Develop automation (Python, Shell) for collecting, transforming, and presenting system and network data.
Recommend improvements to system visibility, data sources, and reporting that give teams clearer insight.
Collaborate with hardware, software, networking, datacenter, and product groups to ready HPC systems and AI factories for customer deployment, contributing documentation and readiness materials throughout the process.

What We Need to See:

Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.
6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings.
Hands-on experience with the architecture of multi-GPU and/or multi-node clusters, including networking and interconnects.
Solid grasp of how HPC and AI factory systems fit together end to end, from network fabric through compute.
Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
Practical experience working with observability systems (e.g., Prometheus, Grafana, Loki, or similar), including building custom exporters or collectors, setting up alerts, and handling metric cardinality and retention at scale.
Experience transforming metrics, logs, and traces into clear, actionable insight for complex distributed environments.
Familiarity with GPU and fabric telemetry (e.g., DCGM, NVLink, InfiniBand/Ethernet fabric counters) and using it to diagnose performance regressions.
Strong communication skills and the ability to work effectively with cross-functional teams.

Ways to Stand Out From the Crowd:

Experience with AI factory or large-scale AI infrastructure build, deployment, or operations.
Background in HPC systems engineering, SRE, or systems analysis for GPU-accelerated environments.
Experience building automation and data pipelines that feed dashboards and reporting at scale.
Demonstrated desire to use AI to solve practical problems, improve workflows, and guide data-driven decisions.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 28, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Aplyr's read

NVIDIA is a pioneering force in GPUs and AI, attracting top talent in engineering and innovation-driven roles across various tech domains.
Synthesized from recent postings & public sources

What's promising

• NVIDIA leads the GPU market, crucial for gaming and AI applications.
• The company invests heavily in AI and deep learning, driving technological advancements.
• NVIDIA's strong market position offers stability and growth opportunities for employees.

What to watch

• High competition in the semiconductor industry can impact market share.
• Rapid technological changes require constant adaptation and learning.
• Intense workload and high expectations may affect work-life balance.

Why NVIDIA

• NVIDIA's GPUs are industry benchmarks in gaming and professional graphics.
• The company's AI research is at the forefront of deep learning innovation.
• NVIDIA's culture emphasizes cutting-edge technology and engineering excellence.

Aplyr’s read is generated by AI from public sources.

About NVIDIA

nvidia.com

NVIDIA is a leading technology company known for its graphics processing units (GPUs) for gaming and professional markets, as well as its advancements in artificial intelligence and deep learning.