Operations Engineer, HPC Networking
CoreWeave
Compensation
$110,000 - $179,000/year
Job Description
What You’ll Do:
In this role, you will support the deployment, monitoring, troubleshooting, and maintenance of large-scale InfiniBand fabrics, ensuring their stability and performance. The ideal candidate will have a strong operations mindset, effective collaboration skills, and the ability to solve complex issues in a dynamic environment.
- Regularly monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.
- Investigate and resolve operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks.
- Assist with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams.
- Perform routine maintenance and upgrades on InfiniBand switches and control plane components.
- Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise.
Investing in our people is one of our top priorities, and we value candidates who can bring their diverse experiences to our teams. Here are some qualities we’ve found compatible with our team. We’d love to talk about whether this aligns with your experience and interests and what you’re excited to work on next.
Who You Are:
Minimum Qualifications
- At least 1 year of experience with InfiniBand or similar networking technologies.
- Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting.
- Experience with Linux system administration and maintenance.
- Proficiency in at least one scripting language.
Preferred Qualifications
- Hands-on experience with NVIDIA UFM or similar fabric management tools.
- Familiarity with the Slurm job scheduler and its role in HPC environments.
- Experience with monitoring and visualization platforms such as Grafana or Prometheus.
- Experience with operational tooling and automation frameworks like Ansible.
- Knowledge of data center operations, including server racks and cabling.
- Scripting experience in Python or Bash.
Why CoreWeave?
At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:
- Be Curious at Your Core
- Act Like an Owner
- Empower Employees
- Deliver Best-in-Class Client Experiences
- Achieve More Together
We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization's growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!