About the role

Aplyr's Quick Take

This role is for a technical expert focused on managing and improving large-scale GPU computing clusters for AI training. You'll be responsible for ensuring the reliability and efficiency of the infrastructure that supports machine learning workloads, working closely with research and engineering teams.

Good fit

Ideal candidates have over 5 years of experience in infrastructure and systems engineering, particularly in distributed systems. A hands-on, problem-solving mindset and a strong understanding of Infrastructure as Code and CI/CD practices will help you thrive here.

Worth noting

The salary range is quite broad, indicating potential for negotiation based on experience. The role emphasizes a high level of technical expertise and ownership, which may not suit those looking for a less hands-on position.

About Hark

Hark is an artificial intelligence company building advanced, personalized intelligence: one that is proactive, multimodal, and capable of interacting with the world through speech, text, vision, and persistent memory.

We're pairing that intelligence with next-generation hardware to create a universal interface between humans and machines. While today's AI largely operates through chat boxes and decade-old devices, Hark is focused on what comes next: agentic systems that interact naturally with people and the real world.

To get there, we're developing multimodal models and next-generation AI hardware together, designed from the ground up as a single, unified interface for a new era of intelligent systems.

About the Role

We are looking for a Member of Technical Staff, Infrastructure Compute to lead and manage large-scale GPU computing clusters powering our AI training and deployment workloads. You'll work at the intersection of systems engineering and machine learning infrastructure, owning the reliability, scalability, and efficiency of the compute platform that our research and engineering teams depend on. This is a high-impact, highly technical role suited for someone who thrives in complex distributed systems environments and cares deeply about infrastructure as a product.

Responsibilities

Design, implement, and maintain Infrastructure as Code (IaC) best practices to enable repeatable, auditable, and scalable cluster provisioning.
Enhance and harden CI/CD deployment pipelines to ensure robust, secure, and low-latency model service delivery across production environments.
Own and evolve stable training infrastructure operating at the scale of 10,000+ GPUs, including job scheduling, fault tolerance, and network fabric optimization.
Partner closely with ML researchers and engineers to understand compute bottlenecks and translate them into infrastructure improvements.
Monitor system health, define SLOs, and lead incident response for critical training and inference workloads.
Drive capacity planning, cost efficiency initiatives, and hardware lifecycle management across the GPU fleet.
Contribute to internal tooling and platform abstractions that improve developer experience for teams consuming compute resources.

Requirements

5+ years of experience in infrastructure, systems, or platform engineering, with at least 2 years working in ML or HPC environments.
Demonstrated experience managing GPU clusters or large-scale distributed compute infrastructure.
Strong proficiency in at least one systems or infrastructure programming language.
Deep understanding of networking fundamentals (RDMA, InfiniBand, or RoCE a plus) relevant to high-throughput training workloads.
Experience with container orchestration, job scheduling, and multi-tenant resource management.
Proven track record owning production systems with high reliability requirements.
Strong debugging and observability skills across the full infrastructure stack.

Bonus Qualifications

Kubernetes (K8s), particularly experience operating large, GPU-aware clusters.
Pulumi or similar modern IaC tooling.
Rust and/or Go for systems-level tooling and performance-critical services.
Familiarity with PyTorch and Ray for understanding workload patterns and integration requirements.

Compensation

The US base salary range for this full-time position is between $180,000 and $450,000 annually.

The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components and benefits depending on the specific role. This information will be shared if an employment offer is extended.

Skills & Tags

go, rust, kubernetes, machine learning, ai, data, product design

Aplyr's read

Hark leverages AI-driven data analytics to deliver business insights, attracting a diverse team of engineers, designers, and technical experts.
Synthesized from recent postings & public sources

What's promising

• Hark's focus on AI and machine learning positions it at the forefront of data-driven business solutions.
• The company offers diverse roles, from engineering to creative social leads, indicating a broad scope of operations.
• Hark's recent hires in specialized technical fields suggest a commitment to cutting-edge technology and innovation.

What to watch

• The competitive landscape in AI analytics could challenge Hark's market share and growth.
• Limited public information about Hark's financial health and long-term sustainability.
• Potentially high-pressure environment due to the fast-paced nature of AI and tech development.

Why Hark

• Hark's integration of AI with multimodal capabilities sets it apart in data analytics.
• The company's emphasis on both technical and creative roles highlights a balanced approach to innovation.
• Hark's recruitment of niche technical experts suggests a focus on specialized, advanced technology solutions.

Aplyr’s read is generated by AI from public sources.

About Hark

hark.com

Hark is a data analytics platform that specializes in providing insights for businesses through the use of artificial intelligence and machine learning.