Principal
Principal PMT-ES - AI/ML Training, Annapurna Labs
Annapurna Labs (U.S.) Inc.
Cupertino, CA, USA
On-site
Posted April 16, 2026
Job Description
AWS Trainium is deployed at scale, with millions of chips in production, used for training and inference of frontier models. AWS Neuron is the software stack for Trainium, enabling customers to run deep learning and generative AI workloads with optimal performance and cost efficiency.
AWS Neuron is hiring a Principal Technical Product Manager to define and drive product strategy for training software on Trainium. This includes distributed training libraries, post-training workflows (RLHF, DPO, fine-tuning), reinforcement learning frameworks, and training performance optimization. Your mission is to enable researchers and operators to train frontier models at scale on Trainium, from single-node experimentation to distributed training across thousands of nodes.
You will be the champion inside AWS for frontier model builders pushing the bounds of scale and resilience for current and emerging training paradigms. You will work with customers inside and outside the company to identify key improvements and stay ahead of the training landscape. You will define how Neuron supports the training AI/ML ecosystem and what tools customers will use for their training workflows on Trainium.
To be successful, you will partner with engineering teams building training libraries and distributed training infrastructure, applied scientists developing optimization techniques, and PMs responsible for compiler, runtime, NKI, and infrastructure. You will develop deep knowledge of AI/ML training architectures, distributed training systems, model parallelism strategies, and training performance optimization to effectively define product strategy and make informed technical decisions.
The Ideal Candidate
The ideal candidate will have a solid understanding of large-scale model training, distributed training architectures, post-training workflows, and reinforcement learning. They should be able to assess the technical implications of training software stack decisions, understand customer needs, and drive developer experience improvements. The ideal candidate can navigate ambiguity in a fast-moving, early-stage initiative, balance competing priorities across multiple workstreams, and drive alignment across engineering and science stakeholders through excellent written and verbal communication.
Key job responsibilities
Training Product Strategy & Roadmap
Define and execute training product strategy and roadmap working backwards from customer requirements in collaboration with engineering leadership. Define the vision for how customers train frontier models at scale on Trainium, balancing performance, developer experience, and AI/ML ecosystem compatibility. Produce PRFAQs and PRDs for training capabilities. Drive technical alignment across Neuron training libraries, distributed training infrastructure, and dependencies. Partner with PMs responsible for compiler, NKI, runtime, and infrastructure. Drive trade-offs between training performance, scalability, developer experience, and AI/ML ecosystem compatibility. Define requirements for reusable training building blocks that compose into end-to-end workflows.
Post-Training, RL & Emerging Workflows
Drive strategy for post-training workflows including RLHF, DPO, reward modeling, and fine-tuning at scale. Define requirements for how Neuron supports emerging training paradigms, model architectures, and RL-based optimization loops. Lead the product experience for RL research-to-production workflows on Trainium. Create and optimize RL libraries and frameworks to help researchers and production model builders.
Customer Engagement & Enablement
Work with BD, Solutions Architecture, and GTM teams to engage customers training frontier models on Trainium. Understand their distributed training challenges, RL needs, performance optimization requirements, and framework preferences. Translate customer pain points into product requirements. Define success metrics for training adoption and performance. Support customer enablement for training migration and optimization.
Training AI/ML Ecosystem & Delivery
Define how Neuron supports the training AI/ML ecosystem and what tools customers will use for their training workflows on Trainium. Own the technical depth on training-specific AI/ML ecosystem tools and define how Neuron's training libraries integrate with them. Track training-specific AI/ML ecosystem trends and feed them into product planning. Drive open source community engagement and upstream contributions for training-related tools. Coordinate with BD on partnership discussions where training-specific technical input is needed.
Launch & Go-to-Market
Lead end-to-end launches for training capabilities, coordinating documentation, field enablement, and customer communications. Partner with Marketing and Solutions Architecture to drive awareness and adoption.