Lead / Manager
Lead Engineer for Manufacturing and Datacenter Lab, Trainium Manufacturing, Quality and Reliability
Annapurna Labs (U.S.) Inc.
Austin, TX, USA
On-site
Posted April 8, 2026
Job Description
Within the Trainium Manufacturing Quality & Reliability (TRN MQR) organization, we are establishing a critical new function that bridges manufacturing outcomes with datacenter operational performance. We are seeking a talented and motivated Manufacturing & Datacenter Preparedness Lab Leader to build and lead this strategic capability in Austin, Texas.
This role will report to the leader of Trainium Manufacturing Quality & Reliability and serve as the essential feedback loop between our ODM/JDM/CM manufacturing operations and AWS datacenter fleet performance. You will establish and operate a specialized preparedness lab focused on analyzing datacenter performance of manufactured Trainium systems to identify root causes of field rework and repairs, feeding critical insights back into manufacturing processes, test strategies, and design improvements.
You will participate in the early phase of manufacturing line development for our next-generation servers and racks, improving our manufacturing flows and informing system design, manufacturing, and fleet operations. You will manage early lifecycle changes, identify initial product quality improvements, and drive supplier quality activities to technical root cause. The candidate will have experience in design or manufacturing and be capable of making wide-ranging business decisions on behalf of the organization.
You'll join a diverse team working across Manufacturing Engineering, Manufacturing Test Engineering, and Quality & Reliability Engineering. You'll collaborate with people across AWS Data Center Engineering, Hardware Design, ODM/JDM/CM partners, and datacenter operations teams to help us deliver the highest standards for safety and reliability while providing seemingly infinite capacity at the lowest possible cost for our customers. And you'll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.
Key job responsibilities
- Own operational production performance of Trainium systems across entire product lifecycle from manufacturing through datacenter deployment and fleet operations
- Design and build preparedness lab replicating datacenter conditions for assembly, repair and system testing
- Define and drive assembly and repair recipes in the manufacturing lab as the baseline prior to high volume manufacturing and datacenter deployment
- Ensure all manufacturing and datacenter test flows are regressed in the manufacturing lab prior to deployment
- Influence hardware design strategy for Design for Manufacturing (DFM), Design for Reliability (DFR), and Design for Test (DFT) based on field failure analysis
- Establish data-driven analytics frameworks connecting manufacturing test data to datacenter performance, leveraging ML techniques to predict field failures
- Build and mentor cross-functional team spanning manufacturing, test, quality, and reliability engineering; perform technical promotion assessments as force multiplier
- Collaborate with AWS datacenter operations teams to understand failure modes, repair patterns, and operational challenges firsthand; translate operator insights and field learnings into actionable manufacturing process improvements and design changes
- Drive continuous improvement reducing failure rates and lifecycle degradation through rapid feedback loops
- Develop or adapt manufacturing processes at the ODM and CM, including defining fixture requirements, critical assembly requirements, test methodology, signal integrity, and power and heat management requirements
About the team
Annapurna Labs is a wholly owned subsidiary of AWS, focused on developing custom silicon and servers including the Nitro(K2), Graviton, Inferentia, and Trainium families of processors.
Machine Learning Annapurna (MLA) functions as a vertically integrated team including software, firmware, hardware, and silicon design in a single organization.
We are the Trainium Servers and Systems organization under MLA focused on Hardware Development, Software Development, Fleet Ops Systems, and Manufacturing, Quality, and Reliability.
This position is in the Manufacturing, Quality and Reliability team.
Basic Qualifications
- BS or MS degree in Electrical Engineering, Mechanical Engineering, Computer Engineering, Industrial Engineering, or related technical fields
- 8+ years industry experience in one or more of the following: Manufacturing Engineering, Test Engineering, Quality Engineering, Reliability Engineering, or Datacenter Infrastructure Engineering
- 7+ years working directly with engineering teams in cross-functional environments
- Experience with AI/ML acceleration systems, high-performance computing servers, or complex multi-rack systems
- Demonstrated track record delivering stable, performant hardware solutions meeting cost and quality targets
- Experience with System Mechanical & Thermal design f