The Opportunity
Postman is seeking an experienced AI Systems Reliability Engineer to help define, build, and maintain the infrastructure and processes that ensure the reliability, scalability, and performance of Postman’s AI-powered API and agentic systems in production. This role focuses on monitoring, availability, incident response, and automation to support AI services and tools trusted by millions of developers globally.
What You’ll Do
- Develop and manage reliability metrics (SLOs) for AI-driven API services and agentic AI platform features
- Implement comprehensive observability and monitoring systems for real-time performance and fault detection
- Design and drive automated failover, recovery, and incident response strategies for high-availability AI infrastructure
- Optimize resource utilization, particularly GPU/accelerator efficiency, ensuring cost-effective AI system operation
- Collaborate closely with engineering, platform, and product teams to align reliability efforts with broader organizational goals
- Lead efforts to build internal tooling and automation focused on AI system stability and operational excellence
- Drive continuous improvement in deployment practices, monitoring approaches, and incident management processes
About You
- Have a strong background in AI reliability engineering, SRE, or DevOps for distributed systems
- Understand the unique challenges of maintaining large-scale AI systems and integrating AI-specific metrics into reliability frameworks
- Are experienced with cloud platforms, monitoring tools, and incident response automation
- Are comfortable collaborating across teams to influence best practices for AI system reliability and operational health
- Thrive in dynamic, fast-paced environments focusing on delivering rel