Senior Principal Infrastructure Services (SRE Practice)
Northern Trust
Job Description
About Northern Trust:
Northern Trust, a Fortune 500 company, is a globally recognized, award-winning financial institution that has been in continuous operation since 1889.
Northern Trust is proud to provide innovative financial services and guidance to the world’s most successful individuals, families, and institutions by remaining true to our enduring principles of service, expertise, and integrity. With more than 130 years of financial experience and over 22,000 partners, we serve the world’s most sophisticated clients using leading technology and exceptional service.
About the Company & Role
Northern Trust is seeking an experienced Sr. Principal Site Reliability Engineer with a strong focus on developing observability and automation. This role will play a pivotal part in ensuring the reliability and performance of the company’s systems and services. As a Sr. Principal Site Reliability Engineer, you will be responsible for defining and deploying key observability services, with a deep focus on architecture, production operations, capacity planning, performance management, deployment, and release engineering. You will work with cross-functional teams to improve the efficiency of our services. Your expertise in both software engineering and system operations will enable our partners to drive continuous improvements in our platform’s reliability. This role will focus on bringing complete observability across all technologies.
This role is responsible for a number of key functions that both support and drive improvements to the reliability of Northern Trust’s IT landscape.
What you will do
Reliability‑Focused System Design & Architecture
- Lead the design and evolution of highly reliable, scalable, and performant distributed systems, applying SRE principles across infrastructure and application layers.
- Partner with engineering and architecture teams to influence system design decisions that improve resilience, fault tolerance, and operational simplicity.
- Define and promote reliability patterns, architectural best practices, and non‑functional requirements aligned with business criticality.
SRE Operations & Automation
- Drive an automation‑first approach by designing and developing tools, scripts, and platforms that reduce manual effort, operational toil, and human error.
- Embed reliability engineering into the software delivery lifecycle through CI/CD integration, safe deployments, and repeatable operational workflows.
- Establish clear operational metrics and service health indicators to ensure transparency and accountability.
Incident Management & Root Cause Analysis
- Participate in and lead incident response for production systems, ensuring timely mitigation and minimal customer or business impact.
- Conduct and drive blameless post‑incident reviews, focusing on identifying systemic causes rather than individual faults.
- Implement long‑term corrective actions to prevent recurrence and measurably improve system reliability.
Monitoring, Alerting & Observability
- Architect and implement end‑to‑end observability across systems using metrics, logs, and traces to enable rapid diagnosis and proactive issue detection.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to balance reliability with feature velocity.
- Build and maintain actionable dashboards and alerts that provide real‑time insights into system health, performance, and risk.
Continuous Reliability Improvement
- Identify reliability gaps through data analysis, failure reviews, and resilience testing, driving targeted improvement initiatives.
- Lead efforts such as capacity planning, load testing, chaos engineering, and fault injection to validate system behavior under stress.
- Continuously reduce operational toil, improve mean time to detect (MTTD) and mean time to recover (MTTR), and raise overall service maturity.
Documentation & Knowledge Sharing
- Create and maintain clear, accurate, and actionable documentation including system architectures, runbooks, operational standards, and incident playbooks.
- Ensure documentation supports operational readiness, repeatability, and effective knowledge transfer across teams.
Cross‑Functional Collaboration & Influence
- Work closely with product, development, platform, security, and operations teams to embed SRE principles into roadmap planning and delivery.
- Act as a trusted advisor, translating reliability data and operational risk into business‑relevant insights for technical and non‑technical stakeholders.
- Advocate for SRE best practices and help build a strong reliability culture across the organization.
Project & Initiative Leadership
- Manage and prioritize multiple reliability‑focused initiatives, balancing short‑term operational needs with long‑term system health.
- Drive execution of strategic SRE programs that measurably improve system resilience, scalability, and operational efficiency.
Qualifications & Experience
- Bachelor’s degree in Computer Science, Engineering, or a related discipline, or equivalent practical experience demonstrating advanced technical and leadership capabilities.
- 15+ years of progressive experience in systems engineering with a strong emphasis on site reliability, large‑scale systems operations, and software engineering in complex enterprise or cloud environments.
- 7+ years of experience in a technical leadership role (Team Lead or Hands‑on Technical Manager), with a proven track record of driving cross‑functional initiatives and delivering complex projects to successful completion.
- Strong proficiency in one or more modern programming languages such as Python, Go, Java, Ruby, or equivalent, with a software‑engineering mindset applied to operational challenges.
- Demonstrated experience operating and supporting systems across hybrid environments, including both on‑premises infrastructure and public/private cloud platforms.
- Hands‑on experience with containerization and container orchestration technologies, enabling scalable, resilient, and repeatable deployments.
- Proven ability to design and implement observability solutions, including metrics, logs, traces, dashboards, and alerts that provide actionable insights into system health and performance.
- Deep understanding of distributed systems, networking fundamentals, failure modes, and modern software architectures, with the ability to reason about complex system behaviors under load or failure conditions.
- Exceptional problem‑solving skills with the ability to diagnose, mitigate, and permanently resolve complex, high‑impact technical issues.
- Strong customer and stakeholder orientation, with excellent communication skills and the ability to articulate complex reliability strategies clearly and persuasively to both technical and non‑technical audiences.
- Prior experience designing and delivering Infrastructure as Code (IaC) through automated CI/CD pipelines, ensuring consistency, scalability, and reliability of infrastructure changes.
- Demonstrated success in mentoring, coaching, and developing high‑performing technical teams, fostering a culture of engineering excellence, ownership, and continuous improvement.
- Hands‑on expertise in implementing automated remediation and corrective actions driven by observability signals and reliability metrics.
- Practical experience working within Agile and DevOps environments, collaborating closely with product and engineering teams to balance reliability, velocity, and innovation.
Working with Us:
As a Northern Trust partner, greater achievements await. You will be part of a flexible and collaborative work culture in an organization where financial strength and stability are assets that embolden us to explore new ideas.
Movement within the organization is encouraged, senior leaders are accessible, and you can take pride in working for a company committed to assisting the communities we serve! Join a workplace with a greater purpose.
We’d love to learn more about how your interests and experience could be a fit with one of the world’s most admired and sustainable companies! Build your career with us and apply today. #MadeForGreater
Reasonable accommodation
Northern Trust is committed to working with and providing reasonable accommodations to individuals with disabilities. If you need a reasonable accommodation for any part of the employment process, please email our HR Service Center at MyHRHelp@ntrs.com.
We hope you’re excited about the role and the opportunity to work with us. We value an inclusive workplace and understand flexibility means different things to different people.
Apply today and talk to us about your flexible working requirements and together we can achieve greater.
About Our Pune Office
The Northern Trust Pune office, established in 2016, is now home to over 3,000 employees. The office handles various functions, including Operations for Asset Servicing and Wealth Management, as well as delivering critical technology solutions that support business operations across the globe.
Our Pune team takes our commitment to service to heart. In 2024, they volunteered more than 10,000 hours in the communities where they live and work.