Senior Site Reliability Engineer
Pylon
Compensation
$140,000/yr
Job Description
About Pylon
America’s $13T mortgage market is one of the most important financial systems in the world. It underwrites the middle class and is the mechanism through which millions of families build wealth. But while every other financial instrument has been simplified to an API call, mortgages are still assembled by hand.
We started from zero and created the first vertically integrated mortgage platform that turns origination into a single API.
Publicly traded companies and the country’s largest originators are already building on Pylon. Revenue is compounding monthly. We’re backed by Peter Thiel, Conversion Capital, QED, Citi, Fifth Wall, and the founders of Ramp, Blend, and Mercury.
Working at Pylon isn’t for those seeking comfort. The people who thrive here have high agency, strong opinions, and a track record of delivering outcomes without direction. Many of us are former founders. We move quickly, challenge each other directly, and take full ownership of results. It’s hard work, but it will be worth it.
Join us in building America’s mortgage rails.
About the Job
You'll own reliability and operational excellence for Pylon's production systems. This means designing and implementing monitoring, alerting, and incident response processes that scale as we grow. You'll build tooling that makes the entire engineering team more effective, establish on-call rotations and runbooks, and ensure our platform can handle the demands of a regulated, high-stakes financial product.
This is not a pure ops role. At Pylon, we believe SRE work should be a maximum of 50% operational toil. If you're spending more than half your time firefighting and keeping things running, you're not doing SRE work, you're doing sysadmin work. The other 50%+ of your time should be spent writing code: building infrastructure tooling, automating away operational burden, making reliability improvements to core services, and creating internal developer productivity tools that make the entire team more effective.
SRE is about making things better, not just keeping them alive.
We're looking for someone who has operated production systems at scale in a professional engineering environment. You know what good looks like because you've built it before.
What We're Looking For
Must-haves:
4+ years of experience in SRE, infrastructure, or platform engineering roles
Experience working on a team of SREs at a company with mature SRE practices (not solo SRE roles)
Real on-call experience at scale in a large production environment (you've carried the pager and lived through incidents)
Deep AWS expertise (ECS, RDS, networking, security)
Strong experience with declarative infrastructure (Terraform, CDK, or similar)
Nix experience (we use it and want to expand its adoption)
Track record of building reliability tooling and automation
Can design and implement monitoring, alerting, and observability systems from first principles
Comfortable working in a regulated environment where "breaking things" is not an option
Nice-to-haves:
Experience at companies with strong SRE cultures (Google, Replit, Stripe, etc.)
Background in fintech, healthtech, or other regulated domains
Experience migrating monitoring systems or implementing SLOs
Contributions to infrastructure tooling or open source projects
Our technology stack:
We don't require that you've worked with any of these technologies before; this is just our stack, for your information:
Infrastructure: AWS (ECS, RDS, CloudFront, Lambda), CDK for infrastructure-as-code
Observability: Honeycomb, OpenTelemetry
CI/CD: GitHub Actions, Nix for builds and dev environments
Core platform: TypeScript/Node backend, PostgreSQL, React frontend
Languages: TypeScript, Python, Nix, SQL
About you
You:
Have operated production systems at scale. You've been on-call for a large, complex system. You know what 3am pages feel like and you've built systems to prevent them. You understand the difference between alerts that matter and noise.
Write code, not just YAML. You can build internal tools, automation, and reliability improvements. You're comfortable contributing to the core product when reliability requires it. You can read and understand the codebase you're responsible for keeping up.
Think in systems. You understand distributed systems, failure modes, cascading failures, and graceful degradation. You can diagnose production issues quickly and know when to escalate vs. when to fix.
Know your tools deeply. You've used observability platforms at scale and understand how to instrument systems properly. You can design alerting that has high signal and low noise. You know AWS inside and out.
Have strong opinions that you're willing to defend. We have a culture of vigorous discussion and debate on technical decisions. We'll push you to defend your choices, and we want you to push back.
Don't settle. Challenge yourself to frequently and consistently deliver exceptional work. If something could be more reliable, take the initiative to improve it.
Have great ideas, and lots of them. You should see opportunities all around you to make the infrastructure, tooling and processes better. We'll give you an environment where you can act on those ideas.
Are self-motivated. You can take a goal and drive towards it without needing extensive hand-holding. The team is supportive and loves to share knowledge and advice, but there's no time for micromanaging your work.
Are comfortable with ambiguity. There are a million ways to do things; you should feel at ease making a decision under uncertainty while balancing competing constraints.
Are confident you can learn quickly. Mortgage is complex, our platform is complex, good SRE work is complex. You've got to have an attitude that you can absorb it, get on top of it, and build something better than what came before.
Care about developer experience. Your work enables the entire team to ship faster and more confidently. You think about how to make the whole organization more effective.
Who Will Succeed Here
Someone who:
Is deeply curious
Wants to own features from design to development to deployment to maintenance
Is willing to put the work in to solve the hardest of problems
Location: Palo Alto, CA
Base Salary Range: $140,000/yr to $220,000/yr + Equity + Benefits