Senior Site Reliability Engineer
DriveWealth
Compensation
$150,000 - $170,000/year
Job Description
About Us
DriveWealth is on a mission to make investing easier. We believe that everyone should have the ability to control their financial future, and that access to financial markets should not be limited by geography, wealth, or legacy systems. We are a global B2B financial technology organization dedicated to democratizing access to financial independence around the world. Our mission is realized through an API-based platform, empowering our partners to offer seamless investing and trading experiences to clients worldwide, all from their mobile devices. Our technology provides partners with a modern, extensible toolkit, enabling traditional investment workflows and innovative techniques like fractional share ownership. DriveWealth has evolved into a global platform offering trading of US equities, mutual funds, ETFs, fixed income, and options.
There’s never been a better time to build a category-defining business and there has rarely been a team better positioned for this opportunity. Our culture blends the pace and agility of a fintech start-up with the impact, stability, and discipline of Wall Street. We encourage creativity and experimentation while ensuring institutional-grade execution and regulatory compliance in everything we do. Join us and help build the future of global investing!
About The Role
As a Senior Site Reliability Engineer, you won’t just be "keeping the lights on." You will be an engineering force responsible for the architecture, scalability, and self-healing capabilities of our Brokerage-as-a-Service platform.
This role is centered on reducing toil through engineering. You will design and develop internal SRE platforms, automate complex workflows, and ensure our Kubernetes-based ecosystem can handle the demands of global financial markets. While this role includes critical on-call responsibilities to support our 24/7 global operations, your primary mission is to build and modernize systems that make manual intervention obsolete.
What You’ll Do
- Engineering & Automation: Design and develop internal tools and SRE platforms to eliminate repetitive tasks (toil) and improve developer velocity.
- Infrastructure as Code: Architect and maintain modular, reusable IaC using Terraform and manage GitOps workflows via ArgoCD.
- Observability & Reliability: Implement OpenTelemetry standards and the Grafana stack (Alloy, Loki, Tempo, Mimir) to provide deep insights into system health. Define and manage SLIs, SLOs, and Error Budgets.
- Platform Governance: Review software architecture and Kubernetes metrics to ensure high availability, sound capacity planning, and cost optimization across AWS regions.
- Incident Engineering: Lead incident response, perform complex root-cause analysis (RCA), and champion a blameless post-mortem culture.
- Collaboration: Partner with engineering teams to foster the adoption of new tools, security standards, and reliability best practices.
What You'll Need
- Linux & Networking Mastery: Proficient in Linux administration with a deep understanding of the TCP/IP stack, OSI model, DNS, and network troubleshooting.
- FinTech Background: Experience working in highly regulated financial environments or with FIX/API connectivity.
- Production Kubernetes: Hands-on experience managing production-grade clusters, including RBAC, autoscaling, Helm, and multi-cluster patterns.
- Cloud Native Expertise (AWS): Strong grasp of AWS core services, security, and high-availability patterns. Proficiency with boto3 and AWS CLI for automation.
- Modern CI/CD & GitOps: Experience building secure, automated delivery pipelines and operating GitOps workflows (ArgoCD).
- Code Proficiency: Strong scripting and development skills in Python or Golang, along with Bash and Ansible.
- Security Mindset: Experience with secrets management, vulnerability scanning, and securing the software supply chain.
- AI & Prompt Engineering: Familiarity with using LLMs, public MCP servers, or Bedrock AgentCore to enhance SRE workflows.
- Data & Middleware: Experience managing Kafka, MQ, SQS, or orchestration tools like Airflow and Rundeck.
Applicants must be authorized to work for any employer in the U.S. DriveWealth is unable to sponsor or take over sponsorship of an employment visa at this time.