Senior Site Reliability Engineer I - Observability
Careem
Job Description
Careem is building the Everything App for the greater Middle East — making it easy to move around, order food and groceries, manage payments, and more. Our purpose is simple: to simplify and improve people’s lives and build an awesome organisation that inspires.
Since 2012, Careem has enabled earnings for over 2.5 million Captains, simplified the lives of more than 70 million customers, and built a platform where the region’s best talent and entrepreneurs thrive. We operate in 70+ cities across 10 countries, from Morocco to Pakistan.
We’re now entering our next chapter — one powered by AI. We’re looking for AI talent: curious problem-solvers who know how to apply AI to build tools, automate workflows, and create real impact. Whether it’s streamlining operations, enhancing customer experience, or reimagining internal systems — we want people who can make Careem work smarter and move faster.
We are looking for someone passionate about automation, tooling, and observability frameworks to join our Infrastructure Observability team. You will be part of the team mandated to build and manage a world-class observability ecosystem—leveraging both in-house tools and enterprise solutions. Your goal is to enable all projects across Careem to improve visibility, gain deep insights into system events, and ensure teams can define robust alerts to be notified instantly in case of incidents.
What you'll do
- Develop and maintain our distributed monitoring ecosystem, integrating enterprise solutions to meet challenging functional, scalability, and reliability requirements.
- Design and architect observability solutions with a focus on automation, testability, and long-term maintainability.
- Coach and mentor colleagues on an energetic, growing team to raise the bar for engineering excellence.
- Facilitate collaboration with other engineers and product owners to solve complex observability challenges across the Careem platform.
- Build and ship new features and monitoring-as-code configurations with an emphasis on code quality, readability, and testing.
- Maintain and extend a variety of systems, including open-source (Prometheus, OpenTelemetry, ClickHouse), ready-made, and in-house applications.
- Maintain a high standard for code quality and know what it means to ship reliable, production-ready configurations.
What you'll need
- 5+ years of experience with monitoring systems such as Prometheus, New Relic, or Dynatrace.
- Strong experience in developing and debugging in one of these languages: Go, Python, Java, or Bash.
- Experience with Kubernetes and with monitoring containerized microservices at scale.
- Experience with cloud infrastructure (AWS preferred).
- Experience with infrastructure automation (Terraform).
- Experience designing, architecting, operating, and troubleshooting highly available, distributed systems at scale.
- Experience in building and owning tools for medium to large engineering teams.
- Experience building systems, dashboards, and metrics that support a data-driven approach to problem resolution.
- Strong Unix/Linux background, including network stack performance and scripting.
- An obsession with keeping costs low while building solutions.
- Experience in multi-tiered distributed systems.
- Proficient in configuring and optimizing the &