Lead Software Engineer, Devops Platform (Bangkok based, relocation provided)
Agoda
Job Description
About Agoda
At Agoda, we bridge the world through travel. Our story began in 2005, when two lifelong friends and entrepreneurs, driven by their passion for travel, launched Agoda to make it easier for everyone to explore the world.
Today, we are part of Booking Holdings [NASDAQ: BKNG], with a diverse team of over 7,000 people from 90 countries, working together in offices around the globe. Every day, we connect people to destinations and experiences, with great deals across millions of hotels and holiday properties, flights, and experiences worldwide.
No two days are the same at Agoda. Data and technology are at the heart of our culture, fueling our curiosity and innovation. If you’re ready to begin your best journey and help build travel for the world, join us.
In this role, you'll get to:
· Lead the technical vision, architecture, and execution of new SRE platforms or reliability initiatives.
· Define and promote SRE best practices across Agoda’s services, e.g., SLI/SLO-driven engineering, error budgets, and other data-driven reliability factors.
· Design, build, and operate reliability platforms, including load shedding, business signals monitoring, and safe-deployment automation, to reduce blast radius while preserving developer velocity.
· Own safe deployment strategies such as canary releases, automated rollback, and business-impact protection integrated with deployment & monitoring.
· Proactively identify and mitigate reliability and scaling risks across Agoda’s services.
· Improve system resilience and multi-cluster readiness by partnering with the platform and operations teams.
· Lead major incident response and operational excellence, driving fast detection, mitigation, root cause analysis, postmortems, and learnings focused on business impact.
· Maintain and evolve incident, observability, alerting, and on-call tooling, improving signal quality, alert enrichment, grouping, and reducing time-to-clue and time-to-mitigation for NOC and on-call engineers.
· Advance platform observability and reliability signals using Prometheus and Grafana, balancing actionability, scale, and cost efficiency.
· Define reliability roadmaps and OKRs, translating ambiguous business reliability goals into clear technical requirements.
What You’ll Need to Succeed:
· 8+ years of relevant experience.
· Demonstrated ownership of architecting, building, and operating mission-critical production systems, making long-term technical and reliability trade-off decisions.
· Proven ability to lead and coordinate complex cross-team initiatives, setting technical direction and aligning stakeholders to deliver outcomes at organizational scale.
· Expertise in one or more programming languages (e.g., Go, Python, Rust, Java) with a solid understanding of distributed systems fundamentals (concurrency, backpressure, timeouts/retries, idempotency, circuit breaking).
· Deep hands-on experience with the Kubernetes ecosystem, service mesh technologies (e.g., Istio), and Kubernetes deployment workflows (e.g., Argo CD).
· Observability & monitoring expertise, using Prometheus, Grafana, and common logging/telemetry stacks (e.g., OpenTelemetry), with an understanding of signal quality, scalability, and cost trade-offs.
· Strong command of the incident management lifecycle, with a focus on improving alert quality, alert management, incident response, RCA, and postmortems.
· Experience with reliability engineering patterns such as canary deployments, automated rollback, capacity/right-sizing automation, and production operations.
· Solid data analysis skills, including SQL (e.g., PostgreSQL, MSSQL) and data pipelines.
· Data-driven mindset, able to perform deep research, analyze complex problems, and make informed technical decisions.
· Excellent communication and collaboration skills, able to explain complex technical concepts clearly to stakeholders at all levels, and to operate effectively both as a self-directed individual contributor and as part of a team.
· Curiosity and continuous learning, staying current with industry trends, open-source advancements, and emerging reliability practices.
Nice-to-Have: