Senior MLOps Engineer

Thoughtworks

Singapore, Singapore

On-site

Posted April 3, 2026

Job Description

Due to project requirements, candidates must be Singaporean citizens or already hold Singaporean Permanent Residency (PR) at the time of application.

As an MLOps Engineer in the DAMO service line, you will be responsible for ensuring the reliability, safety, performance and continuous improvement of large-scale machine learning and AI systems in production, including both generative AI and traditional ML systems such as computer vision and recommendation models. You will work across the full software delivery lifecycle, contributing to design, implementation, deployment and ongoing operational excellence.

You will champion engineering best practices, including clean and maintainable code, test-driven development, continuous delivery, strong observability and collaborative development through pairing and code reviews. You will stay hands-on, actively contributing to codebases and applying modern practices from the Thoughtworks Technology Radar.

You will design pragmatic solutions that balance technical constraints, cost efficiency, performance and system safety. Working closely with developers, data scientists, platform engineers and product teams, you will help deliver production-ready AI capabilities that meet business needs and uphold a high bar for quality.

You will also play an active role in fostering a collaborative, inclusive team culture, encouraging feedback and supporting the growth of team members.

Job responsibilities

You will design, implement and maintain monitoring and alerting for ML and AI operational signals, including model performance degradation (across all model types, e.g., computer vision, recommendation, GenAI), data drift, latency issues and anomalies. For GenAI systems, this includes monitoring prompt failures, hallucination trends, guardrail violations and overall agent workflow health.
You will build and operate robust evaluation and testing pipelines for all ML and AI systems, including automated regression tests for models (e.g., accuracy, precision, recall for traditional ML), prompts, workflows, tools and model versions, ensuring new releases meet or exceed established baselines.
You will investigate and resolve production issues related to model behaviour. This includes troubleshooting ML models (e.g., deep learning models for computer vision, collaborative filtering for recommendation), tool-calling errors, vector search/RAG retrieval failures (for GenAI), data quality issues and faults at integration points across the system.
You will collaborate with infrastructure and platform teams to ensure stable, performant and cost-efficient AI inference, including optimisation of deployment strategies, resource usage and runtime configurations.
You will manage the lifecycle of ML models, prompts, embeddings, vector indices and associated components, including controlled rollouts, versioning strategies, and automated evaluation gates.
You will design and operate effective feedback loops that incorporate real user interactions, evaluation metrics, UAT findings and domain expert reviews, enabling continuous improvement of all ML/AI systems, including agentic systems.
You will uphold governance, safety and compliance standards, ensuring observability, auditability, privacy protection and adherence to organisational guidelines for all ML/AI systems and data handling.
You will maintain clear, comprehensive documentation covering operational procedures, system behaviours, incident findings, performance benchmarks and deployment practices.
You will communicate system health, risks, upcoming changes and operational insights clearly to technical and non-technical audiences.
You will support the growth and development of junior team members through guidance, knowledge sharing and constructive feedback.