Mid-Level

AI Platform Engineer

Guardian Life

Chennai
On-site
Posted March 31, 2026

Job Description

AI Platform Engineer (3–5 years)

Role Summary

We are looking for an AI Platform Engineer to build and scale the core platform that supports traditional ML models as well as modern LLM and generative AI workloads. This role focuses on production-grade MLOps, model lifecycle management, platform reliability, security, and self-service enablement for data scientists and engineering teams.

Key Responsibilities

1. Platform Engineering

· Design and operate a scalable AI/ML platform using Kubernetes, containers, and infra-as-code.

· Build reusable frameworks for model training, fine-tuning, batch and real-time inference, and RAG pipelines.

· Implement multi-tenant isolation, quotas, and cost-tracking.

2. MLOps & Model Lifecycle Automation

· Develop CI/CD/CT pipelines for models, prompts, and data.

· Manage model registry, feature store, lineage, and experiment tracking.

· Ensure reliable production rollout using blue-green, canary, and shadow deployments.

3. Data & Pipelines

· Build scalable data and model pipelines using orchestrators like Airflow, Prefect, Dagster, or Argo.

· Implement data validation and schema enforcement.

· Optimize storage, caching, indexes, embeddings, and vector search workflows.

4. Observability & Reliability

· Set up monitoring for data drift, model drift, prompt performance, latency, accuracy, and cost.

· Define SLOs, SLIs, and incident response patterns.

· Implement logging, tracing, and metrics using Prometheus, Grafana, OpenTelemetry, or similar tools.

5. Security & Governance

· Enforce secrets management, IAM controls, network security, and auditability.

· Implement model governance, model cards, prompt controls, and risk guardrails.

· Work with the security team to ensure PII protection and compliance adherence.

6. Performance & Cost Optimization

· Optimize compute, autoscaling, GPU usage, caching, and batching.

· Track cost per model, per workload, and per team for transparency.

· Implement model optimization (quantization, distillation, caching).

7. Enablement & Developer Experience

· Create templates, SDKs, CLI tools, documentation, and best practices.

· Help data scientists and developers move models to production quickly.

· Partner with architecture, cybersecurity, and product teams.

Must-Have Skills

· 3–5 years of experience, with 2+ years in ML platform/MLOps.

· Strong Python development skills.

· Experience with Kubernetes, Docker, Helm.

· Infra-as-code: Terraform, Pulumi, CloudFormation or similar.

· CI/CD systems like GitHub Actions, GitLab CI, Azure DevOps, Jenkins.

· Experience with one or more ML platforms: MLflow, Kubeflow, Azure ML, Vertex AI, SageMaker, Ray, BentoML, W&B.

· Strong understanding of model lifecycle, deployment patterns, and monitoring.

· Experience with vector databases, feature stores, artifact registries.

· Familiarity with observability stacks (Prometheus, Grafana, Loki, OpenTelemetry).

· Strong understanding of security for data and ML workloads.

Good-to-Have Skills

· Experience with LLM serving frameworks (vLLM, Triton, Ray Serve) and hosted LLM APIs (OpenAI, Anthropic).

· Experience building RAG systems with vector DBs: FAISS, Milvus, Pinecone.

· Understanding of data engineering tools like Spark, Flink, Kafka.

· GPU optimization (CUDA, TensorRT, ONNX).

· Background in cost governance (FinOps for AI).

· Experience building internal SDKs, CLIs, or developer tools.

· Knowledge of privacy frameworks and governance models.

Education

B.E / B.Tech / M.E / M.Tech in Computer Science or IT, or equivalent hands-on experience.

Location:

This position can be based in any of the following locations:

Chennai

Current Guardian Colleagues: Please apply through the internal Jobs Hub in Workday
