AI/ML Research Scientist, LLM Post-Training & Evaluation

Centific

Level: Mid-Level
Compensation: $150K - $180K
Location: Redmond, Washington (On-site)
Posted: April 14, 2026

Job Description

About Centific

Centific is a frontier AI data foundry that curates diverse, high-quality data, using our purpose-built technology platforms to empower the Magnificent Seven and our enterprise clients with safe, scalable AI deployment. Our team includes more than 150 PhDs and data scientists, along with more than 4,000 AI practitioners and engineers. We harness the power of an integrated solution ecosystem, comprising industry-leading partnerships and 1.8 million vertical domain experts in more than 230 markets, to create contextual, multilingual, pre-trained datasets; fine-tuned, industry-specific LLMs; and RAG pipelines supported by vector databases. Our zero-distance innovation™ solutions can reduce GenAI costs by up to 80% and bring solutions to market 50% faster.

Our mission is to bridge the gap between AI creators and industry leaders by bringing best practices in GenAI to unicorn innovators and enterprise customers. We aim to help these organizations unlock significant business value by deploying GenAI at scale, helping to ensure they stay at the forefront of technological advancement and maintain a competitive edge in their respective markets.

About the Role

Key Responsibilities

  • Evaluation Framework Development: Design and validate comprehensive evaluation frameworks for LLM and multimodal systems, including benchmark and task design, automated scoring methods, model-assisted evaluation, human annotation protocols, and robustness testing across document types and modalities.
  • Multimodal Benchmark Research: Lead research into multimodal evaluation covering document understanding, table QA, image-text reasoning, and OCR-grounded extraction tasks. Develop benchmarks that measure model performance on structured and unstructured data sources representative of real enterprise workloads.
  • Fine-Tuning & Post-Training Experiments: Design and execute supervised fine-tuning (SFT) and preference optimization experiments to improve model performance on targeted tasks. Analyze how training objectives, dataset composition, and evaluation design interact to drive measurable model improvement.
  • RAG & Agentic System Evaluation: Develop evaluation protocols for retrieval-augmented generation (RAG) systems and agentic LLM pipelines, assessing retrieval quality, answer relevance, citation grounding, and multi-step reasoning fidelity across production-grade workflows.
  • Data Pipeline & Annotation Design: Architect data collection, annotation schema design, and quality control workflows for training and evaluation corpora. Define annotation guidelines, inter-rater agreement criteria, and adjudication procedures; build tooling to support annotator interfaces and real-time metric monitoring.
  • Model Behavior Analysis: Analyze model failure patterns across tasks and domains; generate actionable recommendations for evaluation redesign and fine-tuning strategy. Translate findings into practical improvements for customer solutions and Centific’s internal platforms.
  • Cloud-Native Evaluation Infrastructure: Collaborate with ML engineers to build scalable, containerized evaluation and fine-tuning pipelines on cloud platforms (AWS, GCP, or Azure). Integrate monitoring, logging, and experiment tracking to support reproducible research workflows.
  • Cross-Functional Collaboration: Partner with Language Data Scientists, ML engineers, and product teams to integrate human-in-the-loop evaluation, synthetic data strategies, and automated benchmarking into platform-level pipelines.
  • Customer Engagement: Engage with technical stakeholders at leading AI organizations to understand evaluation goals, review methodologies, and provide expert scientific recommendations. Serve as a credible technical peer to research and engineering leaders.
  • Knowledge & IP Creation: Contribute to internal benchmark datasets, reusable evaluation frameworks, and research assets. Produce technical documentation, research reports, and client-facing materials explaining methods, results, assumptions, and limitations.
  • Thought Leadership: Advance Centific’s position in LLM evaluation and multimodal AI through publications, conference presentations, and open-source benchmark contributions.

Core Technical Competencies

You will provide technical depth and leadership across the following domains:

Evaluation Science & Benchmarking

  • Expert-level benchmark dataset and test suite design for language and multimodal models
  • Deep understanding of metric design, scoring reliability, and measurement validity
  • Experience with human evaluation and quality assurance (rubric design, inter-rater reliability, adjudication)
  • Familiarity with precision-recall analysis, threshold tuning, and annotation-driven quality loops

Multimodal & Document AI

  • Experience with multimodal model evaluation across text, image, table, and document modalities
  • Familiarity with document understanding tasks: classification, extraction, structured QA, and OCR-based pipelines
  • Hands-on experience with vision-language models (VLMs), CLIP-style architectures, or transformer-based multimodal systems

LLM Systems & Post-Training

  • Strong understanding of post-training techniques (SFT, DPO, preference optimization) and how they interact with evaluation outcomes
  • Experience with LLM orchestration, RAG pipeline design, retrieval strategies (hybrid vector + BM25), and guardrail validation
  • Familiarity with agentic frameworks (e.g., LangChain, LangGraph) and multi-step reasoning evaluation

ML Engineering & Infrastructure

  • Strong Python skills for research experimentation, data processing, evaluation pipelines, and statistical analysis
  • Hands-on experience with ML frameworks (PyTorch, TensorFlow, Hugging Face) and cloud platforms (AWS, GCP, or Azure)
  • Comfort with containerized deployment (Docker, Kubernetes), experiment tracking, and CI/CD for research pipelines

Quantitative Analysis & Scientific Rigor

  • Strong statistical analysis skills: sampling, uncertainty quantification, significance testing, error analysis, and metric interpretation
  • Ability to synthesize complex experimental findings into concise, actionable recommendations for engineering and business stakeholders

Required Qualifications

  • Education: MS or PhD in Computer Science, Machine Learning, Data Science, Statistics, Applied Mathematics, AI, or a related quantitative field (PhD or strong MS research track preferred).
  • Research Experience: 3+ years of relevant experience in applied ML research or research science, with substantial work in LLMs, foundation models, or multimodal systems (graduate research counts).
  • LLM Evaluation Expertise: Demonstrated experience with LLM evaluation, benchmarking, post-training, or model quality research.
  • Multimodal Experience: Hands-on work with multimodal models or document AI systems, including tasks such as table QA, image-text reasoning, or OCR-based extraction.
  • Experimental Design: Strong foundation in experimental design, statistical analysis, and scientific reasoning applied to ML systems.
  • Technical Proficiency: Strong Python coding skills; experience with PyTorch, Hugging Face, or similar ML frameworks. Exposure to cloud infrastructure (AWS, GCP, or Azure) is a plus.
  • Communication: Strong written and verbal communication skills; able to present nuanced technical conclusions clearly to both research and non-technical audiences.

Preferred Qualifications

  • Post-Training Practice: Hands-on experience running SFT or preference optimization experiments with measurable evaluation outcomes.
  • RAG & Agentic Systems: Experience building or evaluating RAG pipelines, agentic LLM orchestration layers, or multi-turn interactive systems.
  • Document AI: Experience with document classification, information extraction, or structured QA over enterprise-scale document corpora.
  • Cloud & Deployment: Familiarity with containerized ML deployment (Docker/ECS), experiment logging (CloudWatch, MLflow), and scalable inference infrastructure.
  • Data Annotation Tooling: Experience designing annotation interfaces, real-time monitoring dashboards, or quality control tooling for ML data pipelines.
  • Scientific Contribution: Publications and/or open-source benchmark contributions in LLM evaluation, multimodal AI, post-training, or related areas at top venues (NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, etc.).
  • Applied Research Consulting: Experience in customer-facing applied research, technical consulting, or cross-functional product/research collaboration.
  • Safety & Governance: Familiarity with safety, trustworthiness, and governance considerations in GenAI evaluation.

How to Apply

Please send your CV, a summary of key research contributions (publications, benchmarks, or open-source work), and a brief statement on your evaluation or post-training philosophy to:

diana.moeck@centific.com

Subject Line: Research Scientist – LLM Evaluation & Post-Training

Salary: $150K - $180K

Centific is an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, citizenship status, age, mental or physical disability, medical condition, sex (including pregnancy), gender identity or expression, sexual orientation, marital status, familial status, veteran status, or any other characteristic protected by applicable law. We consider qualified applicants regardless of criminal histories, consistent with legal requirements.
