AI Engineer - Responsible AI
Centific
Compensation
$110K - $120K
Job Description
About Centific
Centific is a frontier AI data foundry that curates diverse, high-quality data, using our purpose-built technology platforms to empower the Magnificent Seven and our enterprise clients with safe, scalable AI deployment. Our team includes more than 150 PhDs and data scientists, along with more than 4,000 AI practitioners and engineers. We harness the power of an integrated solution ecosystem—comprising industry-leading partnerships and 1.8 million vertical domain experts in more than 230 markets—to create contextual, multilingual, pre-trained datasets; fine-tuned, industry-specific LLMs; and RAG pipelines supported by vector databases. Our zero-distance innovation™ solutions for GenAI can reduce GenAI costs by up to 80% and bring solutions to market 50% faster.
Our mission is to bridge the gap between AI creators and industry leaders by bringing best practices in GenAI to unicorn innovators and enterprise customers. We aim to help these organizations unlock significant business value by deploying GenAI at scale, ensuring they stay at the forefront of technological advancement and maintain a competitive edge in their respective markets.
About the Role
Key Responsibilities
- Research Agenda & Experimentation: Define and execute a rigorous research agenda focused on LLM evaluation and post-training, with emphasis on evaluation-driven model improvement. Design experiments to study how evaluation methodologies impact fine-tuning and post-training outcomes.
- Evaluation Framework Development: Develop and validate comprehensive evaluation frameworks for LLM and multimodal systems, covering benchmark and task design, scoring methods, judge/model-assisted evaluation, human evaluation protocols, and robustness/stress testing.
- Advanced Evaluation Research: Lead research on frontier evaluation domains including long-context, cross-modal, and dynamic multi-turn evaluations. Study effectiveness and limitations of existing techniques and propose improved methodologies with clear validity and scalability tradeoffs.
- Linguistic Analysis & Language Engineering: Apply expertise in linguistics—including syntax, semantics, pragmatics, and discourse structure—to design linguistically grounded evaluation tasks that reveal authentic model capabilities and failure modes across diverse language phenomena and typological features.
- Language Data Pipeline Design: Architect and oversee language data collection, annotation schema design, and quality control workflows. Define linguistic annotation guidelines, inter-annotator agreement criteria, and adjudication procedures to ensure high-quality training and evaluation corpora.
- Cross-Lingual & Multilingual Evaluation: Design and lead evaluation initiatives for multilingual and cross-lingual model capabilities, including transfer learning assessments, low-resource language benchmarking, and culturally sensitive task design across diverse language families.
- Corpus & Benchmark Curation: Build and maintain linguistically diverse benchmark datasets that reflect real-world language use, covering registers, dialects, code-switching, domain-specific terminology, and pragmatic variation. Ensure benchmarks are free from systematic linguistic bias.
- Model Behavior Analysis: Analyze model behavior and failure patterns at the linguistic and semantic level; generate actionable recommendations for model improvement and evaluation redesign. Translate findings into practical improvements for customer solutions and Centific’s internal platforms.
- Cross-Functional Collaboration: Partner with Language Data Scientists to integrate human-in-the-loop and synthetic data/evaluation strategies, and with AI/ML Research Engineers to translate research methods into scalable evaluation and post-training pipelines.
- Customer Engagement: Engage with customer technical stakeholders at leading AI organizations to understand evaluation goals, review methodologies, and provide expert scientific recommendations. Serve as a credible technical peer to research and engineering leaders.
- Knowledge & IP Creation: Contribute to internal benchmark datasets, reusable evaluation frameworks, and research assets. Produce high-quality technical documentation, internal research reports, and client-facing materials explaining methods, results, assumptions, and limitations.
- Thought Leadership: Contribute to Centific’s position as a leader in LLM evaluation and post-training through publications, conference presentations, and open-source contributions in evaluation science, linguistics, and language technology.
Core Technical Competencies
You will provide technical depth and leadership across the following domains:
Evaluation Science & Benchmarking
- Expert-level benchmark dataset and test suite design for language and multimodal models
- Deep understanding of metric design, scoring reliability, and measurement validity
- Experience with human evaluation methods and quality assurance (rubric design, inter-rater reliability, adjudication frameworks)
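To make the rubric-scoring and inter-rater reliability expectations above concrete, here is a minimal sketch of a chance-corrected agreement check (Cohen's kappa) between two annotators scoring the same items against a shared rubric. The labels, scores, and variable names are illustrative assumptions, not part of any Centific pipeline.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if both annotators labeled at random
    # according to their own marginal label distributions.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

# Illustrative rubric scores from two hypothetical annotators.
annotator_1 = ["pass", "pass", "fail", "pass", "borderline", "fail"]
annotator_2 = ["pass", "fail", "fail", "pass", "borderline", "fail"]
print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")
```

Values near 1 indicate strong agreement beyond chance; values near 0 suggest the rubric or adjudication guidelines need revision before the annotations are used for training or evaluation.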
Linguistics & Language Engineering
- Formal training or equivalent expertise in linguistics, including syntax, morphology, semantics, pragmatics, and discourse analysis
- Hands-on experience in computational linguistics, NLP pipeline development, or corpus linguistics for large-scale language data
- Ability to design linguistically motivated evaluation tasks and annotation schemes that capture authentic language variation
- Familiarity with linguistic annotation standards (e.g., Universal Dependencies, PropBank, AMR) and corpus management tools
- Experience with language typology and cross-lingual phenomena relevant to multilingual model evaluation
LLM & Post-Training Methods
- Strong understanding of post-training techniques (SFT, DPO, preference optimization) and how training objectives interact with evaluation outcomes (a minimal objective sketch follows this list)
- Ability to reason about model behavior, failure modes, and performance tradeoffs across tasks and domains
- Familiarity with alignment, safety, and robustness considerations in model evaluation
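For reference on the preference-optimization item above, this is a minimal sketch of the DPO objective in PyTorch. It assumes per-sequence log-probabilities for chosen and rejected responses have already been computed under the policy and a frozen reference model; the tensor values and beta setting are placeholders, not a prescribed configuration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss from per-sequence log-probs.

    Each argument is a 1-D tensor of summed log-probabilities,
    one entry per (prompt, response) pair in the batch.
    """
    # Log-ratios of policy vs. reference for chosen and rejected responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the chosen log-ratio above the rejected one.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Illustrative batch of four preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1, -7.3]),
                torch.tensor([-13.4, -9.0, -22.0, -8.8]),
                torch.tensor([-12.5, -9.8, -20.0, -7.9]),
                torch.tensor([-13.0, -9.2, -21.5, -8.5]))
print(loss.item())
```

In practice this loss sits inside a full fine-tuning loop; the point here is only that the training objective is tied directly to pairwise preference judgments, which is where evaluation design feeds back into post-training outcomes.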
Quantitative Analysis & Scientific Rigor
- Strong statistical analysis skills: sampling, uncertainty quantification, significance testing, error analysis, metric interpretation (a minimal bootstrap comparison sketch follows this list)
- Ability to synthesize complex experimental findings into concise, actionable recommendations for engineering and business stakeholders
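As one illustration of the uncertainty-quantification and significance-testing skills listed above, here is a minimal paired-bootstrap sketch for checking whether one model's advantage over another on a shared evaluation set is robust to resampling. The per-example scores and resample count are illustrative assumptions.

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often model A's advantage over B disappears under resampling.

    scores_a / scores_b: per-example scores for the two models on the
    same evaluation items (higher is better).
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    # Resample evaluation items with replacement and recompute the mean gap.
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled_means = diffs[idx].mean(axis=1)
    # Fraction of resamples in which the observed advantage vanishes.
    return float((resampled_means <= 0).mean())

# Illustrative per-example accuracies (1 = correct, 0 = incorrect).
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0]
print(paired_bootstrap_pvalue(model_a, model_b))
```

A small returned value suggests the observed advantage rarely disappears under resampling; reporting it alongside the point estimate is one way to keep benchmark comparisons honest for engineering and business stakeholders.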
Required Qualifications
- Education: MS or PhD in Computer Science, Machine Learning, Computational Linguistics, Linguistics, Statistics, Applied Mathematics, AI, or a related quantitative field (PhD strongly preferred).
- Research Experience: 5+ years of relevant experience in applied ML research or research science, with substantial work in LLMs or foundation models (graduate research counts).
- LLM Evaluation Expertise: Demonstrated experience with LLM evaluation, benchmarking, alignment, post-training, or model quality research.
- Linguistic Expertise: Formal background in linguistics or language engineering, with practical experience applying linguistic frameworks to NLP or LLM evaluation tasks.
- Experimental Design: Strong foundation in experimental design, statistical analysis, and scientific reasoning for ML systems.
- Technical Proficiency: Strong Python coding skills for research experimentation, data processing, evaluation pipelines, statistical analysis, and visualization. Hands-on experience with modern ML frameworks (PyTorch, Hugging Face, JAX/TensorFlow).
- Evaluation Methodology: Ability to evaluate and compare human and automated evaluation methods, including tradeoffs in cost, reliability, validity, and scalability. Experience designing reproducible evaluation studies across datasets and model versions.
- Communication: Strong written and verbal communication skills; able to present nuanced technical conclusions, assumptions, and limitations clearly to both research and non-technical audiences.
Preferred Qualifications
- Post-Training Practice: Hands-on experience running fine-tuning or post-training experiments (SFT, preference optimization workflows).
- Multimodal & Long-Context: Experience with multimodal evaluation (text-image, audio, video) and long-context benchmarking in real-world settings.
- Agentic Evaluation: Experience designing multi-turn, interactive, or agentic evaluation protocols.
- Multilingual & Low-Resource NLP: Experience building evaluation datasets or benchmarks for multilingual, low-resource, or under-represented language settings.
- Corpus Linguistics & Annotation: Experience leading large-scale annotation projects, defining linguistic guidelines, and applying inter-annotator agreement methodologies (Cohen’s kappa, Krippendorff’s alpha).
- Language Generation Evaluation: Familiarity with reference-based and reference-free evaluation of text generation quality, including coherence, fluency, factuality, and stylistic appropriateness.
- Scientific Contribution: Publications and/or open-source benchmark contributions in LLM evaluation, post-training, alignment, computational linguistics, or related areas at top venues (NeurIPS, ICML, ICLR, ACL, EMNLP, NAACL, etc.).
- Applied Research Consulting: Experience in customer-facing applied research, technical consulting, or cross-functional product/research collaboration.
- Safety & Governance: Familiarity with safety, trustworthiness, and governance considerations in GenAI evaluation.
How to Apply
Please send your CV, a summary of key research contributions (publications, benchmarks, or open-source work), and a brief statement on your evaluation or post-training philosophy to:
diana.moeck@centific.com
Subject Line: Research Scientist – LLM Evaluation & Post-Training
Salary: $110K - $120K
Centific is an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, citizenship status, age, mental or physical disability, medical condition, sex (including pregnancy), gender identity or expression, sexual orientation, marital status, familial status, veteran status, or any other characteristic protected by applicable law. We consider qualified applicants regardless of criminal histories, consistent with legal requirements.