
Staff Machine Learning Engineer, GenAI Platform

Reddit

Compensation

$253,300 - $354,600/year

Remote - United States
Posted April 1, 2026

Job Description

Reddit is a community of communities. It’s built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 121 million daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit www.redditinc.com.

Who We Are:
The Machine Learning Platform team at Reddit is a high-impact organization that owns the infrastructure powering recommendations, content discovery, and user quantification. As Generative AI becomes a strategic priority for Reddit, we are expanding our platform to meet the unique demands of foundation models. We are building the foundational infrastructure to support massive-scale, long-running LLM workloads, enabling teams across Growth, Ads, Feeds, and Core ML to move fast on shared, robust GenAI infrastructure.

What You’ll Do:
As a Staff Software Engineer on the Machine Learning Platform team, you will be a key technical leader architecting and scaling our Generative AI and LLM platform capabilities. Training and deploying foundation models places unprecedented demands on our systems. You will define the technical strategy and build the core infrastructure that enables machine learning engineers and researchers to seamlessly train, evaluate, and iterate on large language models at Reddit scale.

  • Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors.
  • Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters. You will tackle challenges related to automated recovery, cluster-scale health monitoring, and advanced checkpointing to ensure optimal compute efficiency.
  • Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO). You will abstract away the complexity of distributed training frameworks, integrating them into a seamless platform SDK that handles configuration, experiment tracking, and model lifecycle management.
  • Develop Comprehensive Evaluation & Benchmarking Infrastructure: Treat model evaluation as a first-class platform capability. You will build scalable systems for automated regression detection, structured metrics tracking, and complex inference-heavy evaluation patterns to ensure the quality and safety of models before they hit production.
  • Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets (text, image, video) required for modern GenAI workloads, optimizing for throughput and dynamic batching.
  • Provide Technical Leadership & Mentorship: Analyze complex bottlenecks in distributed systems to optimize for performance and cost-efficiency. Mentor senior engineers, champion a rigorous MLOps culture, and partner with cross-functional leadership to define technical roadmaps and de-risk major initiatives.

Who You Might Be:

  • 10+ years of work experience in a production software development environment or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.
  • GenAI/LLM Infrastructure Expertise: Proven