About the role

Aplyr's Quick Take

This role is focused on optimizing and enhancing the training processes for large language models at JetBrains. You'll be hands-on with GPU programming, profiling, and improving the performance of multi-node training pipelines. It's a technical, individual contributor role that requires deep expertise in machine learning frameworks and GPU architecture.

Good fit

Ideal candidates have strong backgrounds in PyTorch and GPU programming, with experience running large-scale distributed training jobs. You should be comfortable with complex technical challenges and enjoy problem-solving in a fast-paced environment.

Worth noting

The role involves cutting-edge work with large language models, which can be a unique opportunity for those looking to push the boundaries of AI. However, expect a steep learning curve if you lack experience with specific tools like Triton or CUDA.

At JetBrains, code is our passion. Ever since we started back in 2000, we have been striving to make the strongest, most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create.

We’re looking for a Research Engineer who will own the training stack and model architecture for our Mellum LLM family. Your job is easier said than done: make training faster, cheaper, and more stable at large scale. You’ll profile, design, and implement changes to the training pipeline – from architecture to custom GPU kernels, as needed.

As part of our team, you will:

Be responsible for improving end-to-end performance for multi-node LLM pre-training and post-training pipelines.
Profile hotspots (Nsight Systems/Compute, NVTX) and fix them using compute/comm overlap, kernel fusion, scheduling, etc.
Design and evaluate architecture choices (depth/width, attention variants including GQA/MQA/MLA/Flash-style, RoPE scaling/NTK, and MoE routing and load-balancing).
Implement custom ops (Triton and/or CUDA C++), integrate via PyTorch extensions, and upstream when possible.
Push memory/perf levers: FSDP/ZeRO, activation checkpointing, FP8/TE, tensor/pipeline/sequence/expert parallelism, NCCL tuning.
Harden large runs by building elastic and fault-tolerant training setups, ensuring robust checkpointing, strengthening reproducibility, and improving resilience to preemption.
Keep the data path fast with streaming and sharded data loaders and tokenizer pipelines, and improve overall throughput and cache efficiency.
Define the right metrics, build dashboards, and deliver steady improvements.
Run both pre-training and post-training (including SFT, RLHF, and GRPO-style methods) efficiently across sizable clusters.

We’ll be happy to bring you on board if you have:

Strong PyTorch and PyTorch Distributed experience, having run multi-node jobs with tens to hundreds of GPUs.
Hands-on experience with Megatron-LM/Megatron-Core/NeMo, DeepSpeed, or serious FSDP/ZeRO expertise.
Real profiling expertise (Nsight Systems/Compute, nvprof) and experience with NVTX-instrumented workflows.
GPU programming skills with Triton and/or CUDA, and the ability to write, test, and debug kernels.
A solid understanding of NCCL collectives, as well as topology and fabric effects (IB/RoCE), and how they show up in traces.

Our ideal candidate would have experience with:

FlashAttention-2 and 3, CUTLASS and CuTe, TransformerEngine and FP8, Inductor, AOTAutograd, and torch.compile.
MoE at scale (expert parallel, router losses, capacity management) and long-context tricks (ALiBi/YaRN/NTK scaling).
Kubernetes or SLURM at scale, placement and affinity tuning, as well as AWS, GCP, and Azure GPU fleets.
Web-scale data plumbing (streaming datasets, Parquet and TFRecord, tokenizer perf), eval harnesses, and benchmarking.
Safety and post-training methods, such as DPO, ORPO, GRPO, and reward models.
Inference ecosystems and techniques, such as vLLM and paged KV caching.

We are an equal opportunity employer

We know great ideas can come from anyone, anywhere. That’s why we do our best to create an open and inclusive workplace – one that welcomes everyone regardless of their background, identity, religion, age, accessibility needs, or orientation.

We process the data provided in your job application in accordance with the Recruitment Privacy Policy.

Skills & Tags

node aws gcp azure kubernetes ai data product

Aplyr's read

JetBrains is a leader in intelligent development tools, attracting tech-savvy professionals passionate about enhancing developer productivity and innovation.
Synthesized from recent postings & public sources

What's promising

• JetBrains' tools like IntelliJ IDEA are industry standards, widely adopted by developers.
• The company is at the forefront of AI integration in development tools.
• JetBrains offers a diverse range of roles, from AI to product management.

What to watch

• JetBrains faces intense competition from other development tool providers.
• The company's rapid growth may lead to internal communication challenges.
• Public information about JetBrains' workplace culture and employee satisfaction is limited.

Why JetBrains

• JetBrains emphasizes intelligent automation in developer tools, setting it apart.
• The company has a strong focus on AI-driven product development.
• JetBrains supports a wide array of programming languages, enhancing developer flexibility.

Aplyr’s read is generated by AI from public sources.

About JetBrains

jetbrains.com

JetBrains is a software development company known for creating intelligent development tools that streamline programming and enhance productivity. Its products, such as IntelliJ IDEA and PyCharm, are widely used by developers around the world.