Lead Systems Reliability Engineer (Linux & Distributed Systems)

The Trade Desk

London

On-site

Posted March 2, 2026

Job Description

The Trade Desk is a global technology company with a mission to create a better, more open internet for everyone through principled, intelligent advertising. Handling over 1 trillion queries per day, our platform operates at an unprecedented scale. We have also built something even stronger and more valuable: an award-winning culture based on trust, ownership, empathy, and collaboration. We value the unique experiences and perspectives that each person brings to The Trade Desk, and we are committed to fostering inclusive spaces where everyone can bring their authentic selves to work every day.

Do you have a passion for solving hard problems at scale? Are you eager to join a dynamic, globally connected team where your contributions will make a meaningful difference in building a better media ecosystem? Come and see why Fortune magazine consistently ranks The Trade Desk among the best small- to medium-sized workplaces globally.

What we do

We are looking to hire a Lead Systems Reliability Engineer to join our engineering team and continue building and maintaining our data-driven platform. We leverage technologies like Aerospike, MongoDB, and Kafka to perform many real-time activities with a p99 latency under 1 millisecond on the back end!

Do you enjoy tuning, performance testing, troubleshooting, automation, and operating at scale? Does testing next-gen hardware, evaluating data access patterns, and designing automation around distributed systems excite you?

What makes this role different:

First in the Industry: The Trade Desk is the first company to run over 5MM QPS to NVMe in Aerospike on a single node, forcing core software redesigns to achieve this scale.
Work on Cutting-Edge Hardware: Design clusters with nodes featuring 300TB of NVMe, 3TB RAM, and 512 cores, delivering a global 2,500GB/s throughput directly from flash.
Shape the Future of Infrastructure: Spec your own systems and collaborate directly with AMD and NoSQL vendors to run PoCs and optimize bleeding-edge technology for internet-scale workloads.
Deep Performance Engineering: Dive into kernel, hardware, and system interactions, leveraging tools like flamegraphs, NUMA counters, BIOS tuning, and synthetic testing to achieve world-class performance.
Push Hardware Endurance Limits: Build clusters engineered to withstand over 1 zettabyte of endurance.

What you’ll do:

Lead a team to influence, manage, and plan work streams, systems, and data structures at scale within a global ecosystem, spanning multiple infrastructure providers (cloud and traditional datacenters).
Encourage, improve, and build infrastructure automation in a way that works with stateful systems at scale.
Own operation
