Site Reliability Engineer - Kubernetes - Data Platforms

Adyen

Amsterdam

Remote

Posted April 14, 2026

Job Description

This is Adyen

Adyen provides payments, data, and financial products in a single solution for customers like Meta, Uber, H&M, and Microsoft - making us the financial technology platform of choice. At Adyen, everything we do is engineered for ambition.

For our teams, we create an environment with opportunities for our people to succeed, backed by the culture and support to ensure they are enabled to truly own their careers. We are motivated individuals who tackle unique technical challenges at scale and solve them as a team. Together, we deliver innovative and ethical solutions that help businesses achieve their ambitions faster.

Platform/Site Reliability Engineer - Kubernetes & Big Data

You will be building the rails of a self-service data platform inside Adyen, creating an ecosystem that is bigger than the sum of its parts. By blending Site Reliability Engineering, Software Engineering, Systems Engineering, and Data Engineering, you will power the many data, machine learning, and GenAI products running across Adyen.

You’ll be joining a dedicated team of 9 engineers, split between Kubernetes cluster management and the core services running on top of the clusters. We work in a flexible, Kanban-style environment, sitting right in the middle of our users. This proximity gives us a direct feedback loop, allowing us to build impactful solutions for both the "happy flow" and the "sad flow."

Beyond operations, you’ll have the opportunity to design, build, and scale infrastructure from the ground up on our on-premise environments, solving for yourself the problems that managed cloud providers typically handle. If you thrive on tackling real-life challenges and reducing manual toil through automation, and want unparalleled growth opportunities in SWE, Systems, or Data Engineering, this is your team.

What you’ll do

Design & Build On-Premise (Kubernetes) Infrastructure: Architect and scale modern, cloud-like services from the ground up on our on-premise infrastructure, managing core foundational layers including DNS, TLS, certificate management, and load balancers, and carrying out deep troubleshooting.
Cluster Provisioning & Reliability: Build, maintain, and scale new Kubernetes clusters and Big Data services. You will maintain agreed SLOs, ensure high availability, and support end-users by keeping them unblocked.
Mixed Workload Balancing: Prevent resource starvation by ensuring massive batch compute and ML training jobs do not consume resources required by critical, user-facing GenAI inference services and API gateways.
Advanced Scheduling & Hardware Management: Enforce strict priority, preemption, and specialized scheduling policies (such as gang scheduling). Orchestrate diverse hardware profiles, managing GPU node pools, drivers, device plugins, and resource slicing to support intensive ML/AI processing.
Storage & Network Optimization: Scale stateful workloads, Persistent Volumes (PVs), and high-throughput networking interfaces to handle massive data gravity and mitigate I/O bottlenecks.
FinOps & Security: Implement intelligent autoscaling and interruptible instance management to control bursty infrastructure costs. Apply strict resource quotas, RBAC, and network policies to prevent "noisy neighbor" disruptions and guarantee secure isolation across different tenant teams.
Automation & Operations: Dedicate time to developing new features, applying releases, and building automations that eliminate unacceptable toil. Participate in an expanding 24x7 on-call roster to support the platform.

Who you are

Experienced Platform/SRE Professional: You have a strong background in System Administration and Kubernetes management, with proven experience building and operating distributed systems.
Technical Expertise: You have hands-on experience with K8s, Linux, foundational networking (DNS, TLS, load balancing), and GitOps tooling such as ArgoCD.
Tooling & Ecosystems: You are highly proficient with configuration management and/or networking tools (Ansible, Puppet, Cilium, HAProxy, Nginx) and/or distributed storage and data systems (Ha
