Site Reliability Engineer
Apple
Job Description
Summary
The collection of our people and their ideas encourages innovation in everything we do. Imagine what you could do here! Join Apple, and help us leave the world better than we found it. At Apple, new ideas have a way of becoming phenomenal products, services, and customer experiences very quickly. Every single day, people do amazing things at Apple. Do you want to be part of a team that builds cutting-edge software services, a team that is continually innovating and is proud of making a difference? If so, bring your passion and talent and join us to be part of something big and amazing. Join the AI and Data Platforms team at Apple, where we build and manage cloud-based data platforms handling petabytes of data at scale. We are looking for a passionate Software Engineer specializing in reliability engineering for data platforms, with a strong understanding of data and ML systems.
Description
As a Data Platform SRE, you will be responsible for developing and operating our big data platform, using open source or other solutions, to support critical applications such as analytics, reporting, and AI/ML apps. This includes optimizing performance and cost, automating operations, and identifying and resolving production issues to ensure the best data platform experience.
Minimum Qualifications
Experience: 5+ years in software site reliability engineering or software development roles.
Programming: Proficient in at least one of Python, Golang, or Java. Skilled at coding for distributed systems and developing resilient data pipelines.
Cloud Platforms: Hands-on experience with at least one major cloud platform (AWS, Azure, or Google Cloud Platform).
Preferred Qualifications
Expertise in designing, building, and operating critical, large-scale distributed systems with a focus on low latency, fault tolerance, and high availability.
Experience contributing to open source projects is a plus.
Experience with multiple public cloud infrastructures, managing multi-tenant Kubernetes clusters at scale, and debugging Kubernetes/Spark issues.
Experience with workflow and data pipeline orchestration tools (e.g., Airflow, dbt).
Understanding of data modeling and data warehousing concepts.
Familiarity with the AI/ML stack, including GPUs, MLflow, or Large Language Models (LLMs).
Data Structures & Algorithms: Strong foundation and application experience.
Distributed Systems: Solid understanding and hands-on experience managing at least one distributed system (e.g., Kafka, Spark, Flink).
Solid understanding of software engineering best practices, including the full development lifecycle, secure coding, and experience building reusable frameworks or libraries.
Problem Solving: Demonstrated ability to independently troubleshoot and resolve complex technical issues.
Creative Thinking: A track record of proposing and implementing innovative solutions to technical challenges.