Site Reliability Engineer - Big Data (7 to 11 years)
PhonePe
Job Description
About PhonePe Limited:
PhonePe Limited is headquartered in India. Its flagship product, the PhonePe digital payments app, was launched in Aug 2016. As of April 2025, PhonePe has over 60 Crore (600 Million) registered users and a digital payments acceptance network spread across over 4 Crore (40+ Million) merchants. PhonePe also processes over 33 Crore (330+ Million) transactions daily with an Annualized Total Payment Value (TPV) of over INR 150 lakh crore.
PhonePe’s portfolio of businesses includes the distribution of financial products (Insurance, Lending, and Wealth) as well as new consumer tech businesses (Pincode - hyperlocal e-commerce, and Indus AppStore - localized app store for the Android ecosystem) in India, which are aligned with the company’s vision to offer every Indian an equal opportunity to accelerate their progress by unlocking the flow of money and access to services.
Culture:
At PhonePe, we go the extra mile to make sure you can bring your best self to work, every day! And that starts with creating the right environment for you. We empower people and trust them to do the right thing. Here, you own your work from start to finish, right from day one. PhonePe-rs solve complex problems and execute quickly, often building frameworks from scratch. If you’re excited by the idea of building platforms that touch millions, ideating with some of the best minds in the country, and executing on your dreams with purpose and speed, join us!
About the Role:
This role is responsible for managing and maintaining complex, distributed big data ecosystems. It ensures the reliability, scalability, and security of large-scale production infrastructure. Key responsibilities include automating processes, optimizing workflows, troubleshooting production issues, and driving system improvements across multiple business verticals.
Roles and Responsibilities:
- Manage, maintain, and support incremental changes to Linux/Unix environments.
- Lead on-call rotations and incident responses, conducting root cause analysis and driving postmortem processes.
- Design and implement automation systems for managing big data infrastructure, including provisioning, scaling, upgrades, and patching clusters.
- Troubleshoot and resolve complex production issues while identifying root causes and implementing mitigating strategies.
- Design and review scalable and reliable system architectures.
- Collaborate with teams to optimize overall system performance.
- Enforce security standards across systems and infrastructure.
- Set technical direction, drive standardization, and operate independently.
- Ensure availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
- Resolve, analyze, and respond to system outages and disruptions and implement measures to prevent similar incidents from recurring.
- Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
- Monitor and optimize system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
- Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.
- Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities.
- Develop and enforce SRE best practices and principles.
- Align across functional teams on priorities and deliverables.
- Drive automation to enhance operational efficiency.
Skills Required:
- Over 7 years of experience managing and maintaining distributed big data ecosystems.
- Strong expertise in Linux, including IP, iptables, and IPsec.
- Proficiency in scripting/programming with languages like Perl, Golang, or Python.