Site Reliability Engineer (7 to 11 years) - Big Data
PhonePe
9 months ago
Bengaluru, India
Senior / Staff+
Responsibilities
- Manage, maintain, and support incremental changes to Linux/Unix environments.
- Lead on-call rotations and incident responses, conducting root cause analysis.
- Design and implement automation systems for managing big data infrastructure.
- Troubleshoot and resolve complex production issues.
- Design and review scalable and reliable system architectures.
- Collaborate with teams to optimize overall system performance.
- Enforce security standards across systems and infrastructure.
- Set technical direction and drive standardization.
- Ensure availability, performance, and scalability of systems.
- Analyze and resolve system outages and disruptions.
- Develop tools and scripts to automate operational processes.
- Monitor and optimize system performance and resource usage.
- Collaborate with development teams to integrate best practices.
- Stay informed of industry technology trends and innovations.
- Develop and enforce SRE best practices and principles.
- Align cross-functional teams on priorities and deliverables.
- Drive automation to enhance operational efficiency.
Requirements
- Over 6 years of experience managing distributed big data ecosystems.
- Strong expertise in Linux, including IP, iptables, and IPsec.
- Proficiency in scripting or programming in languages such as Perl, Go, or Python.
- Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).
- Familiarity with open-source configuration management tools such as Puppet, Salt, Chef, or Ansible.
- Solid understanding of networking and open-source technologies.
- Excellent communication and collaboration skills.
- Experience with DevOps tools: SaltStack, Ansible, Docker, Git.
- Familiarity with SRE logging and monitoring tools: ELK stack, Grafana, Prometheus.
Benefits
- Medical, Critical Illness, Accidental, and Life Insurance.
- Employee Assistance Program and Onsite Medical Center.
- Maternity and Paternity Benefits, Adoption Assistance, and Day-care Support.
- Relocation benefits and Transfer Support Policy.
- Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS.
- Higher Education Assistance, Car Lease, and Salary Advance Policy.
Tech Stack
Ansible, Apache Airflow, Apache Hadoop, Apache HBase, Apache Kafka, AWS, Azure, Chef, Docker, Git, Go, Google Cloud Platform, Grafana, Linux, Perl, Prometheus, Puppet, Python, YARN
Categories
Data Engineering, DevOps