PhonePe

Site Reliability Engineer (7 to 11 years) - Big Data

9 months ago
Bengaluru, India
Senior / Staff+

Responsibilities

  • Manage, maintain, and support incremental changes to Linux/Unix environments.
  • Lead on-call rotations and incident responses, conducting root cause analysis.
  • Design and implement automation systems for managing big data infrastructure.
  • Troubleshoot and resolve complex production issues.
  • Design and review scalable and reliable system architectures.
  • Collaborate with teams to optimize overall system performance.
  • Enforce security standards across systems and infrastructure.
  • Set technical direction and drive standardization.
  • Ensure availability, performance, and scalability of systems.
  • Analyze and resolve system outages and disruptions.
  • Develop tools and scripts to automate operational processes.
  • Monitor and optimize system performance and resource usage.
  • Collaborate with development teams to integrate best practices.
  • Stay informed of industry technology trends and innovations.
  • Develop and enforce SRE best practices and principles.
  • Align cross-functional teams on priorities and deliverables.
  • Drive automation to enhance operational efficiency.

Requirements

  • Over 6 years of experience managing distributed big data ecosystems.
  • Strong expertise in Linux networking, including IP, iptables, and IPsec.
  • Proficiency in scripting/programming with languages like Perl, Golang, or Python.
  • Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).
  • Familiarity with open-source configuration management tools such as Puppet, Salt, Chef, or Ansible.
  • Solid understanding of networking and open-source technologies.
  • Excellent communication and collaboration skills.
  • Experience with DevOps tools: SaltStack, Ansible, Docker, Git.
  • Familiarity with SRE logging and monitoring tools: ELK stack, Grafana, Prometheus.

Benefits

  • Medical, Critical Illness, Accidental, and Life Insurance.
  • Employee Assistance Program and Onsite Medical Center.
  • Maternity and Paternity Benefits, Adoption Assistance, and Day-care Support.
  • Relocation benefits and Transfer Support Policy.
  • Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS.
  • Higher Education Assistance, Car Lease, and Salary Advance Policy.

Tech Stack

Ansible, Apache Airflow, Apache Hadoop, Apache HBase, Apache Kafka, AWS, Azure, Chef, Docker, Git, Go, Google Cloud Platform, Grafana, Linux, Perl, Prometheus, Puppet, Python, YARN

Categories

Data Engineering, DevOps