Responsibilities
- Ensure the reliability of software systems by designing and maintaining scalable infrastructure.
- Develop automation tools and scripts to streamline operational tasks.
- Monitor system performance and respond to incidents promptly.
- Analyze system usage patterns and forecast future capacity needs.
- Identify and address performance bottlenecks in software systems.
- Implement infrastructure as code practices using tools like Terraform.
- Maintain monitoring and logging solutions that provide insight into system behavior.
- Collaborate with security teams to implement security best practices.
- Develop and maintain disaster recovery plans.
- Continuously analyze system performance for improvement opportunities.
- Provide mentorship and coaching to team members.
Requirements
- 10 to 15 years of experience in site reliability engineering.
- B.Tech/M.Tech in computer science, information technology, or a related field.
- Experience working for a product organization is a plus.
- Certifications from cloud service providers like AWS or Google Cloud are a plus.
- Proficiency in programming and scripting languages such as Python, Go, or Bash.
- Strong automation skills using tools like Ansible or Terraform.
- Experience with containerization and orchestration technologies like Docker and Kubernetes.
- Proficiency in cloud platforms such as AWS, Azure, or Google Cloud.
- Familiarity with monitoring tools like Prometheus or Grafana.
- Understanding of networking concepts and security best practices.