
Site Reliability Engineer - Platform Engineering
CodeRabbit, about 1 month ago
Responsibilities
- Design, implement, and maintain scalable infrastructure on Google Cloud Platform.
- Develop, own, and operate critical platform services.
- Build and maintain Infrastructure as Code using Terraform and Terragrunt.
- Establish and maintain SLI/SLO frameworks for critical services.
- Implement monitoring, alerting, observability, and incident management solutions.
- Conduct incident response and root cause analysis.
- Optimize application and infrastructure performance and cost.
- Design and implement chaos engineering practices.
- Develop self-service platforms and tooling for engineering teams.
- Automate operational tasks including scaling and security patching.
- Create and maintain infrastructure APIs and abstractions.
- Integrate security best practices into infrastructure and platform services.
- Implement security monitoring and compliance reporting.
- Design secure network architectures and establish disaster recovery procedures.
Requirements
- 6-8 years of experience in Site Reliability Engineering, Platform Engineering, or DevOps roles.
- Proven track record of managing production systems at scale.
- Strong background with cloud platforms, particularly GCP or AWS.
- Experience with containerization and orchestration platforms such as Docker and Kubernetes.
- Proficiency in Node.js and TypeScript for building automation tools.
- Advanced experience with Terraform for infrastructure management.
- Hands-on experience with monitoring platforms like Datadog.
- Strong Linux/Unix systems skills.
- Knowledge of security principles for cloud infrastructure.
- Familiarity with CI/CD tools and practices.
Benefits
- Work on cutting-edge technology with real-world impact.
- Collaborative and innovative environment.
- Competitive salary, equity, and benefits.
- Professional development opportunities.