Principal Site Reliability Engineer
UiPath
about 1 month ago
Tokyo, Japan
Staff+
H1B Sponsor
Responsibilities
- Lead Incident Command for high-stakes technical events.
- Serve as a key escalation point for complex issues.
- Own the communication life cycle during active incidents.
- Lead thorough retrospectives and drive automated self-healing solutions.
- Define and improve service health through SLIs and SLOs.
- Design automation to reduce manual intervention during incidents.
- Partner with development teams to promote service reliability.
- Mentor and support other engineers in SRE best practices.
Requirements
- 7+ years in SRE, Cloud Operations, or a related technical field.
- At least 3 years in a lead responder or command-oriented role.
- Demonstrated ability to remain calm and decisive under pressure.
- Strong proficiency in Python or Go and an understanding of distributed systems.
- Deep experience with observability tools such as Prometheus and Grafana.
- Willingness to participate in on-call rotations as an Incident Commander.
- Proficiency in English and Japanese for effective communication.
Tech Stack
Azure, Go, Grafana, Kubernetes, Prometheus, Python, Terraform
Categories
AI & ML, DevOps