Responsibilities
- Monitor and maintain mission-critical production services to ensure maximum uptime.
- Design and implement scalable distributed systems to facilitate the development of self-driving vehicles.
- Design and implement an incident management framework and build a culture of blameless postmortems and continuous learning.
- Scale the reliability and velocity of our systems and processes through increased automation.
- Document actions to build a comprehensive library of runbooks, which will act as a knowledge base and foundation for automation.
- Participate in an on-call rotation to uphold the SLOs and SLAs of production services.
Requirements
- Expertise in at least one scripting language (e.g. Bash, Python).
- Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
- Experience scaling and securing services in the cloud (AWS, GCP) or cloud-native environments.
- Experience using infrastructure-as-code principles to automate the creation of infrastructure resources (e.g. Terraform, CloudFormation).
- Understanding of engineering design limitations and the ability to guide teams in scaling their services to reach the desired performance within budget.
- Strong experience implementing and debugging cloud-native and open-source tools such as Kubernetes, etcd, Prometheus, OpenTelemetry, and Istio.
- Strong communication skills and the ability to work effectively in a diverse and distributed team.