Stack AV

Senior Site Reliability Engineer

about 3 hours ago
Remote, Worldwide or Pittsburgh, PA, USA
Senior
H1B Sponsor

Responsibilities

  • Monitor and maintain mission-critical production services to ensure maximum uptime.
  • Design and implement scalable distributed systems to facilitate the development of self-driving vehicles.
  • Design and implement an incident management framework and build a culture of blameless postmortems and continuous learning.
  • Scale the reliability and velocity of our systems and processes through increased automation.
  • Document actions to build a comprehensive library of runbooks, which will act as a knowledge base and foundation for automation.
  • Participate in an on-call rotation to uphold the SLOs and SLAs of production services.

Requirements

  • Expertise in at least one scripting language (e.g. Bash, Python).
  • Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
  • Experience scaling and securing services in the cloud (AWS, GCP) or cloud-native environments.
  • Experience using infrastructure-as-code principles to automate the creation of infrastructure resources (e.g. Terraform, CloudFormation).
  • Understanding of engineering design limitations and the ability to guide teams in scaling their services to reach the desired performance within budget.
  • Strong experience implementing and debugging cloud-native and open-source tools such as Kubernetes, etcd, Prometheus, OpenTelemetry, and Istio.
  • Strong communication skills and the ability to work effectively in a diverse and distributed team.