Okta

Staff Site Reliability Engineer

about 3 hours ago
Bengaluru, India
Staff+
H1B Sponsor

Responsibilities

  • Design, build, and operate large-scale cloud infrastructure and production services.
  • Participate in an on-call rotation supporting highly available customer-facing systems.
  • Lead incident response efforts and drive post-incident reviews focused on systemic improvements.
  • Define, measure, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
  • Partner with engineering teams to improve service availability, scalability, performance, and resilience.
  • Continuously improve observability through metrics, logging, tracing, dashboards, and alerting.
  • Develop software, automation, and infrastructure using Go, Python, Terraform, and related technologies.
  • Eliminate operational toil through automation, tooling, and platform engineering.
  • Improve deployment safety and operational workflows through CI/CD and GitOps practices.
  • Collaborate on modernizing existing workloads and aligning them with evolving platform capabilities.
  • Build self-service platforms, operational guardrails, and automation that improve developer velocity while maintaining reliability and security.
  • Lead complex reliability initiatives spanning multiple engineering teams.
  • Guide engineers in adopting operational best practices and reliability engineering principles.
  • Mentor engineers through technical collaboration, design reviews, incident analysis, and knowledge sharing.
  • Influence architecture and operational decisions through data-driven recommendations and engineering expertise.
  • Drive projects from conception through production rollout and long-term operational ownership.
  • Explore and apply AI-assisted engineering techniques to improve operational efficiency, incident response, troubleshooting, and automation.

Requirements

  • Strong experience operating large-scale production services in AWS and/or GCP.
  • Deep expertise with Kubernetes in production environments.
  • Experience troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle issues.
  • Extensive experience with Infrastructure as Code technologies such as Terraform and Helm.
  • Strong software engineering skills in Go and/or Python.
  • Experience building automation and internal engineering platforms.
  • Experience operating and troubleshooting distributed data platforms such as PostgreSQL, Redis, OpenSearch, MySQL, or similar technologies.
  • Strong understanding of cloud networking fundamentals including DNS, load balancing, ingress, TLS, service networking, and traffic management.
  • Experience with observability platforms, monitoring strategies, and production telemetry.
  • Experience with or strong interest in AI-assisted engineering and operational automation.
  • Strong expertise in operating customer-facing production systems.
  • Experience leading incident response and driving operational improvements.
  • Deep understanding of reliability engineering concepts including SLIs, SLOs, error budgets, and capacity planning.
  • Strong understanding of CI/CD pipelines, deployment strategies, and automation-first operational practices.
  • Proven ability to balance reliability, scalability, security, and engineering velocity.