Staff Site Reliability Engineer

Okta

Bengaluru, India

Staff+

H1B Sponsor

Responsibilities

Design, build, and operate large-scale cloud infrastructure and production services.
Participate in an on-call rotation supporting highly available customer-facing systems.
Lead incident response efforts and drive post-incident reviews focused on systemic improvements.
Define, measure, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Partner with engineering teams to improve service availability, scalability, performance, and resilience.
Continuously improve observability through metrics, logging, tracing, dashboards, and alerting.
Develop software, automation, and infrastructure using Go, Python, Terraform, and related technologies.
Eliminate operational toil through automation, tooling, and platform engineering.
Improve deployment safety and operational workflows through CI/CD and GitOps practices.
Collaborate on modernizing existing workloads and aligning them with evolving platform capabilities.
Build self-service platforms, operational guardrails, and automation that improve developer velocity while maintaining reliability and security.
Lead complex reliability initiatives spanning multiple engineering teams.
Guide engineers in adopting operational best practices and reliability engineering principles.
Mentor engineers through technical collaboration, design reviews, incident analysis, and knowledge sharing.
Influence architecture and operational decisions through data-driven recommendations and engineering expertise.
Drive projects from conception through production rollout and long-term operational ownership.
Explore and apply AI-assisted engineering techniques to improve operational efficiency, incident response, troubleshooting, and automation.

Requirements

Strong experience operating large-scale production services in AWS and/or GCP.
Deep expertise with Kubernetes in production environments.
Experience troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle issues.
Extensive experience with Infrastructure as Code technologies such as Terraform and Helm.
Strong software engineering skills in Go and/or Python.
Experience building automation and internal engineering platforms.
Experience operating and troubleshooting distributed data platforms such as PostgreSQL, Redis, OpenSearch, MySQL, or similar technologies.
Strong understanding of cloud networking fundamentals including DNS, load balancing, ingress, TLS, service networking, and traffic management.
Experience with observability platforms, monitoring strategies, and production telemetry.
Experience with or strong interest in AI-assisted engineering and operational automation.
Strong expertise operating customer-facing production systems.
Experience leading incident response and driving operational improvements.
Deep understanding of reliability engineering concepts including SLIs, SLOs, error budgets, and capacity planning.
Strong understanding of CI/CD pipelines, deployment strategies, and automation-first operational practices.
Proven ability to balance reliability, scalability, security, and engineering velocity.

Tech Stack

AWS, Datadog, Git, Go, Google Cloud Platform, Helm, Kubernetes, PostgreSQL, Python, Redis, Splunk, Terraform
