DigiCert

Availability Engineer

about 14 hours ago
Bengaluru, India
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Own incident management practices across all production systems.
  • Act as the primary Incident Manager for high-priority production incidents.
  • Administer and optimize CI/CD pipelines for safe and frequent deployments.
  • Continuously improve incident response runbooks and escalation matrices.
  • Drive root cause analysis for major incidents and track action items.
  • Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
  • Establish and enforce SLA/SLO/SLI frameworks across production services.
  • Build automated runbooks and self-healing mechanisms.
  • Implement synthetic monitoring to detect customer-facing issues.
  • Utilize Splunk for incident investigation and observability.

Requirements

  • 4+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles.
  • Hands-on experience leading incident response for high-severity production incidents.
  • Strong background in Linux systems administration and distributed systems troubleshooting.
  • Experience defining and managing SLOs, SLIs, and Error Budgets in production.

Benefits

  • Generous time off policies.
  • Top-shelf benefits.
  • Education, wellness, and lifestyle support.

Tech Stack

Bash, Docker, GitHub Actions, Harness, Helm, Kubernetes, Nagios, Python, Splunk, Terraform

Categories