about 3 hours ago
Atlanta, GA, USA
Senior
Base Salary
$116k - $175k/yr
Responsibilities
- Engage with Engineering, Operations, and Product teams to maintain a highly available application platform.
- Build and implement application observability and platform monitoring tools.
- Automate processes and improve code to eliminate toil.
- Evaluate new ideas and trends for potential tools and techniques.
- Collaborate to identify gaps and resolve issues.
- Define and maintain SLIs and SLOs aligned with customer experience.
- Design and instrument SLIs such as latency and error rates.
- Manage and enforce error budgets to balance reliability and feature velocity.
- Improve alert quality by reducing noise and focusing on actionable alerts.
- Embed with product teams to catch reliability risks early.
- Share knowledge and findings with the engineering organization.
- Build scripts for operational automation and incident response.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 4+ years of application development experience with Java or equivalent.
- Experience with the Spring ecosystem.
- Experience in cloud-based infrastructure (Azure, AWS, GCP).
- Knowledge of factors affecting software application performance.
- Experience with observability tools (Datadog, Prometheus, Grafana).
- Knowledge of CI/CD pipelines (Jenkins) and infrastructure-as-code (Terraform).
- Experience deploying AI systems in production.
- Familiarity with Kubernetes and container orchestration.
- Experience with distributed systems at scale.
Benefits
- Comprehensive healthcare coverage.
- Flexible PTO.
- Equity RSUs and annual performance bonus opportunities.
- Retirement account support.
- 14+ weeks of paid parental leave.
- Career development opportunities.
- Company-paid privacy certification exam fees.
Tech Stack
AWS, Azure, Bash, Datadog, Google Cloud Platform, Grafana, Helm, Java, Jenkins, Kubernetes, Prometheus, Python, Ruby, SQL, Terraform