about 3 hours ago
San Francisco, CA, USA
Staff+
H1B Sponsor
Base Salary
$200k - $230k/yr
Responsibilities
- Shape the engineering organization's standards for observability.
- Own and evolve the observability platform, including distributed logging, metrics, and tracing infrastructure.
- Build AI-native capabilities to automatically detect anomalies, diagnose failures, and accelerate root cause analysis.
- Create powerful developer experiences through dashboards, notebooks, and interactive debugging tools.
- Drive reliability automation with intelligent alerting, diagnostics, and incident response systems.
- Partner across engineering teams to embed observability and reliability best practices.
- Mentor engineers and influence reliability culture across the organization.
Requirements
- Have 8+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure.
- Experience implementing or operating observability platforms such as Datadog, Sentry, Splunk, or similar.
- Have strong coding proficiency in at least one of Ruby, Python, or TypeScript.
- Strategic systems thinker who identifies high-impact opportunities and builds scalable solutions.
- Experience operating large-scale distributed systems in production, especially logging platforms or time-series databases.
- Strong fundamentals in systems, networking, and cloud infrastructure such as Kubernetes and AWS.
- Thrive in ambiguous environments and roll up your sleeves to solve unscoped problems end to end.
- Strong communicator who can align technical and non-technical stakeholders.
- Bonus if you have built or contributed to observability ecosystems such as OpenTelemetry or Prometheus.
Tech Stack
AWS, Datadog, Kubernetes, Linux, Prometheus, Python, Ruby, Splunk, Terraform, TypeScript
Categories
AI & ML, Backend, Data Science, DevOps