4 months ago
San Francisco, CA, USA
Senior / Mid Level
Responsibilities
- Improve the reliability, availability, and operational health of production systems.
- Set observability standards across services including metrics, logs, and errors.
- Establish SLOs/SLIs, alerting, and on-call readiness with a focus on signal quality.
- Collaborate with engineers to design resilient systems and reduce operational risk early.
- Build internal tooling to enhance system safety, debugging, and developer velocity.
- Manage infrastructure using Pulumi across GCP, AWS, and Firebase.
Requirements
- 5+ years of SRE, DevOps, or production operations experience.
- At least 2 years of TypeScript web app development experience.
- Proven experience operating and scaling production systems with uptime and latency goals.
- Strong hands-on experience with observability stacks like Datadog or Sentry.
- Experience defining SLOs/SLIs and building effective alerting strategies.
- Proficiency with CI/CD systems and infrastructure-as-code.
- Experience with cloud-native and serverless platforms such as GCP and AWS.
- Strong cross-system debugging and incident response skills.
Tech Stack
AWS, Datadog, Firebase, Google Cloud Platform, TypeScript
