Member of Technical Staff, AI Reliability & Monitoring Engineering Lead
Postman
5 months ago
San Francisco, CA, USA
Mid Level / Senior / Staff+
H-1B Sponsor
Base Salary
$256k - $276k/yr
Responsibilities
- Develop and manage service level objectives (SLOs) and the reliability metrics behind them for AI-driven API services.
- Implement comprehensive observability and monitoring systems to track performance in real time.
- Design automated failover, recovery, and incident response strategies.
- Optimize resource utilization, particularly GPU/accelerator efficiency.
- Collaborate with engineering, platform, and product teams on reliability efforts.
- Lead the development of internal tooling and automation for AI system stability.
- Drive continuous improvement in deployment practices and incident management.
Requirements
- Strong background in AI reliability engineering, SRE, or DevOps for distributed systems.
- Understanding of challenges in maintaining large-scale AI systems.
- Experience with cloud platforms, monitoring tools, and incident response automation.
- Ability to collaborate across teams to influence best practices.
- Comfortable in dynamic, fast-paced environments focused on reliable AI services.
Benefits
- Comprehensive medical coverage.
- Flexible PTO and wellness reimbursement.
- Monthly lunch stipend.
- Hybrid work model with in-office collaboration.
- Frequent team-building events and donation-matching program.
Categories
AI & ML, DevOps