Senior Software Engineer, AI Reliability Engineering
Anthropic
23 days ago
London, United Kingdom
Senior
H1B Sponsor
Responsibilities
- Develop Service Level Objectives for language model serving and training systems.
- Design and implement monitoring systems for availability and latency.
- Assist in creating high-availability language model serving infrastructure.
- Manage automated failover and recovery systems across multiple regions.
- Lead incident response for critical AI services.
- Build cost optimization systems for large-scale AI infrastructure.
Requirements
- Extensive experience with distributed systems observability and monitoring.
- Understanding of challenges in operating AI infrastructure.
- Proven experience with SLO/SLA frameworks for critical services.
- Comfort working with both traditional and AI-specific metrics.
- Experience with chaos engineering and resilience testing.
- Ability to bridge gaps between ML engineers and infrastructure teams.
- Excellent communication skills.
Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Collaborative office space.
Categories
AI & ML, DevOps