Base Salary
$209k - $253k/yr
Responsibilities
- Design and operate reliable managed AI services focused on LLM workloads.
- Build automation and reliability tooling for distributed AI pipelines.
- Define, measure, and improve SLIs/SLOs for AI workloads.
- Collaborate with teams to optimize large-scale training and inference clusters.
- Automate observability and performance tuning for latency-sensitive AI services.
- Investigate and resolve reliability issues in distributed AI systems.
- Contribute to the architecture of next-generation distributed systems.
Requirements
- Strong software engineering background with experience in production-grade systems.
- Demonstrated experience in distributed systems design and implementation.
- Hands-on work with large language models or AI/ML infrastructure.
- Experience defining and measuring SLIs/SLOs and building monitoring systems.
- Proficiency in at least one modern programming language (Python, Go, Java, C++).
- Familiarity with Kubernetes or container orchestration platforms.
- Strong collaboration and communication skills.
Benefits
- Industry-competitive pay.
- Restricted Stock Units in a fast-growing technology company.
- Health insurance options, including HDHP and PPO plans, plus vision and dental.
- Employer contributions to HSAs.
- Paid parental leave.
- Paid life insurance, plus short-term and long-term disability coverage.
- 401(k) with a 100% match up to 4% of salary.
- Generous paid time off and holiday schedule.
- Tuition reimbursement.
- Company-paid commuter benefit of $300 per month.
