Senior Production Engineer

about 2 months ago

San Francisco, CA, USA or Sunnyvale, CA, USA

Senior

H1B Sponsor

Base Salary

$209k - $253k/yr

Responsibilities

Design and operate reliable managed AI services focused on LLM workloads.
Build automation and reliability tooling for distributed AI pipelines.
Define, measure, and improve SLIs/SLOs for AI workloads.
Collaborate with teams to optimize large-scale training and inference clusters.
Automate observability and performance tuning for latency-sensitive AI services.
Investigate and resolve reliability issues in distributed AI systems.
Contribute to the architecture of next-generation distributed systems.

Qualifications

Strong software engineering background with experience in production-grade systems.
Demonstrated experience in distributed systems design and implementation.
Hands-on work with large language models or AI/ML infrastructure.
Experience with defining and measuring SLIs/SLOs and building monitoring systems.
Proficiency in at least one modern programming language (Python, Go, Java, C++).
Familiarity with Kubernetes or container orchestration platforms.
Strong collaboration and communication skills.