Responsibilities
- Own the reliability, scalability, and operational health of Gradial’s production platform.
- Lead the evolution of Kubernetes, CI/CD, observability, and infrastructure as code.
- Set the standard for designing, shipping, and operating reliable systems.
- Build tooling and automation to help engineers work faster and with more confidence.
- Drive improvements in monitoring, alerting, incident response, and service readiness.
- Partner with engineering to identify and solve scaling risks early.
- Influence the long-term direction of the platform across reliability, security, performance, and cost.
Requirements
- 5+ years of experience in platform engineering, infrastructure, SRE, DevOps, or related roles.
- Proven success in designing and operating production-grade infrastructure in high-growth environments.
- Deep expertise in Kubernetes, cloud-native architecture, and container orchestration.
- Strong experience with infrastructure as code, GitOps, CI/CD workflows, and modern deployment practices.
- Strong command of observability and reliability fundamentals.
- A track record of leading through influence and making sound technical decisions.
Benefits
- Meaningful equity and competitive salary.
- Comprehensive health, dental, and vision coverage.
- Fast-paced environment with autonomy and ownership.
- Real impact with zero bureaucracy.
- A front-row seat to building category-defining AI infrastructure.
