about 4 hours ago
Base Salary
$140k - $288k/yr
Responsibilities
- Design and build AI agents to assist with service health analysis and reliability recommendations.
- Lead large-scale infrastructure modernization using AI to accelerate delivery and create self-service patterns.
- Transform consulting patterns into scalable platforms and tools for engineering teams.
- Create operational documentation and best practices to democratize reliability expertise.
- Develop software solutions to enhance the reliability of distributed systems.
- Build automation frameworks to reduce operational overhead and improve efficiency.
- Develop service level indicators to monitor system health and inform reliability decisions.
- Automate engineering processes to minimize risk and enhance innovation speed.
- Manage capacity and performance to optimize resource utilization across cloud infrastructures.
Requirements
- 5+ years of experience in building and operating large-scale distributed systems.
- Bachelor's degree in Computer Science or related field, or equivalent experience.
- Strong programming skills in Python or Go for building production-grade platforms.
- Deep knowledge of Linux/Unix internals and experience with open source infrastructure.
- Experience with Infrastructure as Code and orchestration tools such as Terraform, Puppet, or Kubernetes.
- Experience deploying web applications to cloud infrastructure such as AWS, GCP, or Azure.
- Preferred: experience developing AI agents for infrastructure automation.
- Experience with AI/ML infrastructure and technical consulting is a plus.
Tech Stack
Ambassador, Ansible, Apache Hadoop, Apache Kafka, AWS, Azure, Chef, Docker, Go, Google Cloud Platform, Kubernetes, Linux, MySQL, Puppet, Python, Terraform