Staff SRE, Agentic AI
Netskope
3 months ago
Bengaluru, India
Staff+
H1B Sponsor
Responsibilities
- Collaborate with AI/ML engineers to design and architect AI/ML applications for scale and reliability.
- Develop and deploy a CI/CD pipeline for safe and reproducible experiments.
- Troubleshoot production issues related to AI/ML application code and infrastructure configurations.
- Set up and manage monitoring, logging, and alerting systems for training runs and APIs.
- Ensure consistent availability of training environments across multiple clusters.
- Manage containerization and orchestration systems using Docker and Kubernetes.
- Oversee large Kubernetes clusters with GPU workloads.
- Improve reliability, quality, and time-to-market of software solutions.
- Measure and optimize system performance for continual improvement.
- Provide operational support for large-scale distributed software applications.
Requirements
- 8+ years of professional experience building core infrastructure systems.
- Hands-on experience with core model training principles and frameworks like PyTorch and Hugging Face Transformers.
- Familiarity with LLM development, deployment, and optimization techniques.
- Experience with high-performance, large-scale ML systems and their infrastructure needs.
- Experience with major cloud providers like Google Cloud, AWS, or Azure.
- Proficiency with Infrastructure as Code (IaC) tools like Terraform.
- Strong scripting skills in Python or Bash, and experience with Git and GitHub workflows.
- Expertise in operating orchestration systems like Kubernetes at scale.
- Experience with monitoring tools such as Prometheus and Grafana.
- Proven track record of building and operating scalable, reliable, and secure systems.
- Strong troubleshooting skills for complex systems.
- Proactive in identifying problems and performance bottlenecks.
- Comfortable working with ambiguity in a dynamic environment.
Tech Stack
AWS, Azure, Bash, Docker, Git, Google Cloud, Grafana, Hugging Face Transformers, Kubernetes, Prometheus, Python, PyTorch, Terraform
Categories
AI & ML, DevOps