Netskope

Staff SRE, Agentic AI

3 months ago
Bengaluru, India
Staff+
H1B Sponsor

Responsibilities

  • Collaborate with AI/ML engineers to design and architect AI/ML applications for scale and reliability.
  • Develop and deploy a CI/CD pipeline for safe and reproducible experiments.
  • Troubleshoot production issues related to AI/ML application code and infrastructure configurations.
  • Set up and manage monitoring, logging, and alerting systems for training runs and APIs.
  • Ensure consistent availability of training environments across multiple clusters.
  • Manage containerization and orchestration systems using Docker and Kubernetes.
  • Oversee large Kubernetes clusters with GPU workloads.
  • Improve reliability, quality, and time-to-market of software solutions.
  • Measure and optimize system performance for continual improvement.
  • Provide operational support for large-scale distributed software applications.

Requirements

  • 8+ years of professional experience building core infrastructure systems.
  • Hands-on experience with core model training principles and frameworks like PyTorch and Hugging Face Transformers.
  • Familiarity with LLM development, deployment, and optimization techniques.
  • Experience with high-performance, large-scale ML systems and their infrastructure needs.
  • Experience with major cloud providers like Google Cloud, AWS, or Azure.
  • Proficiency with Infrastructure as Code (IaC) tools like Terraform.
  • Strong scripting skills in Python or Bash, and experience with Git and GitHub workflows.
  • Expertise in operating orchestration systems like Kubernetes at scale.
  • Experience with monitoring tools such as Prometheus and Grafana.
  • Proven track record of building and operating scalable, reliable, and secure systems.
  • Strong troubleshooting skills for complex systems.
  • Proactive in identifying problems and performance bottlenecks.
  • Comfortable working in a dynamic environment with ambiguity.

Tech Stack

AWS, Azure, Bash, Docker, Git, Google Cloud, Grafana, Hugging Face Transformers, Kubernetes, Prometheus, Python, PyTorch, Terraform

Categories

AI & ML, DevOps