Bengaluru, India
Senior / Staff+
H1B Sponsor
Responsibilities
- Lead the design and operation of scalable, production-grade cloud infrastructure for ML workloads across AWS and GCP.
- Architect and improve CI/CD systems for ML models and platform services.
- Own and evolve low-latency infrastructure for real-time model inference.
- Define and enforce observability standards for ML systems.
- Participate in the on-call rotation for incident response and root-cause analysis.
- Partner with data scientists and ML engineers to enhance platform usability.
- Champion operational excellence through automation and continuous improvement.
Requirements
- BS or MS in Computer Science, Engineering, or a related quantitative field.
- 8+ years of experience in DevOps, SRE, or ML infrastructure.
- Strong programming skills in Python and/or Scala or Java.
- Deep experience with Kubernetes and container orchestration on GCP and/or AWS.
- Expertise with NoSQL or low-latency data stores.
- Hands-on experience with data and orchestration technologies.
- Experience building and maintaining CI/CD systems using tools like Jenkins or GitLab Runner.
- Familiarity with feature engineering platforms and model lifecycle tools.
- Strong infrastructure-as-code experience with Terraform.
- Experience with observability platforms such as Prometheus and Grafana.
- Excellent communication and cross-functional collaboration skills.
Benefits
- Global access to mental health and financial wellness support.
- Comprehensive healthcare benefits including medical, dental, and vision.
- Support for taking time off in accordance with local leave policies.
- Retirement options including 401(k)/pension.
Tech Stack
Apache Airflow, Apache Flink, Apache Kafka, Apache Spark, AWS, Datadog, Google Cloud Platform, Grafana, Java, Jenkins, Kubernetes, MLflow, Prometheus, Python, Scala, Terraform