10 days ago
New York, NY, USA
Mid Level / Senior
Base Salary
$145k - $165k/yr
Responsibilities
- Define and own the long-term ML infrastructure roadmap.
- Establish best practices for model lifecycle management and deployment standards.
- Identify infrastructure gaps and design scalable solutions.
- Design, build, and maintain production-grade model deployment systems.
- Automate end-to-end ML lifecycle workflows.
- Implement robust monitoring systems for model performance and infrastructure health.
- Manage workloads across AWS and GCP environments.
- Develop and maintain infrastructure-as-code for secure cloud environments.
- Implement and optimize CI/CD workflows for ML automation.
- Collaborate with cross-functional teams to support ML workflows.
- Stay current on emerging ML Ops practices and tools.
Requirements
- 4+ years of experience in ML Ops, ML infrastructure, or backend engineering.
- Experience in cloud-native environments (AWS and/or GCP).
- Proven track record in designing and implementing CI/CD pipelines for ML systems.
- Strong experience with Amazon SageMaker, Docker, and Flask-based APIs.
- Hands-on experience with ML lifecycle tooling like MLflow or SageMaker Studio.
- Experience managing container orchestration platforms like Kubernetes.
- Strong programming experience in Python; additional languages like Go or Java are a plus.
- Experience with infrastructure-as-code tools such as Terraform or CloudFormation.
- Familiarity with observability tools like CloudWatch or Prometheus.
- Experience managing GPU-based workloads.
- Familiarity with data infrastructure tools like BigQuery.
- Bonus: Experience with LLMs, generative AI pipelines, or ML governance frameworks.