3 months ago
Remote, United Kingdom
Senior
Responsibilities
- Design and maintain scalable cloud environments using Infrastructure as Code (IaC) with Terraform.
- Manage GPU/TPU resource allocation for training and fine-tuning.
- Build internal services and CLI tools to streamline the developer experience.
- Design CI/CD and training pipelines using tools like GitHub Actions and MLflow.
- Develop reusable patterns for model serving and manage service deployments to Kubernetes.
- Optimize vector databases and embedding pipelines for RAG-based systems.
- Implement techniques to reduce latency and increase throughput for model inference.
- Solve scaling bottlenecks for serverless or containerized model deployments.
- Optimize GPU utilization and cloud spending without compromising performance.
- Define and create tooling for AI agent deployment that non-technical users can also use.
Requirements
- 5+ years of experience with cloud infrastructure and infrastructure as code.
- Previous experience with the ML and LLM lifecycle, including training and hosting.
- Experience working closely with researchers and data scientists.
- Strong understanding of ML fundamentals and the modern GenAI stack.
Benefits
- Competitive salary and benefits.
- Remote working opportunities.
- Access to a unique human data platform for groundbreaking research.