about 2 months ago
Singapore, Singapore
Mid Level / Senior
H1B Sponsor
Responsibilities
- Design and scale distributed data pipelines for preprocessing and dataset generation.
- Own workflow orchestration, job scheduling, monitoring, and failure recovery for data processing jobs.
- Implement and maintain containerized pipeline infrastructure using Kubernetes.
- Optimize cloud-based data storage and movement for cost and operational efficiency.
- Define and implement best practices for dataset storage layout and versioning.
- Design curation pipelines for selecting and filtering video and image content.
- Build and improve VLM-based captioning and metadata generation workflows.
- Develop quality and aesthetic scoring models for data selection.
- Build tooling to support deduplication workflows at scale.
- Analyze dataset composition and iterate on curation logic.
Requirements
- Strong hands-on experience with large-scale data systems and pipelines for machine learning.
- Experience with distributed data processing frameworks like PySpark or Ray.
- Familiarity with containerization and orchestration tools such as Docker and Kubernetes.
- Experience with cloud-based data storage and compute (AWS, GCP, Azure).
- Experience with VLM-based captioning pipelines and quality scoring models.
- Familiarity with CLIP-based filtering and semantic data selection techniques.
- Familiarity with video processing tools like FFmpeg and OpenCV.
- Proficiency in Python.
- Strong problem-solving, communication, and documentation skills.
Benefits
- Competitive salary and generous company equity.
- Personal time off and paid holidays.
- Health insurance.
- Global travel insurance for international travel.
- Monthly spending stipend of $500.
- All necessary equipment for your home office.
