Tokyo, Japan
Senior / Staff+
Responsibilities
- Architect, implement, and scale batch and streaming pipelines for ML training and evaluation.
- Design and operate robust ML job execution frameworks for training, inference, and post-processing.
- Build and maintain internal API servers and developer tools for orchestrating ML jobs on Kubernetes.
- Design and monitor data infrastructure using ClickHouse and PostgreSQL.
- Ensure high availability and observability through monitoring tools like Prometheus and Grafana.
- Collaborate with data scientists, product managers, and engineers to deliver efficient ML platform capabilities.
- Promote the use of LLM-based tools to accelerate development and debugging.
- Mentor junior engineers and help evolve team engineering culture.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field; Master’s preferred.
- 4+ years of hands-on experience in data systems, machine learning infrastructure, or platform engineering.
- Strong coding proficiency in Python and/or Java, with experience in large-scale production systems.
- Practical experience with Spark, Flink, Kubernetes, and infrastructure-as-code tools like Terraform and Helm.
- Experience managing high-throughput data infrastructure using ClickHouse, PostgreSQL, or similar systems.
- Deep understanding of ML pipelines and distributed job execution in production environments.
- Proven ability to apply LLM-based tools to boost engineering productivity.
- Strong ownership, architectural thinking, and ability to lead cross-functional platform projects.
Tech Stack
Apache Flink, Apache Spark, Argo CD, ClickHouse, Grafana, Helm, Java, Kubernetes, PostgreSQL, Prometheus, Python, Terraform
Categories
AI & ML, Data Engineering