6 days ago
Beijing, China
Mid Level / Senior
Responsibilities
- Own reliability and operational stability of BJAK’s production systems.
- Design and improve monitoring, alerting, logging and observability across services.
- Lead incident response, troubleshooting and structured root cause analysis.
- Improve system resilience through redundancy, failover and recovery strategies.
- Work with engineers to design systems that are reliable, scalable and operable in production.
- Improve deployment safety through CI/CD pipelines, release strategies and automation.
- Reduce recurring incidents by identifying root causes and driving long-term fixes.
- Manage and optimize cloud infrastructure supporting business-critical workflows.
- Strengthen operational practices including on-call processes, incident playbooks and SLAs.
- Continuously improve system uptime, performance and operational maturity.
Requirements
- Experience in Site Reliability Engineering, DevOps, platform engineering or infrastructure roles.
- Strong understanding of distributed systems, cloud infrastructure and production operations.
- Experience with monitoring, alerting and observability tools.
- Strong troubleshooting skills for production incidents and system failures.
- Ability to design for reliability, scalability and fault tolerance.
- Experience working with CI/CD pipelines and deployment automation.
- Strong understanding of system performance, capacity planning and risk management.
- Hands-on ownership mindset during incidents and operational issues.
- Calm, structured and disciplined approach to production environments.
- Strong collaboration with engineering teams in fast-paced environments.
Benefits
- Support mission-critical automation at scale.
- Solve real-world reliability and distributed systems challenges.
- Work with experienced engineers across multiple countries.
- Fully remote position, collaborating with Malaysia-based teams.
- Build systems used across Southeast Asian markets.
- Learning budget to support continuous technical growth and certifications.
- Strong autonomy over reliability and operational design.
- Focus on stability, observability and engineering excellence.
- Attractive salary package based on experience and impact.
