Location
Singapore
Job Type
Full-time
Posted
June 23, 2026

Job Description

Site Reliability Engineer - Machine Learning Systems (Singapore)

Job Code: A A

Responsibilities
  • Ensure our ML systems operate efficiently for large model deployment, training, evaluation, and inference.
  • Maintain stability of offline tasks/services across multi‑data center, multi‑region, and multi‑cloud scenarios.
  • Manage resource planning, cost, and budget, including computing and storage resources.
  • Implement global disaster recovery and cluster machine governance, and improve business service stability, resource utilization, and operational efficiency.
  • Build software tools, products, and systems to monitor and manage ML infrastructure and services efficiently.
  • Participate in the global on‑call roster that supports systems and business services.
Minimum Qualifications
  • Bachelor’s degree or above in Computer Science, Computer Engineering, or related fields.
  • Stro...

Ready to Apply?

Submit your application for Site Reliability Engineer - Machine Learning Systems (Singapore) Technology - Backend Singapore[...] at ByteDance
