Location
taguig
Job Type
Full-time
Posted
June 03, 2026
Job Description
Responsibilities
- Provide 24/7 monitoring support for production systems and dashboards (Zabbix, Prometheus, Grafana, Huawei Cloud Monitoring).
- Continuously monitor key business metrics (QPS, error rate, response time, success rate) and infrastructure health indicators (CPU, memory, disk, and network usage).
- Perform routine inspections of online systems, data centre environments, and network connectivity.
- Maintain shift logs and inspection reports according to operational standards.
- Respond immediately to alerts received via phone, SMS, Microsoft Teams, etc., and perform initial troubleshooting based on SOPs.
- Identify alert severity levels and execute corresponding response procedures.
- Independently resolve common issues such as restarting services, cleaning up disk space, resource scaling, and traffic switching.
- Escalate unresolved issues to L2 support or development teams following the escalati...