Site Reliability DevOps Engineer
Job Overview
Date posted: March 29, 2026
Location:
Expiry date: --
Job title: 411_3413658
About the Role
We’re looking for a talented Site Reliability Engineer (SRE) to keep our systems running smoothly, reliably, and at scale. Through smart automation, deep observability, and a calm head in a crisis, you’ll help us balance speed, compliance, and stability, working alongside DevOps, Cloud, Quality Engineering, and Product teams to drive continuous improvements in performance, security, and resilience.
What You Will Be Doing
- Define and implement SLIs/SLOs and error budgets for business‑critical digital banking services.
- Build actionable observability (metrics, logs, traces, dashboards, and alerts) using Dynatrace, Prometheus, Grafana, and ELK, while reducing alert fatigue.
- Leverage AI‑driven insights and anomaly detection (Dynatrace Davis AI or equivalent AIOps platform) to proactively predict and resolve reliability issues before impact.
- Lead incident management — from on‑call triage and root‑cause analysis to blameless postmortems with actionable follow‑ups.
- Improve deployment safety with robust rollout/rollback strategies, canary and blue‑green deployments, and production readiness reviews.
- Support and optimize microservices‑based architectures, ensuring service reliability, scalability, and inter‑service resilience.
- Conduct capacity planning, performance tuning, and resilience testing, optimizing for both reliability and cost efficiency.
- Automate operational toil — from runbooks and remediation scripts to proactive health checks and self‑healing workflows.
- Collaborate with DevOps to embed reliability gates and validations into CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD, or Azure DevOps).
- Own and evolve the observability and AIOps stack, driving intelligent automation and predictive alerting capabilities.
- Maintain high‑quality documentation, playbooks, and operational standards across environments.
- Ensure operational compliance and security alignment with internal controls and regulatory standards.
- Analyze system performance, availability, and cost data to continually optimize operations.
- Provide reliability support and escalation guidance for critical production systems during major incidents.
Experience and Qualifications
- 5+ years of experience in SRE or DevOps roles, building and managing large‑scale, high‑availability systems across banking, fintech, e‑commerce, or other data‑intensive digital ecosystems.
- Bachelor’s degree in Computer Science or equivalent technical experience.
- Strong experience with Linux environments and performance troubleshooting.
- Proven expertise in Terraform and Infrastructure as Code (IaC) methodologies.
- Proficiency with Kubernetes and container orchestration in microservices environments.
- Hands‑on experience with AWS (preferred); exposure to Azure or GCP is an advantage.
- Deep knowledge of Dynatrace (AIOps, Davis AI), Prometheus, Grafana, and the ELK stack.
- Experience implementing AI/ML‑driven reliability or automation solutions (AIOps, anomaly detection, predictive alerting).
- Practical understanding of CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD, or Azure DevOps).
- Experience with messaging systems (Kafka, RabbitMQ), in‑memory data stores (Redis), and managed relational databases (Aurora, RDS).
- Strong scripting or programming skills in Python, Bash, or Go.
2026-03-23 09:01:33