Senior MLOps Engineer

Job Overview

  • Date posted
    May 26, 2026
  • Location
  • Expiration date
    September 18, 2026

Job title

411_2952016

Brief description of the vacancy

We are looking for a Senior MLOps Engineer with proven experience in deploying and managing large-scale ML infrastructure for LLMs, TTS, STT, Stable Diffusion, and other GPU-intensive models in production. You will lead the design and operation of cost-efficient, high-availability, and high-performance serving stacks in a Kubernetes-based AWS environment.

About the company

Company Identity AI Labs

A fast-growing, well-funded AI startup in the UAE. The company's mission is to redefine how humans interact with AI through emotionally intelligent, relationship-focused technology.

Responsibilities

  • You will architect, deploy, and maintain scalable ML infrastructure on AWS EKS using Terraform and Helm.
  • You will own end-to-end model deployment pipelines for LLMs, diffusion models (LDM/Stable Diffusion), and other generative AI models requiring high GPU throughput.
  • You will design cost-effective, auto-scaling serving systems using tools like Triton Inference Server, vLLM, Ray Serve, or similar model-serving frameworks.
  • You will build and maintain CI/CD pipelines integrating the ML model lifecycle (training → validation → packaging → deployment).
  • You will optimize GPU resource utilization and implement job orchestration with frameworks like KServe, Kubeflow, or custom workloads on EKS.
  • You will deploy and manage FluxCD (or ArgoCD) for GitOps-based deployment and environment promotion.
  • You will implement robust monitoring, logging, and alerting for model health and infrastructure performance (Prometheus, Grafana, Loki).
  • You will collaborate closely with ML Engineers and Software Engineers to ensure smooth integration, observability, and feedback loops.

Requirements

  • 2–3 years of experience with model serving frameworks such as Triton, vLLM, Ray Serve, TorchServe, or similar.
  • 2–3 years of experience deploying and optimizing LLMs and LDMs (e.g., Stable Diffusion) under high load with GPU-aware scaling.
  • 3–4 years of experience with Kubernetes (EKS) and infrastructure-as-code (Terraform, Helm).
  • 4–5 years of hands-on software engineering experience in Python, with production-grade experience in the ML model lifecycle.
  • Nice to have: familiarity with Go or Rust for backend or performance-critical systems.

Working conditions

Full-time position at the Dubai office, with official employment and a full relocation package.

2026-05-25 15:53:43