Talent.com
Senior MLOps Engineer

Edgnex • Dubai, Dubai, United Arab Emirates
5 days ago
Job description

Overview

The Senior MLOps Engineer will design, implement, and manage the end-to-end machine learning lifecycle on GPU-accelerated platforms. The role covers scalable AI infrastructure, model deployment, monitoring, and automation on NVIDIA GPU clusters, with the goal of optimized performance for training, fine-tuning, and inference workloads. The engineer will work closely with infrastructure, product, and AI/ML teams to deliver reliable GPU-as-a-Service (GPUaaS) for enterprise and sovereign AI use cases.

Key Responsibilities

Model Lifecycle Management

  • Build pipelines for training, fine-tuning, and inference using NVIDIA GPUs (H100, B200, GB200, GB300).
  • Manage model packaging and deployment on Kubernetes/Slurm clusters with GPU scheduling.
  • Support large-scale workloads with NVIDIA Base Command Manager (BCM) and multi-tenant orchestration.
  • Automate CI/CD pipelines for ML models using GitOps and modern DevOps tools.

Infrastructure Integration

  • Work with NCP RA-based GPU clusters to integrate MLOps pipelines with compute, storage, and network fabrics.
  • Optimize model training performance using CUDA, cuDNN, NCCL, and Triton Inference Server.
  • Deploy and scale inference services with TensorRT, Triton, and NVIDIA AI Enterprise frameworks.

Monitoring & Optimization

  • Implement observability for GPU utilization, model performance, and latency metrics.
  • Develop dashboards using Prometheus, Grafana, ELK, and NVIDIA DCGM.
  • Optimize cost and throughput by improving GPU scheduling, auto-scaling, and multi-tenancy policies.

Security & Compliance

  • Implement secure data pipelines and enforce model governance frameworks.
  • Align deployments with ISO 27001, SOC 2, GDPR, and regional compliance requirements.

Collaboration & Operations

  • Partner with Data Scientists and ML Engineers to productize models on GPUaaS.
  • Work with Platform Engineers to ensure seamless provisioning and Day-2 operations.
  • Create SOPs and automation playbooks for AI workload deployment.
  • Provide L3/L4 support for production AI pipelines.

Required Skills & Experience

  • Strong background in MLOps and ML pipeline automation.
  • Hands-on experience with NVIDIA GPU-accelerated frameworks (CUDA, TensorRT, Triton).
  • Proficiency in Kubernetes, Slurm, Helm, ArgoCD, Kubeflow, MLflow, or Vertex AI.
  • Strong coding skills in Python, Bash, and Go for automation and ML pipelines.
  • Familiarity with observability tools (Prometheus, Grafana, ELK, NVIDIA DCGM).
  • Experience with containerization (Docker, Singularity) for ML workloads.

Seniority level

  • Mid-Senior level

Employment type

  • Full-time

Job function

  • Other