Overview
The Senior MLOps Engineer will design, implement, and manage the end-to-end machine learning lifecycle on GPU-accelerated platforms. This role focuses on scalable AI infrastructure, model deployment, monitoring, and automation on top of NVIDIA GPU clusters, ensuring optimized performance for training, fine-tuning, and inference workloads. The engineer will work closely with infrastructure, product, and AI / ML teams to deliver reliable GPU-as-a-Service (GPUaaS) for enterprise and sovereign AI use cases.
Key Responsibilities
Model Lifecycle Management
- Build pipelines for training, fine-tuning, and inference using NVIDIA GPUs (H100, B200, GB200, GB300).
- Manage model packaging and deployment on Kubernetes / Slurm clusters with GPU scheduling (see the sketch after this list).
- Support large-scale workloads with NVIDIA Base Command Manager (BCM) and multi-tenant orchestration.
- Automate CI / CD pipelines for ML models using GitOps and modern DevOps tools.
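As an illustration of the GPU-scheduled deployments referenced above, the following is a minimal sketch that submits a single-GPU training Job with the official Kubernetes Python client; the image, namespace, job name, and GPU count are placeholders rather than details of this environment.

```python
# Minimal sketch: submit a single-GPU training Job via the Kubernetes Python client.
# Image, namespace, command, and GPU count are illustrative placeholders.
from kubernetes import client, config


def submit_training_job(namespace: str = "ml-training") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

    container = client.V1Container(
        name="trainer",
        image="registry.example.com/llm-finetune:latest",  # placeholder image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}  # request one GPU from the device plugin
        ),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="finetune-example"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(restart_policy="Never", containers=[container])
            ),
            backoff_limit=0,
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)


if __name__ == "__main__":
    submit_training_job()
```

In practice the same manifest would typically be templated with Helm and rolled out through a GitOps controller such as ArgoCD rather than submitted imperatively.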
Infrastructure Integration
- Work with NCP RA-based GPU clusters to integrate MLOps pipelines with compute / storage / network fabrics.
- Optimize model training performance using CUDA, cuDNN, and NCCL.
- Deploy and scale inference services with TensorRT, Triton Inference Server, and NVIDIA AI Enterprise frameworks (see the sketch below).
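By way of illustration, the following is a minimal sketch of querying a Triton Inference Server endpoint over HTTP with the tritonclient library; the server URL, model name, and tensor names are assumptions about an arbitrary deployed model, not specifics of this platform.

```python
# Minimal sketch: query a deployed Triton Inference Server over HTTP.
# URL, model name, and input/output tensor names are illustrative assumptions.
import numpy as np
import tritonclient.http as httpclient


def run_inference(url: str = "localhost:8000") -> np.ndarray:
    triton = httpclient.InferenceServerClient(url=url)

    # Build a dummy batch matching a hypothetical image-classification signature.
    batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
    infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)

    response = triton.infer(
        model_name="resnet50_trt",  # assumed entry in the model repository
        inputs=[infer_input],
        outputs=[httpclient.InferRequestedOutput("output__0")],
    )
    return response.as_numpy("output__0")


if __name__ == "__main__":
    print(run_inference().shape)
```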
Monitoring & Optimization
- Implement observability for GPU utilization, model performance, and latency metrics.
- Develop dashboards using Prometheus, Grafana, ELK, and NVIDIA DCGM (see the sketch below).
- Optimize cost and throughput by improving GPU scheduling, auto-scaling, and multi-tenancy policies.
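As a hedged example of the observability work described above, the sketch below pulls an average GPU-utilization figure from Prometheus, assuming the NVIDIA dcgm-exporter is scraped and exposes the DCGM_FI_DEV_GPU_UTIL gauge; the Prometheus URL is a placeholder for the environment's actual endpoint.

```python
# Minimal sketch: query cluster-wide GPU utilization from Prometheus, assuming
# dcgm-exporter is scraped and exposes the DCGM_FI_DEV_GPU_UTIL gauge.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed endpoint


def average_gpu_utilization() -> float:
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "avg(DCGM_FI_DEV_GPU_UTIL)"},
        timeout=10,
    )
    response.raise_for_status()
    result = response.json()["data"]["result"]
    # An empty result means no GPU metrics were scraped.
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    print(f"Average GPU utilization: {average_gpu_utilization():.1f}%")
```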
Security & Compliance
- Implement secure data pipelines and enforce model governance frameworks.
- Align deployments with ISO 27001, SOC 2, GDPR, and regional compliance requirements.
Collaboration & Operations
- Partner with Data Scientists and ML Engineers to productize models on GPUaaS.
- Work with Platform Engineers to ensure seamless provisioning and Day-2 operations.
- Create SOPs and automation playbooks for AI workload deployment.
- Provide L3 / L4 support for production AI pipelines.
Required Skills & Experience
- Strong background in MLOps and ML pipeline automation.
- Hands-on experience with NVIDIA GPU-accelerated frameworks (CUDA, TensorRT, Triton).
- Proficiency in Kubernetes, Slurm, Helm, ArgoCD, Kubeflow, MLflow, or Vertex AI.
- Strong coding skills in Python, Bash, and Go (automation + ML pipelines).
- Familiarity with observability tools (Prometheus, Grafana, ELK, NVIDIA DCGM).
- Experience with containerization (Docker, Singularity) for ML workloads.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Other