Overview
The Senior MLOps Engineer will design, implement, and manage the end-to-end machine learning lifecycle on GPU-accelerated platforms. This role focuses on scalable AI infrastructure, model deployment, monitoring, and automation on top of NVIDIA GPU clusters, ensuring optimized performance for training, fine-tuning, and inference workloads. The engineer will work closely with infrastructure, product, and AI / ML teams to deliver reliable GPU-as-a-Service (GPUaaS) for enterprise and sovereign AI use cases.
Key Responsibilities
Model Lifecycle Management
- Build pipelines for training, fine-tuning, and inference using NVIDIA GPUs (H100, B200, GB200, GB300).
- Manage model packaging and deployment on Kubernetes / Slurm clusters with GPU scheduling (see the sketch after this list).
- Support large-scale workloads with NVIDIA Base Command Manager (BCM) and multi-tenant orchestration.
- Automate CI / CD pipelines for ML models using GitOps and modern DevOps tools.
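As an illustration of the GPU-scheduled deployments referenced above, the following is a minimal sketch that submits a single-GPU training Job with the official Kubernetes Python client; the image, namespace, job name, and GPU count are placeholders rather than details of this environment.

```python
# Minimal sketch: submit a single-GPU training Job via the Kubernetes Python client.
# Image, namespace, command, and GPU count are illustrative placeholders.
from kubernetes import client, config


def submit_training_job(namespace: str = "ml-training") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

    container = client.V1Container(
        name="trainer",
        image="registry.example.com/llm-finetune:latest",  # placeholder image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}  # request one GPU from the device plugin
        ),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="finetune-example"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(restart_policy="Never", containers=[container])
            ),
            backoff_limit=0,
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)


if __name__ == "__main__":
    submit_training_job()
```

In practice the same manifest would typically be templated with Helm and rolled out through a GitOps controller such as ArgoCD rather than submitted imperatively.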
Infrastructure Integration
- Work with NCP RA-based GPU clusters to integrate MLOps pipelines with compute / storage / network fabrics.
- Optimize model training performance using CUDA, cuDNN, and NCCL.
- Deploy and scale inference services with TensorRT, Triton Inference Server, and NVIDIA AI Enterprise frameworks (see the sketch below).
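By way of illustration, the following is a minimal sketch of querying a Triton Inference Server endpoint over HTTP with the tritonclient library; the server URL, model name, and tensor names are assumptions about an arbitrary deployed model, not specifics of this platform.

```python
# Minimal sketch: query a deployed Triton Inference Server over HTTP.
# URL, model name, and input/output tensor names are illustrative assumptions.
import numpy as np
import tritonclient.http as httpclient


def run_inference(url: str = "localhost:8000") -> np.ndarray:
    triton = httpclient.InferenceServerClient(url=url)

    # Build a dummy batch matching a hypothetical image-classification signature.
    batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
    infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)

    response = triton.infer(
        model_name="resnet50_trt",  # assumed entry in the model repository
        inputs=[infer_input],
        outputs=[httpclient.InferRequestedOutput("output__0")],
    )
    return response.as_numpy("output__0")


if __name__ == "__main__":
    print(run_inference().shape)
```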
Monitoring & Optimization
- Implement observability for GPU utilization, model performance, and latency metrics.
- Develop dashboards using Prometheus, Grafana, ELK, and NVIDIA DCGM (see the sketch below).
- Optimize cost and throughput by improving GPU scheduling, auto-scaling, and multi-tenancy policies.
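As a hedged example of the observability work described above, the sketch below pulls an average GPU-utilization figure from Prometheus, assuming the NVIDIA dcgm-exporter is scraped and exposes the DCGM_FI_DEV_GPU_UTIL gauge; the Prometheus URL is a placeholder for the environment's actual endpoint.

```python
# Minimal sketch: query cluster-wide GPU utilization from Prometheus, assuming
# dcgm-exporter is scraped and exposes the DCGM_FI_DEV_GPU_UTIL gauge.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed endpoint


def average_gpu_utilization() -> float:
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "avg(DCGM_FI_DEV_GPU_UTIL)"},
        timeout=10,
    )
    response.raise_for_status()
    result = response.json()["data"]["result"]
    # An empty result means no GPU metrics were scraped.
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    print(f"Average GPU utilization: {average_gpu_utilization():.1f}%")
```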
Security & Compliance
- Implement secure data pipelines and enforce model governance frameworks.
- Align deployments with ISO 27001, SOC 2, GDPR, and regional compliance requirements.
Collaboration & Operations
- Partner with Data Scientists and ML Engineers to productize models on GPUaaS.
- Work with Platform Engineers to ensure seamless provisioning and Day-2 operations.
- Create SOPs and automation playbooks for AI workload deployment.
- Provide L3 / L4 support for production AI pipelines.
Required Skills & Experience
- Strong background in MLOps and ML pipeline automation.
- Hands-on experience with NVIDIA GPU-accelerated frameworks (CUDA, TensorRT, Triton).
- Proficiency in Kubernetes, Slurm, Helm, ArgoCD, Kubeflow, MLflow, or Vertex AI.
- Strong coding skills in Python, Bash, and Go (automation + ML pipelines).
- Familiarity with observability tools (Prometheus, Grafana, ELK, NVIDIA DCGM).
- Experience with containerization (Docker, Singularity) for ML workloads.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Other