Experience: 5-10 years

Role Brief:

MLOps Engineer

We are seeking a skilled and experienced MLOps Engineer to join our team and drive the operationalization of machine learning (ML) and large language model (LLM) pipelines at scale. The ideal candidate will automate, deploy, monitor, and maintain AI/ML solutions, transforming prototypes into robust, customer-ready systems while mitigating risks such as production pipeline failures.

This role requires strong expertise in cloud infrastructure, CI/CD pipelines, model orchestration, and close collaboration with cross-functional teams to ensure seamless deployment across diverse customer environments.

Key Responsibilities

Design and implement scalable infrastructure for ML/LLM pipelines using AWS services such as AWS Batch, AWS Fargate, Amazon Bedrock, and related tools
Manage auto-scaling mechanisms to handle fluctuating workloads and ensure high availability of REST APIs
Automate CI/CD pipelines and AWS Lambda functions for model testing, deployment, and updates to reduce manual errors and improve efficiency
Build and manage end-to-end ML workflows using Amazon SageMaker Pipelines and optimize workflows using AWS Step Functions
Perform drift analysis (data drift, concept drift, and label drift) and implement mitigation strategies such as automated alerts, model retraining triggers, and performance audits
Set up reproducible workflows for data preparation, model training, and deployment
Provision and optimize cloud resources (GPUs, memory, compute) to support large-scale models, including RAG-based systems
Automate model retraining workflows to ensure models stay updated as data evolves
Collaborate closely with Data Scientists, ML Engineers, and DevOps teams to integrate models into production environments
Implement monitoring and model observability frameworks to track model performance and detect degradation or drift in real time
Build monitoring dashboards and real-time alerting for pipeline failures and performance issues

Required Skills & Qualifications

Education: BE / BTech / ME / MTech (Any Engineering discipline)
Experience: Minimum 4 years of hands-on experience with AWS services, including AWS Lambda, Amazon Bedrock, AWS Batch with Fargate, Amazon RDS (PostgreSQL), DynamoDB, SQS, CloudWatch, API Gateway, and Amazon SageMaker
Strong hands-on experience in drift detection and mitigation for production ML systems
Working knowledge of ML frameworks such as PyTorch and TensorFlow to understand model deployment requirements
Experience building and deploying REST APIs using FastAPI or Flask
Familiarity with model observability and monitoring tools such as Evidently, NannyML, Phoenix, and Grafana
Experience with retraining and orchestration tools like MLflow, Kubeflow, or Airflow

Good to Have

AWS Certified Machine Learning – Specialty certification
