Design, deploy, and operate production ML infrastructure across Dev, QA, and Prod environments. Manage ML deployment pipelines and runtime operations in AWS SageMaker. Implement monitoring, observability, and governance for large-scale multimodal AI workloads.
Job Description
About the Opportunity
Our client, a globally recognized media and information organization, is building out the operational foundation for large-scale production machine learning systems supporting enterprise intelligence products.
This role focuses on deploying, operating, scaling, and governing ML infrastructure and inference services across multimodal AI workloads involving text, image, and video processing. The environment is highly technical and collaborative, with a strong emphasis on production reliability, scalability, observability, and AWS-based ML infrastructure.
What You’ll Be Doing
- Design, deploy, and operate production ML infrastructure across Dev, QA, and Prod environments
- Manage ML deployment pipelines and runtime operations in AWS SageMaker
- Configure and optimize GPU/CPU infrastructure for large-scale inference workloads
- Implement monitoring, alerting, drift detection, and observability for ML systems
- Build deployment governance processes including rollout, rollback, and recovery strategies
- Support high-throughput ML workloads across text, image, and video pipelines
- Optimize infrastructure scalability, cost efficiency, and operational reliability
- Partner with ML Engineers and Data Scientists to operationalize new models and workflows
- Implement A/B testing and controlled rollout strategies for production ML systems
Required Qualifications
- Hands-on experience deploying and operating ML systems in production
- Strong AWS SageMaker experience, including:
  - Pipelines
  - Endpoints
  - Monitoring
  - Multi-environment deployments
- Experience with containerized ML deployment and orchestration
- Experience operating PyTorch and TensorFlow inference systems
- Strong understanding of autoscaling, infrastructure optimization, and runtime reliability
- Experience implementing monitoring and observability frameworks for ML systems
- Experience supporting distributed ML workloads in cloud environments
Strongly Preferred
- Experience supporting NLP and computer vision ML systems
- Familiarity with semantic/vector search infrastructure
- Experience with ranking/reranking systems
- Familiarity with ANN/vector indexing approaches
- Experience supporting large-scale text, image, and video processing pipelines
- Experience optimizing GPU-based infrastructure
What This Role Is
- Production MLOps Engineering
- ML Infrastructure & Deployment
- Runtime Reliability & Scalability
- AWS ML Operations
- Monitoring & Operational Governance
What This Role Is Not
- Pure DevOps
- Model Architecture Design
- Data Science Ownership
- Research ML Engineering
Additional Details
- Fully remote
- Preference for East Coast collaboration hours
Applicants must be legally authorized to work in the United States and must not require employer sponsorship now or in the future.