Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Machine Learning Systems Engineer in the European Union.
We are seeking a talented Machine Learning Systems Engineer to join a remote-first, globally distributed team working on cutting-edge AI infrastructure. In this role, you will contribute to the development of large-scale language model systems, focusing on high-performance training, inference, and self-improving AI agents. You will work at the intersection of machine learning research, distributed systems, and high-performance computing, building tools and frameworks that enable researchers and organizations worldwide to deploy advanced AI solutions. This role offers the chance to work on technically demanding, open-source projects while collaborating with a passionate international team. Your work will have a direct impact on the future of scalable AI systems.
Accountabilities:
- Contribute to the development and optimization of large-scale language model frameworks
- Implement high-performance distributed training algorithms using frameworks such as Megatron-LM, DeepSpeed, and vLLM
- Develop and optimize inference engines and tools for model deployment, fine-tuning, and AI agent self-improvement
- Integrate diverse machine learning ecosystems including HuggingFace and other LLM tools
- Optimize performance across multi-GPU, multi-node architectures, leveraging HPC and CUDA/ROCm programming
- Collaborate with the open-source community to enhance the codebase, implement features, and resolve issues
- Research and implement advanced techniques for self-improving AI agents and high-efficiency ML pipelines
Requirements:
- 3+ years of experience in machine learning engineering or research
- Proficiency in Python and C/C++, with strong systems programming skills
- Deep understanding of high-performance computing concepts, including MPI, BSP, and distributed multi-GPU training
- Solid experience with transformer architectures, gradient descent, backpropagation, and deep learning training
- Familiarity with distributed training strategies: data parallelism, model parallelism, pipeline parallelism
- Experience with containerization (Docker, Kubernetes) and cluster orchestration
- Demonstrated experience with ML frameworks like vLLM, Megatron-LM, HuggingFace, or similar
- Commitment to open-source development and community collaboration
- Excellent problem-solving, debugging, and performance optimization skills
- Bonus: an advanced degree (MS/PhD); experience with SLURM, mixed-precision training, or MLOps; or prior contributions to major open-source ML projects
Benefits:
- Competitive compensation including salary and equity participation
- Fully remote, work-from-anywhere flexibility
- Comprehensive global benefits including mental health support
- Open PTO policy and flexible working hours
- Paid parental leave and support for personal well-being
- Opportunities for continuous learning and professional development
- Regular team offsites, virtual events, and global gatherings to foster team collaboration
- Inclusive, transparent, and supportive culture prioritizing growth and knowledge-sharing
When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
- Our AI thoroughly analyzes your CV and LinkedIn profile, evaluating your skills, experience, and achievements
- It compares your profile against the job's core requirements and past success factors to calculate a match score
- The top 3 candidates with the highest match are automatically shortlisted
- When necessary, our human team may perform additional review to ensure no strong candidate is overlooked
Thank you for your interest!