About Homebrew
Homebrew is an AI R&D Lab. We train our own models and are the creators and maintainers of popular open-source AI tools:
- Jan: Desktop Copilot (>1 million downloads)
- Cortex: Local, open-source alternative to OpenAI Platform
- Menlo: GPU Training Cluster
We are a fully remote company. In the long term, our objective is to train useful, safe AI that helps improve humanity.
Job Description
Homebrew is looking for an Infrastructure Engineer to help run our GPU Training Cluster and internal GPU Cloud. Please note that this is an on-premise role, as we build our own infrastructure.
Responsibilities
- Design and maintain the organization's infrastructure, including compute and storage nodes, high-bandwidth networking, and security and monitoring systems
- Design and maintain software for infrastructure management and orchestration (e.g. OpenStack, Kubeflow, Proxmox)
- Participate in incident response and resolution to ensure high availability and performance
- Develop and maintain solutions for day-to-day operational administration, system/data backup, disaster recovery, and security/performance monitoring
- Collaborate with the Engineering team to implement DevSecOps practices (e.g. Infrastructure as Code, CI/CD)
Requirements
- Familiarity with on-premise infrastructure (e.g. racks with power, storage, compute, and network nodes)
- Ability to perform basic to intermediate hardware troubleshooting, servicing, and repairs
- [Plus] Experience with Slurm, Kubeflow or alternative cluster orchestration tools
- [Plus] Experience with OpenStack, VMware, Proxmox or alternative cloud orchestrator tools
- [Plus] Experience with designing GPU Clusters or HPC systems (inter-cluster networking)
- [Plus] Familiarity with software-defined storage technologies (Ceph, ZFS, NFS, etc.)
Benefits
- We offer “all-in” pay; you cover your own insurance and medical costs from that amount.
- 14 days of leave (and unlimited sick days)
- Annual equipment budget (once the 2-month probation has been completed)