The Role:
We're looking for a Cloud Infrastructure Engineer who thrives on building and rapidly scaling large GPU compute platforms. You'll be instrumental in developing and managing the foundational infrastructure that powers our AI workloads. Our core infrastructure relies heavily on Python, Kubernetes (K8s), Terraform, and Ansible, but we care more about your ability to learn, adapt, and ship robust solutions than whether you've used these exact tools before.
You are a good fit if this describes you:
- You excel at building and managing distributed compute platforms, especially those involving GPU resources.
- You have deep expertise in backend systems that orchestrate complex workloads efficiently, managing capacity and resource constraints.
- You possess a strong understanding of foundational cloud infrastructure (AWS/GCP/Azure) and Linux provisioning/management tools.
- You know how to design for reliability and scale with minimal operational overhead.
- You learn new technologies rapidly because you're excited by solving hard infrastructure challenges.
- You've scaled infrastructure before and understand the tradeoffs that matter.
- You think most infrastructure moves too slowly and could be far better automated and optimized.
- You're comfortable diving into unfamiliar systems and making them work reliably.
- You are a self-starter who executes quickly, takes ownership, and constantly seeks improvement.
What you'll do:
- Develop and maintain our core Python platform, which handles request routing, AI workload orchestration, GPU server capacity management, observability, and more.
- Develop and maintain our infrastructure layer using Terraform, Ansible, and cloud provider APIs to manage our fleet of GPU workers across cloud and, potentially, bare-metal environments.
- Own and operate the technologies underpinning our platform, which may include K8s, FluxCD, Nomad, Prometheus, Thanos, Grafana, Loki, and distributed networking and storage.
- Architect and implement solutions that directly impact the performance and availability of services for millions of ComfyUI users.
- Work closely with our core engineering team to design and build new infrastructure systems.