To design and optimize infrastructure for GenAI and LLM workloads, the full-time remote Architect - Platform Engineer will implement scalable solutions, perform GPU profiling, and manage compute-intensive jobs while collaborating with cross-functional teams to deploy cutting-edge AI applications.

Key Responsibilities
- Design and implement scalable infrastructure for LLM and GenAI workloads across multi-GPU environments
- Perform GPU profiling, benchmarking, and performance optimization for distributed training workloads
- Manage and schedule compute-intensive jobs using Slurm-based clusters and OpenShift/Kubernetes environments

Required Qualifications
- Strong experience with Slurm and distributed training environments
- Hands-on expertise with Red Hat OpenShift and/or Kubernetes
- Deep knowledge of the NVIDIA GPU ecosystem (CUDA, cuDNN, NCCL, Triton)
- Experience deploying GenAI workloads (LLM fine-tuning, RAG pipelines)
- Familiarity with Infrastructure-as-Code tools (Terraform, Ansible)