What to Expect
As a Software Engineer within the Supercomputing AI Infrastructure team, you will work on scaling and optimizing the training compute clusters at the core of Robotaxi and Optimus development.
Tesla Supercomputing maintains, on premises, some of the largest supercomputers in the world. We're currently scaling 100K+ GPU clusters, which are central to developing our autonomy capabilities. Robustly managing such a large system of AI training and HPC clusters requires software design throughout the stack: from the operating system to workload scheduling and, ultimately, the training loop.
We are building and improving the workload scheduling systems that govern how hundreds of thousands of GPUs are allocated across training jobs, experiments, and data pipelines. In this role, you will own the Slurm-based scheduler at the heart of our AI HPC compute infrastructure, ensuring that cluster resources are utilized efficiently and that engineers can iterate quickly without friction.
What You'll Do
- Own and evolve the Slurm scheduler configuration, plugins, and policies that govern job scheduling across our GPU supercomputers
- Design and implement scheduling policies that maximize cluster utilization, minimize job queue times, and enforce fair-share allocation across teams and priorities
- Build tooling and automation around job submission, preemption, backfill, and resource reservation to support the dynamic needs of AI engineering
- Develop monitoring and observability infrastructure to provide real-time visibility into cluster utilization, job throughput, and scheduling efficiency
- Debug and root-cause scheduling failures and resource contention issues across thousands of nodes, and implement fixes to prevent recurrence
- Coordinate with the operations team managing the training cluster to maintain high availability and job throughput
- Work closely with the ML team to understand workload patterns and evolving resource requirements
What You'll Bring
- Members of the Supercomputing AI Infrastructure team are expected to be adaptable to the dynamic requirements of AI software engineering and capable of contributing across all parts of the AI and HPC software stack
- Deep hands-on experience with Slurm (configuration, plugins, scheduling algorithms, accounting) or comparable cluster/workload scheduling systems (e.g., PBS Pro/OpenPBS, LSF, Grid Engine, HTCondor, Kubernetes-based batch schedulers)
- Strong knowledge of Python and Linux systems administration
- Experience operating and tuning HPC job schedulers at scale (thousands of nodes or more)
- Understanding of GPU cluster topologies, networking fabrics, and how scheduling decisions impact training performance
- Experience building tooling and automation for cluster operations (job orchestration, resource monitoring, capacity planning)
- Familiarity with containerization and orchestration technologies (Docker, Kubernetes, Pyxis/Enroot)
- Understanding of modern machine learning training workflows and resource requirements
- Experience with parallel programming concepts and distributed systems
Compensation and Benefits
Benefits
Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits from day 1 of hire:
- Medical plans, with plan options that have a $0 payroll deduction
- Family-building, fertility, adoption and surrogacy benefits
- Dental (including orthodontic coverage) and vision plans, both of which have options with a $0 paycheck contribution
- Company-paid Health Savings Account (HSA) contribution when enrolled in the high-deductible medical plan with HSA
- Healthcare and Dependent Care Flexible Spending Accounts (FSA)
- 401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
- Company-paid Basic Life and AD&D insurance
- Short-term and long-term disability insurance (90-day waiting period)
- Employee Assistance Program
- Sick and vacation time (flex time for salaried positions, accrued hours for hourly positions), and paid holidays
- Back-up childcare and parenting support resources
- Voluntary benefits, including critical illness, hospital indemnity, accident insurance, theft and legal services, and pet insurance
- Weight loss and tobacco cessation programs
- Tesla Babies program
- Commuter benefits
- Employee discounts and perks program
Expected Compensation
$140,000 - $252,000/annual salary + cash and stock awards + benefits
Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.