
High Performance Computing Software Engineer - Supercomputing

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

IFM is building the foundational compute infrastructure that will power tomorrow’s breakthroughs in AI and computational science. We’re looking for a High Performance Computing Software Engineer to help us design, develop, and operate the software systems that run our large-scale AI workloads.

In this role, you’ll work at the intersection of high-performance computing and machine learning. You’ll be part of a team responsible for crafting the software stack that enables training of cutting-edge ML models—spanning 1000+ GPUs—and ensuring our infrastructure is robust, performant, and developer-friendly.

Job Responsibilities

  • Design and implement high-performance, distributed software solutions for large-scale AI/ML training.
  • Optimize low-level system components, including the Linux kernel, GPU/accelerator kernels, and interconnects.
  • Develop and tune communication libraries such as NCCL, MPI, UCX, RCCL, and RDMA-based systems.
  • Partner with ML researchers and engineers to support frameworks like PyTorch, Megatron-LM, and DeepSpeed in large-scale production environments.
  • Contribute to our scheduling, orchestration, and job management systems, including Slurm and Kubernetes.
  • Debug and resolve complex issues across the stack—from kernel to container to model.
  • Work closely with hardware vendors, upstream open-source communities, and internal teams to drive performance and reliability improvements.

Skills & Experience

  • Proven experience developing and optimizing software for large-scale ML workloads (1000+ GPUs preferred).
  • Deep understanding of Linux kernel internals and accelerator (GPU) kernel development.
  • Proficiency with distributed communication libraries (e.g., NCCL, RCCL, MPI, UCX, SHARP, Libfabric).
  • Experience with ML frameworks like PyTorch, TensorFlow, JAX, or Megatron-LM.
  • Strong knowledge of HPC job scheduling and orchestration tools (e.g., Slurm, Kubernetes, Pyxis).
  • Excellent debugging and systems performance tuning skills.
  • A collaborative mindset with a focus on shared success and technical excellence.


$150,000 - $300,000 a year
Benefits Include
  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401K Plan
  • Generous paid time off, sick leave, and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and disability
 


Employment type: Full-time, onsite
Date posted: April 5, 2026