Senior Cloud Services Software Engineer
Job Description
Join NVIDIA’s DGX Cloud Team: Shape the Future of AI
Senior Distributed Software Engineer – AI Infrastructure
Do you want to be at the forefront of Artificial Intelligence innovation? NVIDIA’s DGX Cloud team is seeking a talented and passionate Senior Distributed Software Engineer to contribute to the very foundation upon which our groundbreaking AI research is built. As a key member of our team, you’ll be instrumental in crafting and optimizing the AI infrastructure services that deliver unparalleled performance and resilience for DGX Cloud.
Imagine playing a critical role in building a stable, scalable environment that empowers AI researchers with the resources and scale they need to make breakthrough discoveries. This is more than just a job; it’s an opportunity to push the boundaries of technology and help redefine what’s possible in the world of AI and cloud computing. Work alongside a world-class team of engineers who share your passion for innovation and excellence.
What You’ll Be Doing
As a Software Engineer specializing in backend development, you’ll collaborate within a dedicated team to elevate the infrastructure and products that power NVIDIA’s AI platforms. Your work will be essential in enabling cutting-edge AI research, with a focus on:
- Developing innovative solutions at the intersection of machine learning, distributed systems, and high-performance computing, directly contributing to the advancement of AI technologies.
- Designing, developing, and optimizing (micro-)services orchestrated by Kubernetes to enable large-scale AI training workflows on AI training supercomputers hosted at major cloud service providers (CSPs), emphasizing both resilience and efficiency.
- Co-designing and implementing APIs that seamlessly integrate these services with NVIDIA’s comprehensive resilience stack, ranging from tier-0 telemetry to break/fix automation and advanced checkpointing systems.
- Crafting a user-friendly submission abstraction that lets model engineers and training platforms/frameworks submit long-running training jobs effortlessly. The abstraction hides the complexity of infrastructure failures and job lifecycle management (including auto-restarts), ensures optimal resource utilization, and provides clear and timely feedback.
- Developing modular services that can be easily coordinated with and deployed onto on-premises AI clusters that use NVIDIA hardware and cloud services.
What We Need To See
- Bachelor’s degree or higher in Computer Science or a related technical field (or equivalent experience).
- 5+ years of hands-on experience in backend development, with proficiency in Python, Go, C/C++, or similar high-performance languages.
- Proven track record of building and scaling large-scale distributed systems.
- Experience with leading cloud computing platforms like AWS, Azure, and GCP, as well as container technologies such as Docker and Kubernetes, and HPC/AI platforms like Slurm.
Ways To Stand Out From The Crowd
- Real-world experience with popular DL frameworks and orchestrators like PyTorch, TensorFlow, JAX, and Ray.
- Experience in developing framework plugin architectures that enable seamless integration between frameworks and cluster schedulers, transparent to end-users.
- Strong understanding of NVIDIA GPUs, network technologies, and their associated failure patterns.
- Experience with AI models and AI-based tools.
- Demonstrable contributions to open-source projects or personal code repositories.
NVIDIA is at the forefront of the AI revolution, pioneering groundbreaking advancements in Artificial Intelligence, High-Performance Computing, and Visualization. Our GPU, the visual cortex of modern computers, is at the heart of our products and services, enabling new frontiers in exploration, creativity, and discovery. From artificial intelligence to autonomous vehicles, our work is transforming science fiction into reality.
NVIDIA is seeking exceptional individuals like you to join us in accelerating the next wave of artificial intelligence!
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is $184,000 – $287,500 for Level 4, and $224,000 – $356,500 for Level 5. You will also be eligible for equity and benefits. Applications for this job will be accepted at least until August 10, 2025.
NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We highly value diversity in our current and future employees and do not discriminate based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.
NVIDIA is the world leader in accelerated computing, pioneering solutions to tackle challenges that were once deemed impossible. Our work in AI and digital twins is transforming industries and profoundly impacting society.

