Senior Cloud Services Software Engineer

NVIDIA USA
📍 US, CA, Santa Clara 💰 $65k - $160k
📅 5 days ago

Job Description

Join NVIDIA’s DGX Cloud Team: Shape the Future of AI

Senior Distributed Software Engineer – AI Infrastructure

Do you want to be at the forefront of Artificial Intelligence innovation? NVIDIA’s DGX Cloud team is seeking a talented and passionate Senior Distributed Software Engineer to contribute to the very foundation upon which our groundbreaking AI research is built. As a key member of our team, you’ll be instrumental in crafting and optimizing the AI infrastructure services that deliver unparalleled performance and resilience for DGX Cloud.

Imagine playing a critical role in building a stable, scalable environment that empowers AI researchers with the resources and scale they need to make breakthrough discoveries. This is more than just a job; it’s an opportunity to push the boundaries of technology and help redefine what’s possible in the world of AI and cloud computing. Work alongside a world-class team of engineers who share your passion for innovation and excellence.

What You’ll Be Doing

As a Software Engineer specializing in backend development, you’ll collaborate within a dedicated team to elevate the infrastructure and products that power NVIDIA’s AI platforms. Your work will be essential in enabling cutting-edge AI research, with a focus on:

  • Developing innovative solutions at the intersection of machine learning, distributed systems, and high-performance computing, directly contributing to the advancement of AI technologies.
  • Designing, developing, and optimizing (micro-)services orchestrated by Kubernetes to enable large-scale AI training workflows on AI training supercomputers hosted at major cloud service providers (CSPs), emphasizing both resilience and efficiency.
  • Co-designing and implementing APIs that seamlessly integrate these services with NVIDIA’s comprehensive resilience stack, ranging from tier-0 telemetry to break/fix automation and advanced checkpointing systems.
  • Crafting a user-friendly submission abstraction that lets model engineers and training platforms/frameworks submit long-running training jobs effortlessly. The abstraction hides the complexity of infrastructure failures and job lifecycle management (including auto-restarts), ensures optimal resource utilization, and provides clear and timely feedback.
  • Developing modular services that can be easily coordinated with and deployed onto on-premises AI clusters that use NVIDIA hardware and cloud services.

What We Need To See

  • Bachelor’s degree or higher in Computer Science or a related technical field (or equivalent experience).
  • 5+ years of hands-on experience in backend development, with proficiency in Python, Go, C/C++, or similar high-performance languages.
  • Proven track record of building and scaling large-scale distributed systems.
  • Experience with leading cloud computing platforms like AWS, Azure, and GCP, as well as container technologies such as Docker and Kubernetes, and HPC/AI platforms like Slurm.

Ways To Stand Out From The Crowd

  • Real-world experience with popular DL frameworks and orchestrators like PyTorch, TensorFlow, JAX, and Ray.
  • Experience in developing framework plugin architectures that enable seamless integration between frameworks and cluster schedulers, transparent to end-users.
  • Strong understanding of NVIDIA GPUs, network technologies, and their associated failure patterns.
  • Experience with AI models and AI-based tools.
  • Demonstrable contributions to open-source projects or personal code repositories.

NVIDIA is at the forefront of the AI revolution, pioneering groundbreaking advancements in Artificial Intelligence, High-Performance Computing, and Visualization. Our GPU, the visual cortex of modern computers, is at the heart of our products and services, enabling new frontiers in exploration, creativity, and discovery. From artificial intelligence to autonomous vehicles, our work is transforming science fiction into reality.

NVIDIA is seeking exceptional individuals like you to join us in accelerating the next wave of artificial intelligence!

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is $184,000 – $287,500 for Level 4, and $224,000 – $356,500 for Level 5. You will also be eligible for equity and benefits. Applications for this job will be accepted at least until August 10, 2025.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We highly value diversity in our current and future employees and do not discriminate based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

NVIDIA is the world leader in accelerated computing, pioneering solutions to tackle challenges that were once deemed impossible. Our work in AI and digital twins is transforming industries and profoundly impacting society.

