Senior Site Reliability Engineer ML Platforms

2100 NVIDIA USA

📍 US, CA, Santa Clara 💰 $60k - $135k

📅 4 days ago

Job Description

Join NVIDIA: Senior Site Reliability Engineer (SRE), Data Science & ML Platform

Are you ready to play a pivotal role at the heart of NVIDIA’s data-driven culture? We are seeking an exceptional Senior Site Reliability Engineer (SRE) to join our Data Science & ML Platform team. This is a unique opportunity to architect, build, and maintain the resilient, high-performance systems that power cutting-edge AI and Machine Learning applications on a massive scale.

At NVIDIA, we are pioneers in Accelerated Computing, Artificial Intelligence, and High-Performance Computing. Our invention, the GPU, is transforming industries and enabling breakthroughs that were once unimaginable. As a Senior SRE, you’ll contribute directly to this mission by ensuring the stability and efficiency of the platforms that enable our data scientists and engineers to push the boundaries of AI and discovery.

What You Will Do

As a Senior SRE on our team, your contributions will be crucial in building and maintaining highly available and efficient systems. Your responsibilities will include:

Engineering and deploying robust software and systems solutions to guarantee the reliability and operational excellence of our Large-Scale Production Systems, critical for machine-learning workflows.
Deeply understanding system architecture, operational patterns, scalability challenges, and potential failure modes to proactively identify risks and drive systemic improvements.
Developing innovative tools and automation to streamline operations, reduce manual effort, and enhance team velocity.
Establishing robust frameworks, processes, and best practices to elevate operational maturity, improve team efficiency, and accelerate innovation cycles.
Defining, tracking, and improving system and service reliability through meaningful, actionable SLOs and other key reliability metrics.
Managing Capacity Planning and performance optimization to support infrastructure scaling across global Cloud Environments (public and private).
Building and enhancing our Observability infrastructure to improve monitoring, logging, and tracing for faster incident detection and resolution.
Practicing sustainable incident response and fostering a culture of Blameless Postmortems and continuous learning.

What We’re Seeking

To thrive in this role, you should possess a strong foundation in SRE principles and extensive experience managing complex systems. We’re looking for candidates with:

A minimum of 10 years of hands-on experience in Site Reliability Engineering (SRE), Cloud Platforms, or DevOps, specifically with Large-Scale Microservices in production environments.
A Master’s or Bachelor’s degree in Computer Science, Electrical Engineering, Computer Engineering, or equivalent practical experience.
A deep understanding of core SRE Principles, including Error Budgets, SLOs, and SLAs.
Solid experience with incident, change, and problem management processes.
Exceptional Problem-Solving, Root Cause Analysis, and system optimization skills.
Proven experience with Streaming Data Infrastructure services such as Kafka and Spark.
Expertise in building, operating, and scaling Observability Platforms for monitoring and logging (e.g., ELK, Prometheus).
Proficiency in one or more programming languages like Python, Go, Perl, or Ruby.
Hands-on experience scaling and operating Distributed Systems in public, private, or hybrid Cloud Environments.
Experience deploying, supporting, and supervising services, platforms, and application stacks.

Ways to Stand Out

Impress us with experience or skills that demonstrate a higher level of impact and capability:

Demonstrated success operating Large-Scale Distributed Systems with stringent SLAs.
Exceptional coding skills, particularly in Python and Go, with extensive experience in operating data platforms.
In-depth knowledge of CI/CD systems like Jenkins or GitHub Actions.
Proficiency with Infrastructure as Code (IaC) methodologies and tools.
Excellent Interpersonal Skills for collaborative problem-solving and communicating data-driven insights across teams.

Why Join NVIDIA?

When you join NVIDIA, you become part of a team that’s shaping the future. You’ll work on innovative technologies that power the next wave of AI and data science, alongside a dynamic and supportive team that values learning, growth, and intellectual curiosity. We offer the autonomy to tackle meaningful projects with the mentorship and resources needed for success. Our culture embraces diversity, encourages thoughtful risk-taking, and believes in iterative improvement through practices like Blameless Postmortems. This is an exciting career opportunity where your work makes a real difference in the world.

Compensation and Benefits

The base salary range for this position is 224,000 USD – 425,500 USD. Your actual base salary will be determined based on factors including your location, experience, and the compensation of employees in similar roles. In addition to base salary, this position is eligible for equity awards and comprehensive benefits.

Equal Opportunity Employer

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We highly value diversity in our current and future employees and do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

How to Apply

If you are excited by the prospect of driving reliability for cutting-edge AI platforms and have the skills and passion we’re looking for, we encourage you to apply! Applications are accepted on an ongoing basis.

To show you have read this job post completely, please mention the word CONTRIBUTION and tag RMzguNjguMTM0LjE5NA== when applying. This helps us identify candidates who pay close attention to detail.

Learn more about NVIDIA and our mission in accelerated computing.
