Site Reliability Engineer - Metrics

Nvidia

Durham, NC

Job posting number: #7273721 (Ref:JR1981347)

Posted: August 17, 2024

Job Description

Nvidia has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. Nvidia is a “learning machine” that constantly evolves by seeking new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human creativity and intelligence. Make the choice to join us today!

As an SRE focused on metrics reporting, you will collaborate closely with cross-functional teams, including software engineers, data scientists, and operations, to monitor, analyze, and optimize our systems. Your primary responsibility will be to collect, analyze, and present key performance indicators (KPIs) that drive operational excellence and inform strategic decisions.

What you’ll be doing:

  • Develop, test, and deploy data collectors, pipelines, and services to enhance the use of our AI/ML and chip development infrastructure

  • Participate in the full life cycle of tool development, testing, and deployment.

  • Work in a diverse team to provide operational and strategic metrics which empower our engineers to develop at the speed of light.

  • Continuously improve our chip development process through better observability

  • Directly contribute to the overall quality of our next-generation chips and improve their time to market.

What we need to see:

  • Experience in applying data analysis principles and influencing data-driven decisions

  • Experience with turning raw data into actionable reports

  • Hands-on experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open-source tools

  • Expert-level Python programming experience, including working with APIs

  • Extensive experience with CI/CD tools such as Jenkins and/or GitLab

  • Passion for improving the productivity of others

  • Excellent planning and interpersonal skills

  • Flexibility and adaptability when working in a dynamic environment with changing requirements

  • MS (preferred) or BS in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

  • 5+ years of relevant experience.

Ways to stand out from the crowd:

  • Hands-on experience running GPU-based workloads in a batch computing environment

  • Passion for gathering and visualizing metrics and data

  • Experience with chip design workflows, such as front end verification, back end workflows, or mixed signal workflows

  • Experience with job schedulers (in particular IBM Spectrum LSF and/or Slurm)

  • Mastery of distributed system principles

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.





Application Deadline: Open Until Filled
Employer Location: Nvidia, Santa Clara, California, United States