Site Reliability Engineer - Metrics
Job Description
NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by seeking new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human creativity and intelligence. Make the choice to join us today!
As an SRE focused on metrics reporting, you will collaborate closely with cross-functional teams, including software engineers, data scientists, and operations, to monitor, analyze, and optimize our systems. Your primary responsibility will be to collect, analyze, and present key performance indicators (KPIs) that drive operational excellence and inform strategic decisions.
What you’ll be doing:
Develop, test, and deploy data collectors, pipelines, and services to enhance the use of our AI/ML and chip development infrastructure
Participate in the full life cycle of tool development, testing, and deployment
Work in a diverse team to provide operational and strategic metrics which empower our engineers to develop at the speed of light.
Continuously improve our chip development process through better observability
Directly contribute to the overall quality of our next-generation chips and improve their time to market
What we need to see:
Experience in applying data analysis principles and influencing data-driven decisions
Experience with turning raw data into actionable reports
Hands-on experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open-source tools
Expert-level Python programming experience, including working with APIs
Extensive experience with CI/CD tools such as Jenkins and/or GitLab
Passion for improving the productivity of others
Excellent planning and interpersonal skills
Flexibility/adaptability working in a dynamic environment with changing requirements
MS (preferred) or BS in Computer Science, Electrical Engineering, or a related field, or equivalent experience
5+ years of relevant experience
Ways to stand out from the crowd:
Hands-on experience running GPU-based workloads in a batch computing environment
Passion for gathering and visualizing metrics and data
Experience with chip design workflows, such as front end verification, back end workflows, or mixed signal workflows
Experience with job schedulers (in particular IBM Spectrum LSF and/or SLURM)
Mastery of distributed system principles
NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.
The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.