About The Company

Founded in 2004 in sunny San Diego, California, ServiceNow has grown into a global leader in cloud computing and enterprise software solutions. Driven by a visionary approach to transforming work processes, the company has established itself as a pioneer in AI-enhanced technology, serving over 8,100 customers worldwide, including 85% of the Fortune 500®. ServiceNow's intelligent cloud-based platform seamlessly connects people, systems, and processes, empowering organizations to operate more efficiently, innovatively, and securely. With a commitment to making the world work better for everyone, ServiceNow continues to innovate and expand its offerings, leveraging advanced AI and machine learning capabilities to shape the future of work.

About The Role

We are seeking a highly skilled and motivated Staff Machine Learning Engineer to join our Platform Engineering and AI Technology Organization (PLATO) at ServiceNow. This role is pivotal in building and maintaining our AI infrastructure, deploying scalable AI workloads, and ensuring high performance and reliability of our GPU clusters. The successful candidate will collaborate closely with researchers, AI engineers, and infrastructure teams to develop robust, efficient, and innovative AI platforms that enable end-to-end AI-powered work experiences for our customers. This position requires being onsite in our Santa Clara office for two days per week, offering a dynamic environment focused on cutting-edge AI and platform engineering. As part of our team, you will contribute to the continuous improvement of our operational practices, develop reusable code, and mentor colleagues to foster a culture of knowledge sharing and technical excellence.

Qualifications

4+ years of development experience with Python, Go, Java, or similar programming languages
4+ years of experience operating highly available distributed workloads on Kubernetes following a DevOps approach
Proficiency in applying AI to workflows, decision-making, or problem-solving, or in critically evaluating such integrations
Experience with prompt engineering and developing features based on large language models (LLMs)
Hands-on experience with training and fine-tuning large language models, including distillation, supervised fine-tuning, and policy optimization
Experience operating LLMs on NVIDIA GPUs
Strong experience with DevOps tooling such as Helm, Ansible, Kubernetes, Prometheus, Splunk, and GitLab CI
Proficiency in operating distributed systems built on Linux and J2EE
Knowledge of software-defined networking, infrastructure as code, and configuration management
Experience developing secure and compliant software for regulated environments
Ability to manage projects with significant technical risks and drive outcomes effectively
Preferred: 4+ years of experience in infrastructure and platform operations, deployments, SRE, and continuous platform health improvement

Responsibilities

Design, develop, and implement infrastructure and platform features that support AI workloads, ensuring scalability and performance
Collaborate with cross-functional teams including researchers, AI engineers, and infrastructure specialists to optimize GPU cluster performance and reliability
Enhance operational practices by translating operational use cases into software tooling requirements for Site Reliability Engineering (SRE)
Support deployment activities and provide ongoing support for AI/ML developers to facilitate smooth product delivery
Develop high-quality, clean, scalable, and reusable code adhering to best practices such as code reviews and unit testing
Engage with product owners to understand detailed requirements, owning the full development lifecycle from design through testing and deployment
Operate and optimize large language models on NVIDIA GPUs, ensuring efficient performance
Mentor colleagues, promote knowledge sharing, and foster a culture of continuous learning and innovation

Benefits

Competitive base salary ranging from $173,100 to $303,000, depending on experience and location
Equity options and variable/incentive compensation programs
Comprehensive health plans including medical, dental, and vision coverage
Flexible spending accounts and a 401(k) plan with company match
Employee Stock Purchase Program (ESPP) and matching donations
Flexible time-off policies and family leave programs to support work-life balance
Opportunities for professional development and career growth within a global organization

Equal Opportunity

ServiceNow is an equal opportunity employer. We are committed to creating an inclusive environment where all qualified applicants receive consideration for employment regardless of race, color, creed, religion, sex, sexual orientation, gender identity or expression, national origin, age, disability, veteran status, or any other protected category. We also consider qualified applicants with arrest or conviction records in accordance with applicable laws.
