Software Engineer, AI Infrastructure (Training + Inference)

Details of the offer

Job title: Software Engineer, AI Infrastructure (Training + Inference) / Member of Technical Staff
Who We Are
WaveForms AI is an Audio Large Language Model (LLM) company building the future of audio intelligence through advanced research and products. Our models will transform human-AI interactions, making them more natural, engaging, and immersive.
Role overview: The Software Engineer, AI Infrastructure (Training + Inference) will design, build, and optimize the infrastructure that powers our large-scale training and real-time inference pipelines. This role combines expertise in distributed computing, system reliability, and performance optimization. The candidate will collaborate with researchers, focusing on building scalable systems that support novel multimodal training and on maintaining uptime to deliver consistent results for real-time applications.
Key Responsibilities
Infrastructure Development: Design and implement infrastructure to support large-scale AI training and real-time inference, with a focus on multimodal inputs.
Distributed Computing: Build and maintain distributed systems to ensure scalability, efficient resource allocation, and high throughput.
Training Stability: Monitor and enhance the stability of training workflows by addressing bottlenecks, failures, and inefficiencies in large-scale AI pipelines.
Real-time Inference Optimization: Develop and optimize real-time inference systems to deliver low-latency, high-throughput results across diverse applications.
Uptime & Reliability: Implement tools and processes to maintain high uptime and ensure infrastructure reliability during both training and inference phases.
Performance Tuning: Identify and resolve performance bottlenecks, improving overall system throughput and response times.
Collaboration: Work closely with research and engineering teams to integrate infrastructure with AI workflows, ensuring seamless deployment and operation.
Required Skills & Qualifications
Distributed Systems Expertise: Proven experience in designing and managing distributed systems for large-scale AI training and inference.
Infrastructure for AI: Strong background in building and optimizing infrastructure for real-time AI systems, with a focus on multimodal data (audio + text).
Performance Optimization: Expertise in optimizing resource utilization, improving system throughput, and reducing latency in both training and inference.
Training Stability: Experience in troubleshooting and stabilizing AI training pipelines for high reliability and efficiency.
Technical Proficiency: Strong programming skills (Python preferred), proficiency with PyTorch, and familiarity with cloud platforms (AWS, GCP, Azure).

Nominal Salary: To be agreed
