Job title: Software Engineer, AI Infrastructure (Training + Inference) / Member of Technical Staff
Who We Are
WaveForms AI is an audio large language model (LLM) company building the future of audio intelligence through advanced research and products. Our models will transform human-AI interactions, making them more natural, engaging, and immersive.
Role Overview
The Software Engineer, AI Infrastructure (Training + Inference) will design, build, and optimize the infrastructure that powers our large-scale training and real-time inference pipelines. This role combines expertise in distributed computing, system reliability, and performance optimization. The candidate will collaborate with researchers to build scalable systems that support novel multimodal training, and will maintain uptime so that real-time applications deliver consistent results.
Key Responsibilities
Infrastructure Development: Design and implement infrastructure to support large-scale AI training and real-time inference, with a focus on multimodal inputs.
Distributed Computing: Build and maintain distributed systems to ensure scalability, efficient resource allocation, and high throughput.
Training Stability: Monitor and enhance the stability of training workflows by addressing bottlenecks, failures, and inefficiencies in large-scale AI pipelines.
Real-time Inference Optimization: Develop and optimize real-time inference systems to deliver low-latency, high-throughput results across diverse applications.
Uptime & Reliability: Implement tools and processes to maintain high uptime and ensure infrastructure reliability during both training and inference phases.
Performance Tuning: Identify and resolve performance bottlenecks, improving overall system throughput and response times.
Collaboration: Work closely with research and engineering teams to integrate infrastructure with AI workflows, ensuring seamless deployment and operation.
Required Skills & Qualifications
Distributed Systems Expertise: Proven experience in designing and managing distributed systems for large-scale AI training and inference.
Infrastructure for AI: Strong background in building and optimizing infrastructure for real-time AI systems, with a focus on multimodal data (audio + text).
Performance Optimization: Expertise in optimizing resource utilization, improving system throughput, and reducing latency in both training and inference.
Training Stability: Experience in troubleshooting and stabilizing AI training pipelines for high reliability and efficiency.
Technical Proficiency: Strong programming skills (Python preferred), proficiency with PyTorch, and familiarity with cloud platforms (AWS, GCP, Azure).