PENGUIN SOLUTIONS
AI and Accelerated Computing Infrastructures at Scale
AGENDA
1. About Penguin: YOUR TRUSTED ADVISOR FOR AI FACTORIES
2. Delivering AI Factory at Scale: CHALLENGES AND SOLUTIONS
3. Penguin's Scyld Suite: YOUR AI FACTORY UNIFIED CONTROL PLANE
4. Your End-to-End AI Factory Journey: GUIDANCE AND FLEXIBILITY
© 2023 Penguin Solutions. All Rights Reserved
About Penguin
Penguin Solutions, an SGH company, designs, builds, deploys, and manages AI and accelerated computing infrastructures at scale.
Tailored: We deliver highly tuned solutions that are designed to significantly improve results.
Proven: We have over 24 years of HPC experience and have deployed over 50,000 GPUs in partnership with leaders in AI.
Innovative: We continually incorporate the best elements and technologies.
- 24 YEARS OF EXPERIENCE: AI/HPC
- AI CLUSTER MANAGEMENT: Scyld Suite
- NASDAQ: SGH
- BROAD MARKET EXPERTISE: Commercial, Fed, DoD
- NVIDIA CERTIFIED: DGX Managed Service Provider
- FLEXIBILITY: On Prem, Cloud, and Hybrid
Experts at AI Factory Infrastructure
Enterprise infrastructure designed to support advanced workloads with optimal performance and stability at scale
AI FACTORY
- User Workloads: AI & HPC Applications (HPC Workloads, Analytics/ML Workloads, AI/DL Workloads)
- Design, Build, Deploy, Manage
- Platform Foundation, Data Center: Software Technologies, Compute & GPU Technologies, Data Technologies, Data Center Infrastructure
- Platform Foundation, Cloud: Software Technologies, Compute & GPU Technologies, Data Technologies, Secure Cloud Infrastructure
Penguin: Proven AI Factory Delivery at Scale
- ChatGPT trained on: 10,000 GPUs
- Google A3 cluster: 26,000 GPUs
- Penguin manages: 50,000+ GPUs
Customer Profile - Ultrascale Company
Business Challenge
- Urgently needed a large-scale AI platform capable of supporting key product and innovation initiatives
- Co-development of a leading-edge technology solution
- DevOps needs for HPC/AI at scale
Super Scale AI Infrastructure Platform
- Over 10,000 NVIDIA A100 GPUs
- Hundreds of Petabytes of data storage capacity
- 200 Gb/s HDR InfiniBand per GPU
- exaFLOPS of mixed precision compute
- Delivered in 2022 with continued Penguin-provided managed services
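The per-GPU network figure above can be turned into a rough aggregate. The sketch below is a back-of-envelope estimate only: it assumes exactly 10,000 GPUs (the slide says "over 10,000", so this is a lower bound) and simply sums the 200 Gb/s link rates, ignoring fabric topology and oversubscription. The names `GPU_COUNT`, `LINK_GBPS`, and `aggregate_bandwidth_tbps` are illustrative, not from the original deck.

```python
# Back-of-envelope network sizing from the figures on this slide.
# GPU_COUNT is an assumed lower bound ("over 10,000"); the exact count is not given.
GPU_COUNT = 10_000
LINK_GBPS = 200  # HDR InfiniBand per GPU, in gigabits per second


def aggregate_bandwidth_tbps(gpus: int, link_gbps: float) -> float:
    """Sum of per-GPU link rates, in terabits per second."""
    return gpus * link_gbps / 1_000


total_tbps = aggregate_bandwidth_tbps(GPU_COUNT, LINK_GBPS)
total_tbytes_per_s = total_tbps / 8  # 8 bits per byte

print(f"Aggregate injection bandwidth: at least {total_tbps:,.0f} Tb/s "
      f"({total_tbytes_per_s:,.0f} TB/s)")
```

This is the sum of link rates at the GPU edge, not a measured or guaranteed throughput for any workload.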
Customer Profile - Global Energy Provider
Business Challenge
- Improve data center sustainability without compromising performance
- Optimize compute density, power utilization, and cooling
- Obtain end-to-end system build, deployment, and support
Immersion Cooled AI Infrastructure
- Single-phase immersion cooling
- Dense NVIDIA A100 GPUs in Penguin AI servers
- Designed, delivered and supported by Penguin
- Reduced environmental impact from data center cooling
- Ongoing development and deployment of systems since 2021
Customer Profile - Federal System Integrator
Business Challenge
- Hybrid Cloud solution needs for AI and HPC workloads
- Requirements from 5 separate federal agencies
- Desire for full DevOps-based managed services
Designed AI & HPC Solution and Deployed in Colo
- Thousands of nodes of compute capability
- CPU, GPU and InfiniBand network infrastructure
- Multiple petabytes of data storage
- Penguin services provide all HW, SW & DevOps management
- Hundreds of users from diverse teams
- 25x compute capacity growth over 2.5 years
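To put the growth figure in annual terms, the sketch below converts "25x over 2.5 years" into an equivalent yearly multiplier. It assumes steady compound growth, which the slide does not state; actual capacity was presumably added in discrete build-outs. The function name `annualized_factor` is illustrative.

```python
# Convert a total growth multiple over a period into an equivalent yearly multiple.
# Assumption: growth compounds evenly over time (not stated in the source).
GROWTH_FACTOR = 25.0  # total compute capacity growth
YEARS = 2.5           # period over which it occurred


def annualized_factor(total_factor: float, years: float) -> float:
    """Yearly multiplier f such that f ** years == total_factor."""
    return total_factor ** (1.0 / years)


annual = annualized_factor(GROWTH_FACTOR, YEARS)
print(f"Equivalent compound growth: about {annual:.2f}x per year")
```

Under that assumption, capacity would have more than tripled each year.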
"Working in partnership with our implementation partner, Penguin Computing, we improved our overall cluster management. By the time we completed the second phase of building RSC, availability stayed above 95 percent on a consistent basis. This was no small feat given that we added a 10K GPU cluster while concurrently running multiple research projects.
We now have a template for building large GPU clusters that is repeatable and reliable."
18 May 2023
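For scale, the 95 percent availability figure in the quote can be expressed as a downtime budget. The sketch below is simple arithmetic, assuming availability is measured as the fraction of wall-clock hours the cluster is usable; the quote does not say how it was measured. The function name `max_downtime_hours` is illustrative.

```python
# Translate an availability target into the maximum downtime it allows.
# Assumption: availability = usable hours / total wall-clock hours.
AVAILABILITY = 0.95
HOURS_PER_YEAR = 365 * 24
HOURS_PER_30_DAYS = 30 * 24


def max_downtime_hours(availability: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at the given availability."""
    return (1.0 - availability) * period_hours


yearly = max_downtime_hours(AVAILABILITY, HOURS_PER_YEAR)
monthly = max_downtime_hours(AVAILABILITY, HOURS_PER_30_DAYS)
print(f"At {AVAILABILITY:.0%} availability: up to ~{yearly:.0f} h/year "
      f"or ~{monthly:.0f} h per 30 days of downtime")
```

Staying above 95 percent therefore means less downtime than these budgets.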
Challenges - AI Factories and Accelerated Computing at Scale
COMPLEX DESIGN
- New workload & architecture
- AI clusters are highly sensitive at scale
- Requirement to blend multiple networks & processor types
INTRICATE BUILD
- Supply chain delays with specialty critical components
- Complex build process to achieve throughput
- Lack of software suite for deployment at scale
LENGTHY DEPLOYMENT
- Complex on-site power, cooling, network, and security integration
- Need for pre-production performance and throughput validation
- Production cluster monitoring
PRECISION MANAGEMENT
- Specialty components with unique failure signatures
- Performance issues drive significant financial impact
- User workloads interrupted and training time lost
Disclaimer
SMART Global Holdings Inc. published this content on 19 October 2023 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 19 October 2023 22:58:30 UTC.