PENGUIN SOLUTIONS

AI and Accelerated Computing Infrastructures at Scale

AGENDA

1. About Penguin: YOUR TRUSTED ADVISOR FOR AI FACTORIES

2. Delivering AI Factory at Scale: CHALLENGES AND SOLUTIONS

3. Penguin's Scyld Suite: YOUR AI FACTORY UNIFIED CONTROL PLANE

4. Your End-to-End AI Factory Journey: GUIDANCE AND FLEXIBILITY

© 2023 Penguin Solutions. All Rights Reserved

About Penguin

Penguin Solutions, an SGH company, designs, builds, deploys, and manages AI and accelerated computing infrastructures at scale.

Tailored: We deliver highly tuned solutions that are designed to significantly improve results.

Proven: We have over 24 years of HPC experience and have deployed over 50,000 GPUs in partnership with leaders in AI.

Innovative: We continually incorporate the best elements and technologies.

  • 24 YEARS OF EXPERIENCE: AI/HPC
  • AI CLUSTER MANAGEMENT: Scyld Suite
  • NASDAQ: SGH
  • BROAD MARKET EXPERTISE: Commercial, Fed, DoD
  • NVIDIA CERTIFIED: DGX Managed Service Provider
  • FLEXIBILITY: On Prem, Cloud, and Hybrid

Experts at AI Factory Infrastructure

Enterprise infrastructure designed to support advanced workloads with optimal performance and stability at scale

AI FACTORY (diagram)

  • User Workloads: AI & HPC Applications (HPC Workloads, Analytics/ML Workloads, AI/DL Workloads)
  • Services: Design, Build, Deploy, Manage
  • Platform Foundation, Data Center: Software Technologies, Compute & GPU Technologies, Data Technologies, Data Center Infrastructure
  • Platform Foundation, Cloud: Software Technologies, Compute & GPU Technologies, Data Technologies, Secure Cloud Infrastructure

Penguin: Proven AI Factory Delivery at Scale

  • ChatGPT trained on: 10,000 GPUs
  • Google A3 cluster: 26,000 GPUs
  • Penguin manages: 50,000+ GPUs

Customer Profile - Ultrascale Company

Business Challenge

  • Urgently needed a large-scale AI platform capable of supporting key product and innovation initiatives
  • Co-development of a leading-edge technology solution
  • DevOps needs for HPC/AI at scale

Super Scale AI Infrastructure Platform

  • Over 10,000 NVIDIA A100 GPUs
  • Hundreds of Petabytes of data storage capacity
  • 200 Gb/s HDR InfiniBand per GPU
  • exaFLOPS of mixed precision compute
  • Delivered in 2022 with continued Penguin-provided managed services

Customer Profile - Global Energy Provider

Business Challenge

  • Improve data center sustainability without compromising performance
  • Optimize compute density, power utilization, and cooling
  • Obtain end-to-end system build, deployment, and support

Immersion Cooled AI Infrastructure

  • Single-phase immersion cooling
  • Dense NVIDIA A100 GPUs in Penguin AI servers
  • Designed, delivered and supported by Penguin
  • Reduced environmental impact from data center cooling
  • Ongoing development and deployment of systems since 2021

Customer Profile - Federal System Integrator

Business Challenge

  • Hybrid Cloud solution needs for AI and HPC workloads
  • Requirements from 5 separate federal agencies
  • Desire for full DevOps-based managed services

AI & HPC Solution Designed and Deployed in Colo

  • Thousands of nodes of compute capability
  • CPU, GPU and InfiniBand network infrastructure
  • Multiple petabytes of data storage
  • Penguin services provide all HW, SW, and DevOps management
  • Hundreds of users from diverse teams
  • 25x compute capacity growth over 2.5 years

"Working in partnership with our implementation partner, Penguin Computing, we improved our overall cluster management. By the time we completed the second phase of building RSC, availability stayed above 95 percent on a consistent basis. This was no small feat given that we added a 10K GPU cluster while concurrently running multiple research projects.

We now have a template for building large GPU clusters that is repeatable and reliable."

18 May 2023

Challenges - AI Factories and Accelerated Computing at Scale

COMPLEX DESIGN

  • New workload & architecture
  • AI clusters are highly sensitive at scale
  • Requirement to blend multiple networks & processor types

INTRICATE BUILD

  • Supply chain delays with specialty critical components
  • Complex build process to achieve throughput
  • Lack of software suite for deployment at scale

LENGTHY DEPLOYMENT

  • Complex on-site power, cooling, network, and security integration
  • Need for pre-production performance and throughput validation
  • Production cluster monitoring

PRECISION MANAGEMENT

  • Specialty components with unique failure signatures
  • Performance issues drive significant financial impact
  • User workloads interrupted and training time lost

Disclaimer

SMART Global Holdings Inc. published this content on 19 October 2023 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 19 October 2023 22:58:30 UTC.