LRZ adds mega AI system as it piles on future computing systems
The race among high-performance computing centers to pile on leading-edge computers for faster time to science is heating up as new chip technologies become mainstream.
A European supercomputing hub near Munich, the Leibniz Supercomputing Center, is deploying Cerebras Systems’ CS-2 AI system as part of an internal initiative called Future Computing, which evaluates alternative computing technologies to accelerate scientific research in the region.
“The idea is exactly to explore these new technologies and see how they would adapt to the needs of scientists and all that they really need to do their groundbreaking research,” said Dieter Kranzlmüller, director of the computing center, which is also known as Leibniz-Rechenzentrum, or LRZ.
LRZ thinks less in terms of HPC systems versus AI systems, and more “in terms of characterization of workflows and the work to be done, and what makes sense for architectures,” said Laura Schulz, head of strategy at LRZ. She added that the CS-2 AI system will be part of a larger supercomputing backbone that will be available to researchers in the Bavarian region.
“We have multiple GPUs, we have FPGAs, we have a variety of CPUs, we have prototypes, engineering samples, a really nice build, and so we’re trying to get things done,” Schulz said.
LRZ is one of the top three supercomputing centers in Germany. The others are the Jülich Supercomputing Centre, which hosts JUWELS, the eighth-fastest supercomputer in the world according to the Top500 list, and the High-Performance Computing Center Stuttgart, which hosts the 43rd-ranked Hazel Hen supercomputer.
The CS-2’s Wafer Scale Engine 2 (WSE-2) chip has 850,000 cores and 40 GB of on-chip memory. The chip is the size of an entire wafer, and with 2.6 trillion transistors it is considered the largest chip in the world. The CS-2 is paired with HPE’s Superdome Flex, which serves as staging hardware that prepares complex data sets so the CS-2 can perform calculations on them faster.
Large AI models require huge datasets, and the HPE server has a large shared memory that allows the system to handle pre- and post-processing tasks very quickly during the training process. This is made possible by a large number of I/O slots with high bandwidth connectivity, so there are no bottlenecks in data transfers to the CS-2.
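The staging role described above can be sketched as a simple producer pattern: the host keeps the full dataset resident in its large shared memory and streams batches to the accelerator so training never stalls on storage I/O. This is a minimal illustration only, not HPE or Cerebras code; `stream_batches` and `train_step` are hypothetical names.

```python
# Sketch: host with large shared memory holds the whole dataset and
# streams batches to the accelerator (stand-in names, not real APIs).

def stream_batches(dataset, batch_size):
    """Yield contiguous batches from an in-memory dataset."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

def train_step(batch):
    # Stand-in for the work actually done on the accelerator.
    return sum(batch) / len(batch)

dataset = list(range(1000))   # whole dataset fits in host memory
losses = [train_step(b) for b in stream_batches(dataset, 100)]
print(len(losses))  # 10 batches processed
```

Because the dataset never leaves host memory between batches, the only per-step cost is the transfer of one batch, which is the bottleneck the high-bandwidth I/O slots are meant to remove.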
A complete data set can be kept in the Superdome Flex to feed the deep learning training taking place on the CS-2, which involves a lot of data movement, said Arti Garg, HPE’s chief strategist for AI.
“HPE servers solve a different problem. They enter the data into the CS-2. Large models require huge data sets. Datasets are processed and sent to CS-2 by the HPE system,” Cerebras Systems CEO Andrew Feldman said in an email exchange.
The HPE server simplifies orchestration and helps with convergence and accuracy when training AI models. The CS-2 has the ability to run multiple machine learning models simultaneously.
As datasets get larger, conventional computational approaches to AI take longer to produce results, and that’s where new types of accelerated systems like CS-2 fit in, said Andy Hock, vice president of product management at Cerebras Systems.
He gave the example of natural language processing, where increasingly large models have driven computational needs up more than 1,800-fold in the space of two years. The BERT model had 110 million parameters in 2018, and the newer GPT-3, considered a descendant of BERT, reached 175 billion parameters in 2020.
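The parameter growth behind those figures can be checked with a line of arithmetic (an illustrative calculation using the counts quoted above; the 1,800-fold figure refers to overall computational needs, which grow faster than parameter count alone):

```python
# Parameter counts quoted in the article.
bert_params = 110e6    # BERT, 2018
gpt3_params = 175e9    # GPT-3, 2020

# Parameter count alone grew roughly 1,600-fold in two years.
growth = gpt3_params / bert_params
print(round(growth))  # 1591
```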
“We don’t see this trend weakening. We have since seen the introduction of larger models in the trillion-parameter range, and we expect that in the near future, state-of-the-art models may be in the multi-trillion-parameter range,” Hock said.
The CS-2’s cores are identical, fully programmable, and optimized for the kinds of machine learning operations common to large-scale AI and HPC workloads.
Hock said the CS-2 can be considered a huge sparse linear algebra accelerator, as each core is directly connected to its four nearest neighbors across the entire device via a high-bandwidth, low-latency interconnect. The dataflow traffic pattern between cores is fully programmable at compile time.
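That nearest-neighbor topology behaves like a 2D mesh in which each core exchanges data only with the cores directly above, below, left, and right of it. The communication pattern can be sketched with a NumPy stencil (illustrative only, not Cerebras code; note that `np.roll` wraps around at the edges, unlike the real device):

```python
import numpy as np

def neighbor_exchange_step(grid):
    """One step where every 'core' combines its own value with those of
    its four nearest neighbors, mimicking a 2D mesh dataflow."""
    up    = np.roll(grid,  1, axis=0)
    down  = np.roll(grid, -1, axis=0)
    left  = np.roll(grid,  1, axis=1)
    right = np.roll(grid, -1, axis=1)
    return (grid + up + down + left + right) / 5.0

grid = np.zeros((8, 8))
grid[4, 4] = 1.0                      # inject data at one "core"
after = neighbor_exchange_step(grid)  # value spreads to 4 neighbors
```

Each step moves data only one hop, which is why keeping communicating cores adjacent (and the traffic pattern fixed at compile time) keeps latency low.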
The interconnect transfers data at 220 petabits per second, and the WSE-2 keeps neural network parameters on-chip while it runs, which speeds up computation. Weights for multi-billion-parameter models are stored in the MemoryX technology Cerebras announced last year, which can handle neural networks with up to 120 trillion parameters.
“The technology allows us to keep the parameters off-chip, but get the performance as if they were on-chip. By disaggregating compute and memory, MemoryX allows researchers to run models 100 times larger than the largest current models on a single CS-2,” said Feldman of Cerebras.
Developers can program the CS-2 using standard ML frameworks such as TensorFlow and PyTorch. The Cerebras compiler intercepts the program at compile time and translates it into an executable that can run on CS-2 devices.
Cerebras also has a lower-level software development kit aimed at HPC users for projects ranging from signal processing to physics-based modeling and simulation.
“We continue to improve this version of the stack, bringing more and more features to increase the range of applications users can work on,” Hock said.