Mellanox® is an industry leader in the implementation of the Data-Centric data center architecture, an architecture that enables In-Network Computing – allowing data to be analyzed everywhere, and in particular as it is being transferred within the network. This approach eliminates the bottlenecks of a traditional CPU-Centric data center architecture and provides the highest levels of application performance and scalability. Mellanox solutions for the Data-Centric architecture include both hardware and software elements to meet the needs of the next generation of high-performance computing, artificial intelligence, deep learning, big data, and other data-intensive applications.
The key In-Network Computing technologies provided with Mellanox interconnect solutions include:
- SHARP – an acronym for Scalable Hierarchical Aggregation and Reduction Protocol, SHARP supports communication framework-related computation on Mellanox switch and HCA hardware, enabling data reduction and aggregation algorithms to be managed and executed by the network. SHARP is used by HPC applications to offload collective communications and accelerate their performance.
- MPI Tag Matching – Mellanox ConnectX®-5 and later HCAs offload the MPI tag-matching protocol from the CPU to the network, and are able to match MPI tags and process requests without CPU involvement. This results in latency improvements and a dramatic reduction in CPU utilization for MPI operations, from 90% to nearly 0%.
- The latest addition to GPUDirect® RDMA is GPUDirect® ASYNC. The GPUDirect ASYNC protocol similarly offloads the data path, but additionally offloads control-plane operations from the CPU, enabling the GPU to schedule network communications. This results in further acceleration of GPU-network communications, significantly improved power savings, and the option to use less expensive CPUs, which lowers the cost of the overall platform.
- SHIELD – an acronym for Self-Healing Interconnect Enhancement for Intelligent Datacenters, SHIELD enables link fault recovery 5000x faster than a subnet manager can react. This prevents communication protocols from timing out, eliminating performance-crushing retries and communication-related job failures, enhancing productivity and maximizing return on investment.
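As an illustration of the kind of reduction SHARP moves into the network, the sketch below aggregates per-node values up a tree of "switch" stages instead of funneling every node's data to a single root. This is a conceptual Python sketch of hierarchical reduction, not the SHARP API; the function name and fan-in parameter are illustrative.

```python
# Conceptual sketch of in-network hierarchical reduction (the idea behind
# SHARP): each "switch" level sums the partial results of its children,
# so only aggregated data travels toward the root.

def tree_reduce(values, fan_in=2):
    """Sum a list of per-node values level by level up a tree with the
    given switch fan-in, returning the fully aggregated result."""
    level = list(values)
    while len(level) > 1:
        # Each "switch" aggregates up to fan_in partial results.
        level = [sum(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

With a fan-in of 2 and N nodes, the aggregation completes in about log2(N) stages, which is why offloading it to the switch hierarchy scales so well compared with a root CPU receiving N messages.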
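The tag-matching work that ConnectX-5 and later HCAs perform in hardware can be sketched in software as follows: posted receives are matched against incoming messages by source and tag, with wildcard semantics analogous to MPI_ANY_SOURCE and MPI_ANY_TAG. This is a conceptual illustration of the matching semantics only, not the MPI or verbs API; the class and method names are hypothetical.

```python
# Conceptual sketch of MPI-style tag matching: incoming messages are
# matched against posted receives in posting order by (source, tag),
# with wildcards. Hardware tag matching performs this search on the HCA
# instead of the host CPU.

ANY = object()  # wildcard, analogous to MPI_ANY_SOURCE / MPI_ANY_TAG

class MatchQueue:
    def __init__(self):
        self.posted = []  # posted receives: (source, tag, buffer_name)

    def post_recv(self, source, tag, buffer_name):
        self.posted.append((source, tag, buffer_name))

    def match(self, source, tag):
        """Match an incoming (source, tag) against posted receives in
        posting order; return the matched buffer name, or None."""
        for i, (s, t, buffer_name) in enumerate(self.posted):
            if (s is ANY or s == source) and (t is ANY or t == tag):
                del self.posted[i]
                return buffer_name
        return None  # unmatched: real MPI queues it as "unexpected"

q = MatchQueue()
q.post_recv(source=1, tag=7, buffer_name="buf_a")
q.post_recv(source=ANY, tag=7, buffer_name="buf_b")
print(q.match(2, 7))  # buf_b (wildcard source matches)
print(q.match(1, 7))  # buf_a
```

Because every incoming message may have to be compared against a long list of posted receives, doing this search on the CPU consumes cycles proportional to queue depth; offloading it to the HCA is what removes that cost from the host.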
Additionally, routing algorithms provide a methodology for the InfiniBand subnet manager to find optimal routes for all traffic on a given fabric architecture. Mellanox supports both topology-specific and general algorithms that extend support to many different topologies in addition to fat tree, such as multi-dimensional torus up to 6D, hypercube and enhanced hypercube, mesh, and most recently, Dragonfly+. The latter is worthy of specific note, as it provides an interconnected group topology that can scale to very large node counts, with advantages for expandability, maximizing cross-sectional bandwidth with advanced adaptive routing, and providing strategies for saving both cabling cost and complexity.
Bioscience
Bioscience data centers leverage HPC technology for both research and health care. Reducing pipeline wait times and improving productivity not only speeds the pace of basic research, it can save lives! These data centers are challenged by applications that are both big data and high-performance computing problems, and that would suffer from suboptimal performance of the network linking servers and storage.
Electronic Design Automation
Electronic Design Automation involves 3D modeling, fluid dynamics, and other compute-intensive processes that require high-performance computing (HPC) data center solutions. These applications use highly-coupled parallel algorithms that can be very sensitive to communication fabric and framework latencies. Mellanox InfiniBand with SHARP can deliver upwards of 10x improvements for these applications.
Manufacturing
Mechanical computer-aided design (MCAD) and computer-aided engineering (CAE) systems have adopted HPC cluster computing environments to speed processing times and improve revenue for new products. Mellanox InfiniBand with RDMA enables complex systems to be modelled using large memory models in real time, producing higher degrees of accuracy and optimizing manufacturing processes for improved safety, productivity, and/or costs.
Media and Entertainment
Today’s media data centers invest in HPC cluster technology, combining the power of hundreds or thousands of processors in the service of highly complex rendering tasks. Animation and augmented reality are essential to today’s production methods and require the best-performing network for highly parallel computing on both CPUs and GPUs with GPUDirect, in the presence of massive data storage requirements.
Oil and Gas Industry Modeling
Oil and gas companies use HPC technology to explore for and locate new reserves for refinement in a highly competitive industry. Success depends on expensive, very high-resolution data collection and complex modelling/analysis techniques. These computing challenges involve processing massive amounts of data to reduce the cost of locating and harvesting raw fuel products and to optimize production.
Weather and Astronomy
Meteorological forecasting and research require high-speed processing of massive data inputs, often in real time, so productivity depends on processing speed. Ocean modelling and astronomy face similar challenges of computing very large-scale simulations. Data centers used in these types of research use HPC cluster technology, combining the power of thousands of CPUs with large, high-speed storage systems.