[Up to date with V-1.0.0.0 release]
This page outlines both software and hardware requirements and how they relate to each other. The naive (NAIV) compute method should work on any system with a compatible C compiler; it worked on every system we tried. Despite its high portability, this method is very slow and should only be used for very light network architectures or for pedagogical purposes. Most of the NAIV functions support multi-threading through OpenMP.
The BLAS compute method relies on the OpenBLAS library and hence shares its requirements. Most of the non-BLAS functions support multi-threading through OpenMP. We recommend compiling OpenBLAS with the USE_OPENMP=1 flag for the best performance with CIANNA.
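Since both the OpenMP-parallelized CIANNA functions and an OpenBLAS built with USE_OPENMP=1 use the OpenMP runtime, the thread count can be controlled through the standard OMP_NUM_THREADS environment variable. A minimal sketch (the variable must be set before the library that uses OpenMP is loaded):

```python
import os

# OMP_NUM_THREADS is the standard OpenMP environment variable. When OpenBLAS
# is compiled with USE_OPENMP=1, BLAS calls and the other OpenMP-parallelized
# functions share the same thread pool, so a single variable controls both.
# It must be set before the library is loaded by the process.
os.environ["OMP_NUM_THREADS"] = "8"

print(os.environ["OMP_NUM_THREADS"])
```

Exporting the variable in the shell before launching the process works identically.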
Recent OpenBLAS versions should support the most recent Intel, AMD (including Zen), and ARM-based CPUs. It worked flawlessly on every system we tried (non-exhaustive list):
Currently, CIANNA supports Nvidia GPUs exclusively, through CUDA (we plan to add a HIP version, with no ETA). It also uses only one GPU at a time; multi-GPU support may be added at some point (again, no ETA). Multiple GPUs can still be used through independent instances of CIANNA, for example to explore the hyperparameter space. For inference, since predictions are independent, items can be split across any number of CIANNA instances running on GPUs in parallel.
The minimum CUDA version is set to 9.0 since this is the first version to support the cuBLAS GemmEx operations (which is also why Kepler GPUs, sm_30 and sm_35, are not supported). We strongly recommend CUDA 10+, and the latest release is usually the best choice (currently 13.1). The two main reasons for the absence of support for older versions are:
We emphasize that this is solely a software version requirement. In practice, CIANNA can run on a broad range of Nvidia GPU generations: Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper, and Blackwell. We successfully used CIANNA on a wide variety of GPUs (almost-exhaustive list): Quadro M1200, Quadro P2000m, Quadro P40, Quadro RTX 5000, Tesla T4, Tesla P100, Tesla V100, RTX A6000, RTX A5000, RTX 6000 Ada, Titan X, Titan Xp, Titan RTX, RTX 2080, RTX 3090, RTX 4090, H100, RTX 5090, ... (If you tested CIANNA on other GPUs, please send us some feedback.)
Note that several versions of CUDA can be installed on the same system as long as the most recent Nvidia driver is installed.
Mixed precision is an approach where most computations are done using a low-bit-count data format, combined with a higher-bit-count accumulator when necessary [usually FP32 (32-bit)]. Multiple low-precision formats are available depending on the total bit count and its distribution between mantissa and exponent [TF32 (19-bit), FP16, BF16, FP8, FP4, INT8, INT4, ...]. Master weights are generally kept at high precision with a high bit count so they remain sensitive to fine-grained updates, while reduced-precision copies of the weights are used for computation in the forward path (detailed information on the Nvidia website).
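As a rough CPU-side illustration of why the accumulator matters (using NumPy; real mixed precision runs on GPU hardware), the same FP16 products can be summed either in FP16 or in an FP32 accumulator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights and activations stored in a reduced-precision format (FP16 here)
w = rng.standard_normal(4096).astype(np.float16)
x = rng.standard_normal(4096).astype(np.float16)

products = w * x  # element-wise products, computed and rounded in FP16

# Pure FP16: every partial sum is rounded back to 16 bits
acc16 = np.float16(0.0)
for p in products:
    acc16 = np.float16(acc16 + p)

# Mixed precision: same FP16 products, but a 32-bit accumulator
acc32 = np.float32(0.0)
for p in products:
    acc32 += np.float32(p)

print(acc16, acc32)  # the FP32 accumulator loses far less precision
```

Comparing both results against a float64 reference sum shows the FP32 accumulator staying much closer, which is exactly the role it plays in mixed-precision training.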
Mixed precision reduces memory-bandwidth pressure, reduces (typically halves) VRAM usage, speeds up computations, and enables the use of dedicated hardware Tensor Cores (TC) for accelerated matrix multiplications (WMMA). There are presently three main types of mixed-precision support depending on the GPU architecture generation:
Detailed CIANNA data type support per generation:
In CIANNA, the mixed-precision mode is selected through the mixed_precision argument of the init_network function, with the syntax "StoreComputeTypeC_AccumulatorTypeA". For the sake of simplicity, two overloaded modes, "off" and "on", correspond to "FP32C_FP32A" and "FP16C_FP32A" respectively.
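The mode-string convention can be sketched with a tiny helper (hypothetical, for illustration only; CIANNA itself takes the final string directly):

```python
def mixed_precision_mode(store_compute: str, accumulator: str) -> str:
    """Assemble a CIANNA-style mixed-precision mode string.

    Hypothetical helper illustrating the "StoreComputeTypeC_AccumulatorTypeA"
    naming scheme; not part of the CIANNA API.
    """
    return f"{store_compute}C_{accumulator}A"

# The two shorthand modes map onto explicit strings
aliases = {"off": "FP32C_FP32A", "on": "FP16C_FP32A"}

print(mixed_precision_mode("FP16", "FP32"))  # FP16C_FP32A
print(aliases["on"])                         # FP16C_FP32A
```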
Example: "FP16C_FP32A" stores data in FP16 format and performs FP16 computations with an FP32 accumulator. Classical FP32 inference can be performed on a network trained with mixed precision since the master weights are stored in FP32. (For the best training results, we recommend using a low-bit type that has the same numerical range as FP32, namely TF32 or BF16.)
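NumPy has no native BF16 or TF32 types, but the range argument can still be illustrated with FP16 alone: FP16 tops out at 65504, while TF32 and BF16 keep FP32's 8-bit exponent and therefore its full numerical range. A minimal sketch:

```python
import numpy as np

# FP16 overflows above 65504, so large activations or gradients become inf.
# TF32 and BF16 keep FP32's 8-bit exponent, so their range matches FP32's
# (~3.4e38) and such overflows do not occur (NumPy has no BF16 to show this).
print(np.finfo(np.float16).max)   # 65504.0
print(np.finfo(np.float32).max)   # ~3.4e38

big = np.float32(1.0e5)           # representable in FP32
assert np.isfinite(big)
assert np.isinf(np.float16(big))  # overflows to inf in FP16
```

This is why a value that is perfectly ordinary in FP32 training can silently turn into inf under FP16 storage, and why range-preserving types are preferred for training.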