[Up to date with V-1.0.0.0 release]
This page outlines both software and hardware requirements and how they relate to each other. The naive (NAIV) compute method should work on any system with a compatible C compiler; it worked on every system we tried. Despite its high portability, this method is very slow and should only be used for very light network architectures or for pedagogical purposes. Most of the NAIV functions support multi-threading through OpenMP.
The BLAS compute method relies on the OpenBLAS library and hence shares its requirements. Most of the non-BLAS functions support multi-threading through OpenMP. We recommend compiling OpenBLAS with the USE_OPENMP=1 flag for the best performance with CIANNA.
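Since both the OpenMP-parallelized CIANNA functions and an OpenBLAS built with USE_OPENMP=1 use the OpenMP runtime, the thread count can be controlled through the standard OMP_NUM_THREADS environment variable. A minimal sketch (the variable must be set before the library that uses OpenMP is loaded):

```python
import os

# OMP_NUM_THREADS is the standard OpenMP environment variable. When OpenBLAS
# is compiled with USE_OPENMP=1, BLAS calls and the other OpenMP-parallelized
# functions share the same thread pool, so a single variable controls both.
# It must be set before the library is loaded by the process.
os.environ["OMP_NUM_THREADS"] = "8"

print(os.environ["OMP_NUM_THREADS"])
```

Exporting the variable in the shell before launching the process works identically.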
Recent OpenBLAS versions should support the most recent Intel, AMD (including Zen), and ARM-based CPUs. It worked flawlessly on every system we tried (non-exhaustive list):
Currently, CIANNA supports Nvidia GPUs exclusively, through CUDA (we plan to add a HIP version, with no ETA). It also uses only one GPU at a time; multi-GPU support may be added at some point (again, no ETA). Multiple GPUs can still be used through independent instances of CIANNA, for example to explore the hyperparameter space. For inference, since predictions are independent, items can be split across any number of CIANNA instances running on GPUs in parallel.
The minimum CUDA version is set to 9.0 since this is the first version to support the cuBLAS GemmEx operations (which is also why Kepler GPUs, sm_30 and sm_35, are not supported). We strongly recommend CUDA 10+, and the latest release is usually the best choice (currently 13.1). The two main reasons for the absence of support for older versions are:
We emphasize that this is solely a software version requirement. In practice, CIANNA can run on a broad range of Nvidia GPU generations: Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper, and Blackwell. We successfully used CIANNA on a wide variety of GPUs (almost-exhaustive list): Quadro M1200, Quadro P2000m, Quadro P40, Quadro RTX 5000, Tesla T4, Tesla P100, Tesla V100, RTX A6000, RTX A5000, RTX 6000 Ada, Titan X, Titan Xp, Titan RTX, RTX 2080, RTX 3090, RTX 4090, H100, RTX 5090, ... (If you tested CIANNA on other GPUs, please send us some feedback.)
Note that several versions of CUDA can be installed on the same system as long as the most recent Nvidia driver is installed.
Mixed precision is an approach where most computations are done using a low-bit-count data format, combined with a higher-bit-count accumulator when necessary [usually FP32 (32-bit)]. Multiple low-precision formats are available depending on the total bit count and its distribution between mantissa and exponent [TF32 (19-bit), FP16, BF16, FP8, FP4, INT8, INT4, ...]. Master weights are generally kept at high precision with a high bit count so they remain sensitive to fine-grained updates, while reduced-precision copies of the weights are used for computation in the forward path (detailed information on the Nvidia website).
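As a rough CPU-side illustration of why the accumulator matters (using NumPy; real mixed precision runs on GPU hardware), the same FP16 products can be summed either in FP16 or in an FP32 accumulator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights and activations stored in a reduced-precision format (FP16 here)
w = rng.standard_normal(4096).astype(np.float16)
x = rng.standard_normal(4096).astype(np.float16)

products = w * x  # element-wise products, computed and rounded in FP16

# Pure FP16: every partial sum is rounded back to 16 bits
acc16 = np.float16(0.0)
for p in products:
    acc16 = np.float16(acc16 + p)

# Mixed precision: same FP16 products, but a 32-bit accumulator
acc32 = np.float32(0.0)
for p in products:
    acc32 += np.float32(p)

print(acc16, acc32)  # the FP32 accumulator loses far less precision
```

Comparing both results against a float64 reference sum shows the FP32 accumulator staying much closer, which is exactly the role it plays in mixed-precision training.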
Mixed precision reduces memory-bandwidth pressure, reduces (typically halves) VRAM usage, speeds up computations, and enables the use of dedicated hardware Tensor Cores (TC) for accelerated matrix multiplications (WMMA). There are presently three main types of mixed-precision support depending on the GPU architecture generation:
Detailed CIANNA data type support per generation:
In CIANNA, the mixed-precision mode is selected through the mixed_precision argument of the init_network function, with the syntax "StoreComputeTypeC_AccumulatorTypeA". For the sake of simplicity, two overloaded modes, "off" and "on", correspond to "FP32C_FP32A" and "FP16C_FP32A" respectively.
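The mode-string convention can be sketched with a tiny helper (hypothetical, for illustration only; CIANNA itself takes the final string directly):

```python
def mixed_precision_mode(store_compute: str, accumulator: str) -> str:
    """Assemble a CIANNA-style mixed-precision mode string.

    Hypothetical helper illustrating the "StoreComputeTypeC_AccumulatorTypeA"
    naming scheme; not part of the CIANNA API.
    """
    return f"{store_compute}C_{accumulator}A"

# The two shorthand modes map onto explicit strings
aliases = {"off": "FP32C_FP32A", "on": "FP16C_FP32A"}

print(mixed_precision_mode("FP16", "FP32"))  # FP16C_FP32A
print(aliases["on"])                         # FP16C_FP32A
```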
Example: "FP16C_FP32A" stores data in FP16 format and performs FP16 computations with an FP32 accumulator. Classical FP32 inference can be performed on a network trained with mixed precision since the master weights are stored in FP32. (For the best training results, we recommend using a low-bit type that has the same numerical range as FP32, namely TF32 or BF16.)
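NumPy has no native BF16 or TF32 types, but the range argument can still be illustrated with FP16 alone: FP16 tops out at 65504, while TF32 and BF16 keep FP32's 8-bit exponent and therefore its full numerical range. A minimal sketch:

```python
import numpy as np

# FP16 overflows above 65504, so large activations or gradients become inf.
# TF32 and BF16 keep FP32's 8-bit exponent, so their range matches FP32's
# (~3.4e38) and such overflows do not occur (NumPy has no BF16 to show this).
print(np.finfo(np.float16).max)   # 65504.0
print(np.finfo(np.float32).max)   # ~3.4e38

big = np.float32(1.0e5)           # representable in FP32
assert np.isfinite(big)
assert np.isinf(np.float16(big))  # overflows to inf in FP16
```

This is why a value that is perfectly ordinary in FP32 training can silently turn into inf under FP16 storage, and why range-preserving types are preferred for training.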