Inference-time benchmark comparing CIANNA, TensorFlow, and PyTorch on two architectures (LeNet5, darknet19).
[Up to date with V-1.0.0.0 release]
Main performance-related observations regarding CIANNA:
On par with other widely adopted frameworks for raw compute performance.
Takes a significant lead in latency-dominated workloads (small networks).
Scales well with the batch size on small systems or when using the CPU compute method.
Perfectly usable for real-time applications on light systems (e.g., YOLO detection connected to a camera feed at a high framerate).
For a performance comparison between the available compute methods with CIANNA on a given light system, refer to the step-by-step Ubuntu installation guide.
Inference performance benchmark (06/2025)
Configuration: Free Google Colab environment from June 2025, using an Nvidia T4 GPU (Turing, 2560 CUDA cores, 16 GB of memory, peak FP32: 8.141 TFLOPS, FP16: 65.13 TFLOPS). The framework versions used are: CIANNA 1.0 (CUDA 12.5), TensorFlow 2.17 (CUDA 12.5), PyTorch 2.6.0 (CUDA 12.4). All networks are evaluated for inference only, but most observations remain valid for training. Network architectures and notebooks are provided at the end of the page.
Observations
CIANNA strongly outperforms both TF and Torch for small networks and small batch sizes (latency-dominated regime).
For large batch sizes (2048 and above), CIANNA with FP32 compute gets slower than TF and Torch, but it remains the fastest with FP16 compute up to a batch size of 2048, where it matches TF's FP16 performance.
All frameworks scale similarly with the input size.
Remarks about the benchmark
We acknowledge that the Google Colab environment is far from optimal for benchmarking, as performance can vary significantly across reruns or depending on the time of day. It still offers a relatively reproducible configuration that anyone can access to reproduce the benchmark. Each measurement corresponds to the maximum performance achieved over multiple runs, accounting for warm-up and variability. All tests were conducted within a relatively short time window to minimize the variability of the global load on Colab servers.
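As a minimal sketch of this protocol (warm-up runs discarded, best throughput kept over several timed runs), assuming a placeholder pure-Python workload in place of a real framework inference pass:

```python
import time

def best_throughput(workload, n_items, n_warmup=2, n_runs=5):
    """Return the best items-per-second over n_runs, after n_warmup warm-up runs."""
    for _ in range(n_warmup):
        workload()  # warm-up: kernel caches, allocator, clock ramp-up, ...
    best_ips = 0.0
    for _ in range(n_runs):
        t0 = time.perf_counter()
        workload()
        elapsed = time.perf_counter() - t0
        best_ips = max(best_ips, n_items / elapsed)  # keep the maximum, as in the benchmark
    return best_ips

# Placeholder standing in for "one inference pass over the fixed-size test dataset"
def dummy_inference(dataset_size=10000):
    s = 0.0
    for i in range(dataset_size):
        s += i * 0.5
    return s

ips = best_throughput(dummy_inference, n_items=10000)
print(f"best throughput: {ips:.0f} items/s")
```

In the actual benchmark, `workload` would be a full CIANNA, TF, or Torch inference over the test set; the reported IPS is the best value observed across runs.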
Comparing frameworks is a challenging task, as they operate differently at the lower level. We designed our benchmark based on what can be measured during a typical CIANNA inference and tried to reproduce the same set of operations with TF and Torch. In practice, the performance measurement covers the following steps: data transformation into a framework-specific format, data transfer from the CPU to the GPU, and computation time over a fixed-size test dataset. When computing in FP16 mixed precision, we exclude the conversion of the original CPU-hosted dataset to FP16. However, converting the data to an FP16 tensor and moving it from the CPU to the GPU is accounted for in the time measurement. These choices are debatable, but the current setup should be quite close to real-world inference performance on a large dataset for all frameworks.
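The timed region described above can be sketched as follows. The three step functions (`to_framework_format`, `host_to_device`, `run_inference`) are hypothetical stand-ins, not real framework calls; only the placement of the timer relative to the three steps reflects the benchmark setup:

```python
import time

def to_framework_format(batch):    # stand-in for e.g. an array -> tensor conversion
    return list(batch)

def host_to_device(tensor):        # stand-in for the CPU -> GPU copy
    return tensor

def run_inference(device_tensor):  # stand-in for the network forward pass
    return [x * 2 for x in device_tensor]

dataset = [[float(i)] * 8 for i in range(1024)]  # fixed-size CPU-hosted test dataset
# For FP16 runs, an initial FP32 -> FP16 cast of this CPU dataset would happen
# here, BEFORE the timer starts; the FP16 tensor creation and its transfer to
# the GPU below would remain inside the timed region.

t0 = time.perf_counter()           # timing starts before any conversion
for batch in dataset:
    tensor = to_framework_format(batch)      # step 1: framework-specific format
    device_tensor = host_to_device(tensor)   # step 2: CPU -> GPU transfer
    out = run_inference(device_tensor)       # step 3: compute
elapsed = time.perf_counter() - t0
print(f"{len(dataset) / elapsed:.0f} inferences/s (all three steps timed)")
```

The same timed region is reproduced for CIANNA, TF, and Torch, so all frameworks pay for data conversion and transfer, not just raw compute.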
This benchmark was conducted without any form of JIT/XLA compilation or CUDA graph construction. Since it is difficult to decouple compilation from graph construction, and since CIANNA does not yet support CUDA graphs, including them would lead to an unfair comparison. Preliminary tests using compilation and graphs with Torch showed a speedup of around 5 to 10% at all batch sizes on the LeNet5 test. With TF, the improvement is much more striking, achieving a 100% speedup across all batch sizes and a maximum inference speed of 250,000 IPS in mixed-FP16 using batches of 2048. Still, CIANNA remains significantly faster for small batches (8 and 16). For the deeper darknet19-like architecture, we observe a 50% speedup for Torch at 224x224 resolution, but almost no improvement at higher resolutions. In contrast, for TF, the mixed-FP16 performance increases by 20% at both resolutions, but there is no improvement when using FP32. This again illustrates that not all frameworks face the same limitations and bottlenecks.