Tuesday, September 27, 2016

Nvidia's Tesla P4 And P40 GPUs Boost Deep Learning Inference Performance With INT8, TensorRT Support


Nvidia Tesla P40

Nvidia continues to double down on deep learning GPUs with the release of two new “inference” GPUs, the Tesla P4 and the Tesla P40. The pair are the 16nm FinFET direct successors to the Tesla M4 and M40, with much improved performance and support for 8-bit integer (INT8) operations.
Deep learning consists of two steps: training and inference. Training can require billions of trillions of floating-point operations and takes days to reach the expected result, even on GPUs. Inference, which is running the trained model against new data, takes on the order of billions of operations per query and can be done in real time.
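To make the distinction concrete, here is a purely illustrative NumPy sketch (a toy logistic-regression model, nothing resembling Nvidia's software): training loops over a dataset many times to fit the weights, while inference is a single forward pass over new data.

import numpy as np

# Toy logistic-regression "model" -- purely illustrative, not a deep network.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                     # training data
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training: many passes over the whole dataset, updating the weights each time.
w = np.zeros(64)
for epoch in range(200):
    p = sigmoid(X @ w)
    w -= 0.1 * (X.T @ (p - y)) / len(y)

# Inference: a single forward pass over new data, ideally in real time.
x_new = rng.normal(size=(1, 64))
print("prediction:", sigmoid(x_new @ w))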
The two steps in the deep learning process require different levels of performance, but also different features. This is why Nvidia is now releasing the Tesla P4 and P40, which are optimized specifically for running inference engines, such as Nvidia's recently launched TensorRT.
Unlike the Pascal-based Tesla P100, which supports the already quite low 16-bit floating-point (FP16) precision, the two new GPUs add support for the even lower 8-bit integer (INT8) precision. Researchers have found that deep learning training doesn't need especially high numerical precision.
Training converges to the expected results significantly faster if you process twice as much data at half the precision. And because inference only runs an already-trained model rather than updating it, it can tolerate even lower precision than training, which is why Nvidia's new cards add support for INT8 operations.
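As a rough sketch of why INT8 can work at all, here is a hypothetical example of the simplest form of quantization (symmetric, per-tensor scaling). Real toolchains such as TensorRT use more sophisticated calibration, so treat this purely as an illustration of the idea:

import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: one float scale maps FP32 values to int8.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4096).astype(np.float32)   # pretend these are trained weights
q, scale = quantize_int8(weights)
error = np.abs(dequantize(q, scale) - weights).max()
print(f"scale = {scale:.5f}, max quantization error = {error:.5f}")

For well-behaved, already-trained weights and activations, this small rounding error typically has little effect on the model's final predictions, which is what makes INT8 attractive for inference.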

Tesla P4

The Tesla P4 is the lower-end GPU of the two that were announced, and it's targeted at scale-out servers that need highly efficient GPUs. Each Tesla P4 uses between 50W and 75W of power, for a peak performance of 5.5 FP32 TeraFLOP/s and 21.8 INT8 TOP/s (tera-operations per second).
Nvidia compared its Tesla P4 GPU to an Intel Xeon E5 general-purpose CPU and claimed that the P4 is up to 40x more efficient on the AlexNet image classification test. The company also claimed that the Tesla P4 is 8x more efficient than an Arria 10-115 FPGA (made by Altera, which Intel acquired).

Tesla P40

The Tesla P40 was designed for scale-up servers, where performance matters most. Thanks to improvements in the Pascal architecture as well as the jump from the 28nm planar process to a 16nm FinFET process, Nvidia claimed that the P40 is up to 4x faster than its predecessor, the Tesla M40.
The P40 GPU has a peak performance of 12 FP32 TeraFLOP/s and 47 INT8 TOP/s, so it's roughly twice as fast as its little brother, the Tesla P4. The Tesla P40 has a maximum power consumption of 250W.
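As a side note, the quoted peak figures for both cards imply an INT8-to-FP32 throughput ratio of roughly 4:1, consistent with the GPU processing four 8-bit values where it would otherwise process one 32-bit value. A quick check of the arithmetic, using only the numbers above:

# Quick arithmetic check on the quoted peak figures: on both cards the INT8
# rate is roughly four times the FP32 rate.
specs = {"Tesla P4": (5.5, 21.8), "Tesla P40": (12.0, 47.0)}   # (FP32 TFLOP/s, INT8 TOP/s)
for name, (fp32, int8) in specs.items():
    print(f"{name}: INT8/FP32 ratio = {int8 / fp32:.2f}x")     # ~3.9x-4.0x for both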

TensorRT

Nvidia also announced the TensorRT GPU inference engine that doubles the performance compared to previous cuDNN-based software tools for Nvidia GPUs. The new engine also has support for INT8 operations, so Nvidia’s new Tesla P4 and P40 will be able to work at maximum efficiency from day one.
In the graph below, Nvidia compared the performance of the Tesla P4 and P40 GPUs running the TensorRT inference engine to a 14-core Intel Xeon E5-2690 v4 running Intel's optimized version of the Caffe neural network framework. According to Nvidia's results, the Tesla P40 comes out up to 45x faster than Intel's CPU here.
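Nvidia hasn't published TensorRT's internals, but conceptually an INT8 inference path quantizes the inputs, performs the matrix multiplies in 8-bit with 32-bit accumulation, and rescales the result back to floating point. A hypothetical NumPy sketch of that idea (not TensorRT's actual API or algorithm):

import numpy as np

def int8_matmul(a_fp32, b_fp32):
    # Illustrative INT8 GEMM: quantize both operands, multiply with 32-bit
    # accumulation, then rescale the result back to FP32.
    sa = np.abs(a_fp32).max() / 127.0
    sb = np.abs(b_fp32).max() / 127.0
    qa = np.clip(np.round(a_fp32 / sa), -127, 127).astype(np.int8)
    qb = np.clip(np.round(b_fp32 / sb), -127, 127).astype(np.int8)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

a = np.random.randn(32, 256).astype(np.float32)    # a batch of activations
b = np.random.randn(256, 128).astype(np.float32)   # a layer's weights
print("max abs error vs FP32:", np.abs(int8_matmul(a, b) - a @ b).max())

The appeal of doing it this way is that the bulk of the arithmetic runs at the GPU's much higher INT8 rate, with only the cheap scaling steps left in floating point.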
So far Nvidia has been comparing its GPUs only to Intel's general-purpose CPUs, but Intel's main product for deep learning is now the Xeon Phi line of “many-core” (Atom-based) accelerators.
Nvidia's GPUs likely still beat those chips by a healthy margin, thanks to the inherent advantage GPUs have even over many-core CPUs for such low-precision operations. However, at this point, comparing Xeon Phi against Nvidia's GPUs would be a more realistic scenario in terms of what customers are actually looking to buy for deep learning applications.
