Learn from HPC How to Model ML Performance Beyond FLOPS, Params, MACS, or TOPS
You’ve invented a brand new model architecture and you want to show it’s the fastest architecture on the block.
Do you evaluate performance using FLOPS? MACS? Param counts? TOPS?
It turns out none of these by themselves sufficiently model performance, but fortunately the HPC community has solved this problem with a robust performance model for parallel processing!
This basic equation models total time as the sum of communication time (from network, disk, memory, cache, and register movement) and computation time (arithmetic operations):
\[total\_time = communication\_time + computation\_time\]
For a neural network, we can model these terms as:
\[communication\_overhead = n\_params * param\_size + data\_between\_layers\]
\[communication\_time \approx \dfrac{communication\_overhead}{memory\_bandwidth}\]
\[computation\_overhead = n\_ops\]
\[computation\_time \approx \dfrac{computation\_overhead}{ops\_throughput}\]
Data sent between layers is the sum of each layer output's tensor volume multiplied by the element size. In many cases that volume is simply the number of activations, but not all transferred data comes from activations (e.g. residual connections, concat operations, unfused bias adds, projection operations, etc.).
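As a concrete sketch, here is how those two overheads could be tallied in Python for a hypothetical three-layer MLP. The layer shapes, the fp16 element size, and the 2-ops-per-MAC convention are all assumptions for illustration, not part of the model itself:

```python
# Tally communication_overhead and computation_overhead for a hypothetical MLP.
ELEMENT_SIZE = 2  # bytes per element, assuming fp16 weights and activations
BATCH = 1

# Hypothetical fully connected layers: (fan_in, fan_out)
layers = [(784, 512), (512, 512), (512, 10)]

n_params = sum(fan_in * fan_out + fan_out for fan_in, fan_out in layers)  # weights + biases
n_ops = sum(2 * BATCH * fan_in * fan_out for fan_in, fan_out in layers)   # 1 MAC counted as 2 ops

# Bytes moved for each layer's output tensor (activations, concats, residuals, ...)
data_between_layers = sum(BATCH * fan_out * ELEMENT_SIZE for _, fan_out in layers)

communication_overhead = n_params * ELEMENT_SIZE + data_between_layers  # bytes
computation_overhead = n_ops                                            # ops

print(f"communication_overhead: {communication_overhead:,} bytes")
print(f"computation_overhead:   {computation_overhead:,} ops")
```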
This means the total time for a network is a function of memory bandwidth and arithmetic throughput. Estimating total time (inference latency) depends on a weighting between communication time and computation time. The exact weighting varies with the memory_bandwidth and ops_throughput of the accelerator architecture and can be derived either from first principles or estimated from observations.
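A minimal estimator might look like the sketch below; the accelerator constants are made up for illustration and are not real device specs:

```python
def estimate_latency_s(communication_overhead_bytes, computation_overhead_ops,
                       memory_bandwidth_bytes_per_s, ops_throughput_ops_per_s):
    """Model inference latency as communication_time + computation_time."""
    communication_time = communication_overhead_bytes / memory_bandwidth_bytes_per_s
    computation_time = computation_overhead_ops / ops_throughput_ops_per_s
    return communication_time + computation_time

# Hypothetical accelerator constants (illustrative only):
latency = estimate_latency_s(
    communication_overhead_bytes=5e6,      # 5 MB of weights + inter-layer data
    computation_overhead_ops=2e9,          # 2 GOPs
    memory_bandwidth_bytes_per_s=50e9,     # 50 GB/s
    ops_throughput_ops_per_s=10e12,        # 10 TOPS
)
print(f"estimated latency: {latency * 1e3:.3f} ms")
```

The sketch adds the two terms, matching the model above; hardware that overlaps data movement with compute would land closer to max(communication_time, computation_time), but the sum is the conservative bound.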
All major neural network accelerators have seen huge leaps in ops_throughput in recent years, making computation time relatively small. Reducing communication time is therefore more important now than ever. In neural networks this means shrinking network weights and the volume of tensor data sent between layers. This can be achieved with mechanisms like layer fusion and batch norm folding, but it also applies to low-arithmetic-intensity layers in general.
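As one concrete example, here is a minimal NumPy sketch of batch norm folding, which removes both the BatchNorm parameters and the intermediate tensor between the convolution and the BatchNorm at inference time. The function name and shapes are my own, and it assumes a conv with an explicit bias followed by an inference-mode BatchNorm:

```python
import numpy as np

def fold_batch_norm(conv_weight, conv_bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold an inference-mode BatchNorm into the preceding convolution.

    conv_weight: (out_channels, in_channels, kh, kw)
    conv_bias, gamma, beta, running_mean, running_var: (out_channels,)
    """
    scale = gamma / np.sqrt(running_var + eps)
    folded_weight = conv_weight * scale[:, None, None, None]
    folded_bias = (conv_bias - running_mean) * scale + beta
    # After folding, the BN parameters never need to be loaded and the BN output
    # tensor never needs to be written back out: pure communication savings.
    return folded_weight, folded_bias
```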
For tiny microprocessors and CPUs, compute time can still be a bottleneck. We hit exactly these compute bottlenecks with the small DSPs we were using at Whisper.ai. This HPC-inspired performance model, combining communication time and computation time, is flexible enough to describe accelerator bottlenecks for big GPUs and TPUs as well as tiny CPUs and DSPs.
Limitations
Like any model, this performance model can be wrong in some cases. For example, in TensorFlow Lite, if an op isn’t implemented by an accelerator, that layer can fall back to a completely different processor, incurring poor ops_throughput and even more communication time.
I started my career working in a computational neuroscience lab with spiking neural networks that effectively have 1-bit activations. 1-bit activations minimize both compute and communication. Unfortunately, the reality of communication within a chip means sending a single bit of information is fairly inefficient. We may need a new performance model one day 🤓
Accelerator-Oblivious Network Architectures
Exploring performance models like these could lead us to model architectures that are thrifty with both compute and communication time.
Today the trend seems to be kicking off a neural architecture search (NAS) for each accelerator architecture you want to use - how brutish!
Perhaps instead one day we will see network architectures with arithmetic intensity high enough to perform well on many accelerators, in the limit becoming “accelerator-oblivious” (in reference to “cache-oblivious” programming).
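One way to reason about this is arithmetic intensity, i.e. ops performed per byte moved. The shapes and byte counts below are rough, hypothetical fp16 numbers just to show the gap between a weight-reusing conv and a weight-hungry fully connected layer:

```python
def arithmetic_intensity(n_ops, bytes_moved):
    """Ops per byte of weights + activations moved."""
    return n_ops / bytes_moved

# Hypothetical fp16 layers (rough numbers for illustration):
# 3x3 conv, 64 -> 64 channels on a 56x56 feature map: each weight is reused
# across every spatial position, so ops vastly outnumber bytes moved.
conv_ops, conv_bytes = 231e6, 1.0e6
# 1024 -> 1024 fully connected layer: every weight is touched exactly once.
fc_ops, fc_bytes = 2.1e6, 2.1e6

print(f"conv intensity: {arithmetic_intensity(conv_ops, conv_bytes):.0f} ops/byte")
print(f"fc intensity:   {arithmetic_intensity(fc_ops, fc_bytes):.0f} ops/byte")
```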
In the meantime, start reporting communication and computation overheads:
\[communication\_overhead = n\_params * param\_size + data\_between\_layers\]
\[computation\_overhead = n\_ops\]
From these numbers, anyone can plug in their own constants for ops_throughput and memory_bandwidth to estimate inference latency and pick which architecture makes sense for a given accelerator.
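Here is a sketch of what that comparison looks like in practice; every architecture and accelerator number below is made up purely for illustration:

```python
# Reported overheads for two hypothetical architectures (bytes, ops).
architectures = {
    "arch_a": {"communication_overhead": 60e6, "computation_overhead": 0.5e9},
    "arch_b": {"communication_overhead": 5e6,  "computation_overhead": 4e9},
}

# Hypothetical accelerator constants (B/s, ops/s) -- not real device specs.
accelerators = {
    "big_gpu":  {"memory_bandwidth": 900e9, "ops_throughput": 100e12},
    "tiny_dsp": {"memory_bandwidth": 4e9,   "ops_throughput": 16e9},
}

for device, spec in accelerators.items():
    latencies = {
        name: arch["communication_overhead"] / spec["memory_bandwidth"]
              + arch["computation_overhead"] / spec["ops_throughput"]
        for name, arch in architectures.items()
    }
    best = min(latencies, key=latencies.get)
    summary = ", ".join(f"{name}: {t * 1e3:.2f} ms" for name, t in latencies.items())
    print(f"{device}: {summary} -> pick {best}")
```

With these made-up numbers, the bandwidth-rich GPU favors the communication-light arch_b, while the compute-starved DSP favors the op-light arch_a - exactly the kind of trade-off the two reported overheads make visible.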