There's only one way to save energy in edge computation in the real world: perform just the essential computations efficiently, and nothing else. This requires thinking differently from the standard always-on, process-everything approach. The key is sparsity.
Sparsity is the idea that changes in the real world don’t happen everywhere, or all at once. By identifying where the changes happen and computing only the effects and consequences of those changes, we can save up to 95% of the power normally used in processing.
With the right architecture, we can additively exploit four kinds of sparsity to save time and energy in computation.
Deep neural networks have lots of connections, but they are not all equal: usually a small number of links carries the core of the computation. By identifying those key connections and processing only the corresponding parts of the network, we avoid large numbers of unnecessary computations.
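As a rough sketch of the idea (not GML's actual implementation), connection sparsity can be illustrated by magnitude pruning in NumPy: keep only the largest-magnitude weights of a layer and drop the rest. The layer size and the 90% pruning ratio below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense layer: a few large weights carry most of the computation.
weights = rng.normal(size=(256, 256))

# Connection sparsity: keep only the largest-magnitude 10% of weights.
threshold = np.quantile(np.abs(weights), 0.90)
mask = np.abs(weights) >= threshold
pruned = weights * mask

kept = mask.sum() / mask.size
print(f"fraction of connections kept: {kept:.2f}")
```

With hardware that skips zero weights, the pruned layer then costs roughly a tenth of the multiply-accumulates of the dense one, usually with only a small accuracy loss after fine-tuning.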
We need lots of pixels to give us the high resolution required for truly intelligent vision. However, we don't need, or use, that high resolution everywhere in the image – only in the small zones where fine detail is required. By ignoring the massive number of pixels that tell us nothing new, we reduce the amount of computation by a huge factor.
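A minimal sketch of this spatial sparsity, under assumed numbers (a 512×512 frame, 32×32 tiles, and local variance as a stand-in detail measure): only the few tiles containing fine detail would be sent on to the expensive processing stage.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical frame: flat background with one small zone of fine detail.
frame = np.zeros((512, 512))
frame[100:164, 200:264] = rng.normal(size=(64, 64))

TILE = 32
rows, cols = frame.shape[0] // TILE, frame.shape[1] // TILE
tiles = frame.reshape(rows, TILE, cols, TILE).swapaxes(1, 2)

# Spatial sparsity: process only tiles whose local variance indicates detail.
detail = tiles.var(axis=(2, 3))
active = detail > 1e-6
print(f"tiles processed: {active.sum()} of {active.size}")
```

Here only 9 of 256 tiles contain any detail, so a detail-driven pipeline touches well under 5% of the frame.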
A smart doorbell, even after being woken by a smart trigger, spends more than 95% of its time with nothing to do and no one new to look at. By recognizing this, and not computing when nothing new is happening, we save an enormous amount of power – or spend the savings to infer blindingly fast.
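The same temporal sparsity can be sketched in a few lines (a toy frame stream and change threshold, both assumptions for illustration): inference runs only on frames that actually differ from the previous one.

```python
import numpy as np

# Hypothetical frame stream: long idle stretches, one brief visitor.
frames = [np.zeros((8, 8)) for _ in range(100)]
frames[40] = np.ones((8, 8))  # someone appears...
frames[41] = np.ones((8, 8))  # ...and stays for one more frame

CHANGE_THRESHOLD = 0.01
processed = 0
prev = frames[0]
for frame in frames[1:]:
    # Temporal sparsity: run the network only when the scene changed.
    if np.abs(frame - prev).mean() > CHANGE_THRESHOLD:
        processed += 1  # the expensive inference would run here
    prev = frame
print(f"frames processed: {processed} of {len(frames) - 1}")
```

Only 2 of 99 frames trigger computation here: the one where the visitor appears and the one where the scene returns to empty.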
For any single deep neural network decision, only about 40% of the neurons actually “fire”, i.e. produce a non-zero output. A zero output cannot affect the rest of the network, so we skip it entirely – saving another 60% of energy by avoiding irrelevant computation.
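A small NumPy sketch of activation sparsity (layer sizes are illustrative assumptions): after a ReLU, only the non-zero activations contribute to the next layer, so the corresponding rows of the weight matrix are the only ones that need to be read and multiplied.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical layer: ReLU zeroes out roughly half the activations.
activations = np.maximum(rng.normal(size=512), 0.0)
next_weights = rng.normal(size=(512, 256))

# Activation sparsity: only non-zero activations contribute downstream.
nonzero = activations != 0.0
output = activations[nonzero] @ next_weights[nonzero]

# The result matches the dense computation, with ~half the work skipped.
dense = activations @ next_weights
assert np.allclose(output, dense)
print(f"active fraction: {nonzero.mean():.2f}")
```

The gathered computation gives bit-for-bit the same answer as the dense one; the saving comes purely from never touching the rows that a zero activation would have multiplied away.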
GML hardware combines two of the most exciting recent developments in computer architecture – neuromorphic engineering and dataflow computation – to implement some of the most efficient digital hardware ever developed.
A key feature of GML chips is that computation happens in the same block of silicon where weights and data are stored – computation near memory, at 16-bit floating-point precision. Very little power or time is wasted moving data to the compute units, which means highly accurate, low-power inference for your endpoint devices.