There's only one way to save energy in real-world edge computation: perform just the essential computations efficiently, and nothing else. This requires a different way of thinking from the standard approach. The key is sparsity.
Sparsity is the idea that changes in the real world don't happen everywhere, or all at once. By identifying where the changes happen and computing only the consequences of those changes, we can save up to 95% of the power normally used in processing.
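As a concrete illustration of computing only the consequences of a change, consider a single linear layer. The sketch below is a minimal example under assumed names (none of this is GML's API): because the layer is linear, an input change touches only the weight columns where something actually moved, so the output can be updated incrementally rather than recomputed from scratch.

```python
# Minimal sketch of change-driven (delta) computation for a linear layer.
# All names here are illustrative assumptions, not GML's API.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024)).astype(np.float16)  # layer weights

def delta_update(y_prev, x_prev, x_new, threshold=1e-3):
    """Update y = W @ x by propagating only the inputs that changed."""
    dx = x_new - x_prev
    changed = np.abs(dx) > threshold        # where the world actually moved
    # Touch only the columns of W whose inputs changed.
    return y_prev + W[:, changed] @ dx[changed], int(changed.sum())

x0 = rng.standard_normal(1024).astype(np.float16)
y0 = W @ x0                                  # one full pass at start-up
x1 = x0.copy()
x1[:16] += 0.5                               # only 16 of 1024 inputs change
y1, n = delta_update(y0, x0, x1)
print(f"recomputed {n}/1024 columns")        # ~98% of the multiplies skipped
```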
With the right architecture, we can additively exploit four kinds of sparsity to save time and energy in computation.
One kind is sparsity of connections. Deep neural networks have enormous numbers of connections, but they are not all equal; usually a small fraction of the links carries the core of the computation. By identifying those key connections and processing only that part of the network, we avoid large numbers of unnecessary computations.
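A rough sketch of how that looks in software (magnitude pruning stands in here for whatever selection method a real toolchain uses; the names are illustrative): keep only the largest weights, store them in a sparse format, and the zeroed connections are never touched at inference time.

```python
# Minimal sketch of connection (weight) sparsity via magnitude pruning.
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(1)
W = rng.standard_normal((512, 512)).astype(np.float32)

def prune_by_magnitude(W, keep=0.10):
    """Zero out all but the largest `keep` fraction of weights."""
    cutoff = np.quantile(np.abs(W), 1.0 - keep)
    return W * (np.abs(W) >= cutoff)

W_pruned = csr_matrix(prune_by_magnitude(W, keep=0.10))
x = rng.standard_normal(512).astype(np.float32)

y_dense = W @ x          # 512 x 512 multiply-accumulates
y_pruned = W_pruned @ x  # ~10% of that: only stored weights are used
print(f"connections kept: {W_pruned.nnz}/{W.size}")
```

On a trained network the small-magnitude weights contribute little to the output, which is why magnitude pruning can remove most connections with modest accuracy loss; with the random weights above, the point is purely the compute saving.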
Another is sparsity in time. A smart doorbell, even after being woken by a smart trigger, spends more than 95% of its time with nothing to do and no one new to look at. By recognizing this, and not computing when nothing new is happening, we either save an enormous amount of power or infer blindingly fast.
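A minimal sketch of that duty cycle (the change detector, thresholds, and `run_network` are illustrative placeholders, not GML's implementation): a cheap frame-difference test gates the expensive network, so a static scene costs almost nothing.

```python
# Minimal sketch of temporal sparsity: infer only when the scene changes.
import numpy as np

def frame_changed(prev, frame, pixel_thresh=12, area_thresh=0.005):
    """Cheap change detector: did a meaningful fraction of pixels move?"""
    if prev is None:
        return True
    diff = np.abs(frame.astype(np.int16) - prev.astype(np.int16))
    return (diff > pixel_thresh).mean() > area_thresh

def run_pipeline(frames, run_network):
    prev, results, skipped = None, [], 0
    for frame in frames:
        if frame_changed(prev, frame):
            results.append(run_network(frame))  # pay for inference
        else:
            skipped += 1                        # nothing new: do nothing
        prev = frame
    return results, skipped

frames = [np.zeros((240, 320), dtype=np.uint8)] * 50   # a static scene
_, skipped = run_pipeline(frames, run_network=lambda f: "person?")
print(f"skipped {skipped}/50 frames")                  # 49 of 50 skipped
```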
GML hardware combines two of the most exciting recent developments in computer architecture, neuromorphic engineering and dataflow computation, to implement some of the most efficient digital hardware ever developed.
A key feature of GML chips is that computation happens in the same block of silicon where the weights and data are stored: computation near memory, with 16-bit floating-point precision. Very little power and time are wasted bringing the data and the computation together, resulting in highly accurate, low-power inference for your endpoint devices.
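As a quick sanity check on the 16-bit claim (this models only the arithmetic precision, not the near-memory silicon): casting a layer to fp16 typically shifts its output by well under one percent relative to fp32.

```python
# Rough check of 16-bit floating-point precision for a single layer.
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((256, 256)).astype(np.float32) * 0.05
x = rng.standard_normal(256).astype(np.float32)

y32 = W @ x                                               # fp32 reference
y16 = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)

rel_err = np.linalg.norm(y32 - y16) / np.linalg.norm(y32)
print(f"fp16 relative error: {rel_err:.2e}")              # typically ~1e-3
```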