Energetics of the brain and AI

Lawrence Krauss is not worried about AI risk (ht to Luke Muelhauser); while much of his complacency is based on a particular view of the trustworthiness and level of common sense exhibited by possible future AI that is pretty impossible to criticise, he makes a particular claim:

First, let’s make one thing clear. Even with the exponential growth in computer storage and processing power over the past 40 years, thinking computers will require a digital architecture that bears little resemblance to current computers, nor are they likely to become competitive with consciousness in the near term. A simple physics thought experiment supports this claim:

Given current power consumption by electronic computers, a computer with the storage and processing capability of the human mind would require in excess of 10 Terawatts of power, within a factor of two of the current power consumption of all of humanity. However, the human brain uses about 10 watts of power. This means a mismatch of a factor of 10^{12}, or a million million. Over the past decade the doubling time for Megaflops/watt has been about 3 years. Even assuming Moore’s Law continues unabated, this means it will take about 40 doubling times, or about 120 years, to reach a comparable power dissipation. Moreover, each doubling in efficiency requires a relatively radical change in technology, and it is extremely unlikely that 40 such doublings could be achieved without essentially changing the way computers compute.

This claim has several problems. First, there are few, if any, AI developers who think that we must stay with current architectures. Second, more importantly, the community concerned with superintelligence risk is generally agnostic about how soon smart AI could be developed: it doesn’t have to happen soon for us to have a tough problem in need of a solution, given how hard AI value alignment seems to be. And third, consciousness is likely irrelevant for instrumental intelligence; maybe the word is just used as a stand-in for some equally messy term like “mind”, “common sense” or “human intelligence”.

The interesting issue, however, is what energy requirements and computational power tell us about human and machine intelligence, and vice versa.

Computer and brain emulation energy use

I have earlier on this blog looked at the energy requirements of the Singularity. To sum up, current computers are energy hogs requiring 2.5 TW of power globally, with an average cost around 25 nJ per operation. More efficient processors are certainly possible (a lot of the current ones are old and suboptimal). For example, current GPUs consume about a hundred watts and have 10^{10} transistors, reaching performance in the 100 Gflops range, one nJ per flop. Koomey’s law states that the energy cost per operation halves every 1.57 years (not 3 years as Krauss says). So far computing capacity has grown at about the same pace as energy efficiency, making the two trends cancel each other. In the end, Landauer’s principle gives a lower bound of kT\ln(2) J per irreversible operation; one can circumvent this by using reversible or quantum computation, but there are costs to error correction – unless we use extremely slow and cold systems, in the current era computation will be energy-intensive.

I am not sure what brain model Krauss bases his estimate on, but 10 TW/25 nJ = 4\cdot 10^{20} operations per second (using slightly more efficient GPUs ups it to 10^{22} flops). Looking at the estimates of brain computational capacity in appendix A of my old roadmap, this is higher than most. The only estimate that seems to be in the same ballpark is (Thagard 2002), which argues that the number of computational elements in the brain is far greater than the number of neurons (possibly even individual protein molecules). This is a fairly strong claim, to say the least. Especially since current GPUs can do a somewhat credible job of end-to-end speech recognition and transcription: while that corresponds to a small part of a brain, it is hardly 10^{-11} of one.
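The division is easy to reproduce, using the figures quoted above (10 TW budget, 25 nJ per average operation, 1 nJ per GPU flop):

```python
power_w = 10e12           # Krauss's 10 TW figure
avg_cost_j = 25e-9        # ~25 nJ per operation on average current hardware
gpu_cost_j = 1e-9         # ~1 nJ per flop for a current GPU
print(f"{power_w / avg_cost_j:.0e} ops/s")    # 4e+20
print(f"{power_w / gpu_cost_j:.0e} flops")    # 1e+22
```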

Generally, assuming a certain number of operations per second in a brain and then calculating an energy cost will give you any answer you want. There are people who argue that what really matters is the tiny conscious bandwidth (maybe 40 bits/s or less) and that over a lifetime we may only learn a gigabit. I used 10^{22} to 10^{25} flops just to be on the safe side in one post. AIimpacts.org has collected several estimates, with a median estimate of 10^{18} flops. They have also argued in favor of using TEPS (traversed edges per second) rather than flops, suggesting around 10^{14} TEPS for a human brain – a level that is soon within reach of some systems.

(Lots of apples-to-oranges comparisons here, of course. A single processor operation may or may not correspond to a floating point operation, let alone to what a GPU does or a TEPS. But we are in the land of order-of-magnitude estimates.)

Brain energy use

We can turn things around: what does the energy use of human brains tell us about their computational capacity?

Ralph Merkle calculated back in 1989 that, given 10 watts of usable energy per human brain and a cost of 5\cdot 10^{-15} J per jump past a node of Ranvier, the brain can perform about 2\cdot 10^{15} such operations per second. He estimated this to be about equal to the number of synaptic operations, ending up with 10^{13}–10^{16} operations per second.

A calculation I overheard at a seminar by Karlheinz Meier argued that the brain uses 20 W of power; it has 100 billion neurons firing at about 1 Hz, using 10^{-10} J per action potential, plus 10^{15} synapses receiving signals at about 1 Hz, using 10^{-14} J per synaptic transmission. One can also build the estimate from the bottom up: about 10^9 ATP molecules are consumed per action potential and 10^5 per synaptic transmission, and at 10^{-19} J per ATP that gives 10^{-10} J per action potential and 10^{-14} J per synaptic transmission. Both approaches converge on the same rough numbers, which he used to argue that we need much better hardware scaling if we ever want to simulate at this level of detail.
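The bottom-up version of the estimate is a one-liner to check, using the counts quoted above:

```python
atp_j = 1e-19                 # energy per ATP molecule (order of magnitude)
e_spike = 1e9 * atp_j         # 10^9 ATP per action potential  -> 1e-10 J
e_syn = 1e5 * atp_j           # 10^5 ATP per synaptic event    -> 1e-14 J
# 1e11 neurons and 1e15 synapses, each active at ~1 Hz:
power = 1e11 * 1.0 * e_spike + 1e15 * 1.0 * e_syn
print(f"{power:.0f} W")       # 20 W, the brain's metabolic budget
```

Note how the neuron term and the synapse term each contribute 10 W, the equipartition mentioned above.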

Digging deeper into neural energetics, maintaining resting potentials in neurons and glia accounts for 28% and 10% of the total brain metabolic cost respectively, while the actual spiking activity is about 13% and transmitter release/recycling plus calcium movement about 1%. Note how this is not too far from the equipartition in Meier’s estimate. Total brain metabolism also constrains the neural firing rate: an average of more than 3.1 spikes per second per neuron would consume more energy than the brain normally uses (and this is likely an optimistic estimate). The brain simply cannot afford firing more than 1% of neurons at the same time, so it likely relies on rather sparse representations.

Unmyelinated axons require about 5 nJ/cm to transmit action potentials. In general, the brain gets around this cost through current optimization, myelinisation (which also speeds up transmission, at the price of increased error rate), and likely many clever coding strategies. Biology is clearly strongly energy constrained. In addition, cooling 20 W through a blood flow of 750-1000 ml/min is relatively tight, given that the arterial blood arrives already at body temperature.

20 W divided by 3\cdot 10^{-21} J (the Landauer limit kT\ln(2) at body temperature) suggests a limit of no more than about 7\cdot 10^{21} irreversible operations per second. While a huge number, it is just a few orders higher than many of the estimates we have been juggling so far. If we say these operations are distributed across 100 billion neurons (which is at least within an order of magnitude of the real number) we get about 70 billion operations per second per neuron; if we instead treat synapses (about 8000 per neuron) as the loci we get about 8 million operations per second per synapse.

Running the full Hodgkin-Huxley neural model at 1 ms resolution requires about 1200 flops, or 1.2 million flops per second of simulation. If we treat a synapse as a compartment (very reasonable IMHO), that per-synapse budget is just about 7 times the simulation cost: if the neural simulation had multiple digits of precision and erased a few of them per operation we would bump into the Landauer limit straight away. Synapses are actually fairly computationally efficient! At least at body temperature: cryogenically cooled computers could of course do way better. And as Izhikevich, the originator of the 1200 flops estimate, loves to point out, his model requires just 13 flops per millisecond: maybe we do not need to model the ion currents like HH to get the right behavior, and can suddenly shave off two orders of magnitude.
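A minimal sketch of both halves of this argument: the per-synapse Landauer budget (recomputing kT\ln 2 at 310 K directly), and one simulated second of Izhikevich's 13-flop neuron with his standard regular-spiking parameters and a constant input current:

```python
import math

# Landauer budget per synapse vs the cost of simulating one.
k = 1.380649e-23                        # Boltzmann constant, J/K
bit_j = k * 310 * math.log(2)           # ~3e-21 J per irreversible bit at 310 K
ops_budget = 20 / bit_j                 # whole-brain budget, ops/s
per_synapse = ops_budget / (1e11 * 8000)
print(f"{per_synapse:.1e} ops/s per synapse")
print(f"{per_synapse / 1.2e6:.1f}x the HH simulation cost")  # ~7x

# Izhikevich's simple model: two update rules plus a reset, 1 ms steps.
a, b, c, d, I = 0.02, 0.2, -65.0, 8.0, 10.0
v, u = -65.0, b * -65.0
spikes = 0
for _ in range(1000):                   # 1 s of simulated time
    v += 0.04 * v * v + 5 * v + 140 - u + I
    u += a * (b * v - u)
    if v >= 30.0:                       # spike: reset membrane, bump recovery
        v, u = c, u + d
        spikes += 1
print(spikes, "spikes in one simulated second")
```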

Information dissipation in neural networks

Just how much information is lost in neural processing?

A brain is a dynamical system changing internal state in a complicated way (let us ignore sensory inputs for the time being). If we start in a state somewhere within some predefined volume of state-space, over time the state will move to other states – and the initial uncertainty will grow. Eventually the possible volume we can find the state in will have doubled, and we will have lost one bit of information.

Things are a bit more complicated, since the dynamics can contract along some dimensions and diverge along others: this is described by the Lyapunov exponents. If the trajectory has exponent \lambda in some direction, nearby trajectories diverge like |x_a(t)-x_b(t)| \propto |x_a(0)-x_b(0)| e^{\lambda t} in that direction. In a dissipative dynamical system the sum of the exponents is negative: in total, trajectories move towards some attractor set. However, if at least one of the exponents is positive, this can be a strange attractor: trajectories endlessly approach it, yet locally diverge from each other and gradually mix. So if you can only measure with a fixed precision at some point in time, you cannot tell for certain where the trajectory was before (the contraction due to the negative exponents has thrown away information about the starting location), nor exactly where it will be on the attractor in the future (the positive exponents amplify your current uncertainty).

A measure of the information loss is the Kolmogorov-Sinai entropy, which is bounded by the sum of the positive Lyapunov exponents, K \leq \sum_{\lambda_i>0} \lambda_i (equality holds for Axiom A attractors). So if we calculate the KS-entropy of a neural system, we can estimate how much information is being thrown away per unit of time.
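A toy illustration (not a neural model): for the logistic map at r = 4 the single Lyapunov exponent is exactly \ln 2 nats, i.e. the map destroys one bit of initial-condition information per iteration. Estimating it numerically by averaging \log|f'(x)| along a trajectory:

```python
import math

# Lyapunov exponent of the logistic map x -> r x (1 - x),
# estimated as the trajectory average of log|f'(x)| = log|r (1 - 2x)|.
r, x = 4.0, 0.3
total, n = 0.0, 100_000
for _ in range(n):
    total += math.log(abs(r * (1 - 2 * x)))
    x = r * x * (1 - x)
lam = total / n
print(f"lambda = {lam:.3f} nats = {lam / math.log(2):.3f} bits/iteration")  # ~1 bit
```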

Monteforte and Wolf looked at one simple neural model, the theta-neuron (presentation). They found a KS-entropy of roughly 1 bit per neuron and spike over a fairly large range of parameters. Given the above estimates of about one spike per second per neuron, this gives an overall information loss of 10^{11} bits/s in the brain, which is 3\cdot 10^{-10} W at the Landauer limit – by this account, we are some 11 orders of magnitude away from thermodynamic perfection. In this picture each action potential corresponds to roughly one irreversible yes/no decision: a not too unreasonable claim.
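The arithmetic, with kT\ln 2 evaluated at body temperature:

```python
import math

k, T = 1.380649e-23, 310.0            # Boltzmann constant; body temperature
bits_per_s = 1e11                     # ~1 bit lost per spike, ~1 Hz per neuron
p_min = bits_per_s * k * T * math.log(2)
gap = math.log10(20 / p_min)          # orders of magnitude above the limit
print(f"{p_min:.1e} W minimum, {gap:.1f} orders below the brain's 20 W")  # ~11
```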

I began trying to estimate the entropy and Lyapunov exponents of the Izhikevich network to check for myself, but decided to leave this for another post. The reason is that calculating Lyapunov exponents from time series is a pretty delicate thing, especially when there is noise, and the KS-entropy is even more noise-sensitive. In research on EEG data (where people have looked at the dimension of chaotic attractors and their entropies to distinguish different mental states and epilepsy) an approximate entropy measure is used instead.

It is worth noticing that one can look at cognition as a system with a large-scale dynamics that has one entropy (corresponding to shifting between different high-level mental states) and a microscale dynamics with a different entropy (corresponding to the neural information processing). It is a safe bet that the biggest entropy costs are on the microscale (fast, numerous simple states) rather than the macroscale (slow, few but complex states).

Energy of AI

Where does this leave us with regard to the energy requirements of artificial intelligence?

Assuming the same amount of energy is needed for a human and machine to do a cognitive task is a mistake.

First, as the Izhikevich neuron demonstrates, it might be that judicious abstraction easily saves two orders of magnitude of computation/energy.

Special purpose hardware can also save one or two orders of magnitude; using general purpose processors for fixed computations is very inefficient. This is of course why GPUs are so useful for many things: in many cases you just want to perform the same action on many pieces of data rather than different actions on the same piece.

But more importantly, the level at which the task is implemented matters. Sorting or summing a list of a thousand elements is a fast computer operation that can be done in memory, but an hour-long task for a human: because of our mental architecture we need to represent the information in a far more redundant and slow way, not to mention perform individual actions on the seconds timescale. A computer sort uses a tight representation more like our low-level neural circuitry. I have no doubt one could string together biological neurons to perform a sort or sum operation quickly, but cognition happens on a higher, more general level of the system (intriguing speculations about idiot savants aside).

While we have reason to admire brains, they are also unable to perform certain very useful computations. In artificial neural networks we often employ non-local matrix operations like inversion to calculate optimal weights: these computations are not possible to perform locally in a distributed manner. Gradient descent algorithms such as backpropagation are unrealistic in a biological sense, but clearly very successful in deep learning. There is no shortage of papers describing various clever approximations that would allow a more biologically realistic system to perform similar operations – in fact, the brain may well be doing something like them – but artificial systems can perform them directly, and, by using low-level hardware intended for the purpose, very efficiently.

When a deep learning system learns object recognition in an afternoon it beats a human baby by many months. When it learns to do analogies from 1.6 billion text snippets it beats human children by years. Yes, these are small domains, yet they are domains that are very important for humans and would presumably develop as quickly as possible in us.

Biology has many advantages in robustness and versatility, not to mention energy efficiency. But it is also fundamentally limited by what can be built out of cells with a particular kind of metabolism, by the fact that organisms need to build themselves from the inside, and by the need to solve problems posed by a particular biospheric environment.


Unless one thinks the human way of thinking is the most optimal or most easily implementable way, we should expect de novo AI to make use of different, potentially very compressed and fast, processes. (Brain emulation makes sense if one either cannot figure out how else to do AI, or wants to copy extant brains for their properties.) Hence the cost of brain computation is merely an existence proof that systems of this effectiveness are possible – the same mental tasks could well be done by far less, or far more, efficient systems.

In the end, we may try to estimate fundamental energy costs of cognition to bound AI energy use. If human-like cognition takes a certain number of bit erasures per second, we would get some bound using Landauer (ignoring reversible computing, of course). But as the above discussion has shown, it may be that the actual computational cost needed is just that of the higher-level representations rather than billions of neural firings: until we actually understand intelligence we cannot say. And by that point the question is moot anyway.

Many people have the intuition that the cautious approach is always to state “things won’t work”. But this mixes up caution with conservatism (or even reaction). A better cautious approach is to recognize that “things may work”, and then start checking the possible consequences. If we want a reassuring constraint on why certain things cannot happen, it needs to be tighter than energy estimates.

Energy requirements of the singularity

After a recent lecture about the singularity I got asked about its energy requirements. It is a good question. As my inquirer pointed out, humanity uses more and more energy and it generally has an environmental cost. If it keeps on growing exponentially, something has to give. And if there is a real singularity, how do you handle infinite energy demands?

First I will look at current trends, then different models of the singularity.

I will not deal directly with environmental costs here. They are relative to some idea of a value of an environment, and there are many ways to approach that question.

Current trends

Current computers are energy hogs. General purpose computing currently consumes about one petawatt-hour per year, out of an entire world electricity production somewhere above 22 PWh. While large data centres may be the obvious culprits, the vast number of low-power devices may be an even more significant factor; up to 10% of our electricity use may be due to ICT.

Together they perform on the order of 10^{20} operations per second – approaching the zettaFLOPS range.

Koomey’s law states that the number of computations per joule of energy dissipated has been doubling approximately every 1.57 years. This might speed up as the pressure to make efficient computing for wearable devices and large data centres makes itself felt. Indeed, these days performance per watt is often more important than performance per dollar.

Meanwhile, general-purpose computing capacity has a growth rate of 58% per annum, doubling every 18 months. Since these trends cancel rather neatly, the overall energy need is not changing significantly.
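The near-cancellation is easy to check from the two quoted trends (58% capacity growth per annum against a 1.57-year efficiency doubling):

```python
capacity_growth = 1.58                # computations per year: +58%
efficiency_growth = 2 ** (1 / 1.57)   # Koomey: ops per joule doubles every 1.57 years
net_energy_growth = capacity_growth / efficiency_growth
print(f"net energy growth: {100 * (net_energy_growth - 1):.1f}% per year")  # ~1.6%
```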

The push for low-power computing may make computing greener, and it might also make other domains more efficient by moving tasks to the virtual world, making them efficient and allowing better resource allocation. On the other hand, as things become cheaper and more efficient usage tends to go up, sometimes outweighing the gain. Which trend wins out in the long run is hard to predict.

Semilog plot of global energy (all types) consumption over time.

Looking at the overall trends, energy use increases exponentially (though it has stayed at roughly the same per-capita level since the 1970s). In fact, plotting it on a semilog graph suggests that it is increasing faster than exponentially (otherwise it would be a straight line), presumably because population growth compounds with rising per-capita use over the earlier parts of the series. The best-fit exponential has a doubling time of 44.8 years.

Electricity use is also roughly exponential, with a doubling time of 19.3 years. So we might be shifting more and more to electricity, and computing might be taking over more and more of that.

Extrapolating wildly, we would need the total solar input on Earth in about 300 years and the total solar luminosity in 911 years. In about 1,613 years we would have used up the solar system’s mass energy. So, clearly, long before then these trends will break one way or another.
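These figures can be reproduced roughly by extrapolating the electricity trend; the current and target powers below are my assumed round numbers (~2.5 TW of average electric power today, 1.7\cdot 10^{17} W of solar input on Earth, 3.8\cdot 10^{26} W solar luminosity):

```python
import math

def years_until(target_w, current_w=2.5e12, doubling_years=19.3):
    """Years until an exponential trend reaches target_w."""
    return doubling_years * math.log2(target_w / current_w)

print(f"total solar input on Earth: {years_until(1.7e17):.0f} years")   # ~300
print(f"total solar luminosity:     {years_until(3.8e26):.0f} years")   # ~910
```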

Physics places a firm boundary due to the Landauer principle: in order to erase one bit of information, k T \ln(2) joules of energy have to be dissipated. Given current efficiency trends we will reach this limit around 2048.

The principle can be circumvented using reversible computation, either classical or quantum. But as I often like to point out, it still bites in the form of the need for error correction (erasing accidentally flipped bits) and formatting new computational resources (besides the work in turning raw materials into bits). We should hence expect a radical change in computation within a few decades, even if the cost per computation and second continues to fall exponentially.

What kind of singularity?

But how many joules of energy does a technological singularity actually need? It depends on what kind of singularity. In my own list of singularity meanings we have the following kinds:

A. Accelerating change
B. Self improving technology
C. Intelligence explosion
D. Emergence of superintelligence
E. Prediction horizon
F. Phase transition
G. Complexity disaster
H. Inflexion point
I. Infinite progress

Case A, acceleration, at first seems to imply increasing energy demands, but if efficiency grows faster they could of course go down.

Eric Chaisson has argued that energy rate density – how fast and densely energy gets used, in watts per kilogram – might be an indicator of complexity, growing according to a universal tendency. By this account, we should expect the singularity to have an extreme energy rate density – but it does not have to use enormous amounts of energy if it is very small and light.

He suggests energy rate density may increase as Moore’s law, at least in our current technological setting. If we assume this to be true, then we would have \Phi(t) = \exp(kt) = P(t)/M(t), where P(t) is the power of the system and M(t) is the mass of the system at time t. One can maintain exponential growth by reducing the mass as well as increasing the power.

However, waste heat will need to be dissipated. If we use the simplest model where a radius R system with density \rho radiates it away into space, then the temperature will be T=[\rho \Phi R/3 \sigma]^{1/4}, or, if we have a maximal acceptable temperature, R < 3\sigma T^4 / \rho \Phi. So the system needs to become smaller as \Phi increases. If we use active heat transport instead (as outlined in my previous post), covering the surface with heat pipes that can remove X watts/square meter, then R < 3 X / \Phi \rho. Again, the radius will be inversely proportional to \Phi. This is similar to our current computers, where the CPU is a tiny part surrounded by cooling and energy supply.

If we assume the waste heat is just due to erasing bits, the rate of computation will be I = P/(kT \ln 2) = \Phi M/(kT\ln 2) = [4 \pi \rho /3 k \ln 2] \Phi R^3 / T bits per second. Using the first cooling model gives us I \propto T^{11}/ \Phi^2 – a massive advantage to running extremely hot. In the second cooling model I \propto \Phi^{-2}: in both cases higher energy rate densities make it harder to compute when close to the thermodynamic limit. Hence there might be an upper limit to how much we may want to push \Phi.
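A quick numerical check of these scalings, with arbitrary placeholder values for the density, temperature and \Phi (only the ratios matter):

```python
SIGMA = 5.67e-8    # Stefan-Boltzmann constant, W/m^2/K^4

def bit_rate(T, phi, rho=1000.0):
    """I ~ phi * R**3 / T with R set by the radiative cooling model
    (all multiplicative constants dropped, since we only take ratios)."""
    R = 3 * SIGMA * T**4 / (rho * phi)
    return phi * R**3 / T

# Doubling T multiplies the rate by 2**11; doubling phi divides it by 2**2.
print(round(bit_rate(600, 1e4) / bit_rate(300, 1e4)))      # 2048
print(round(bit_rate(300, 2e4) / bit_rate(300, 1e4), 4))   # 0.25
```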

Also, a system with mass M will use up its own mass-energy in time Mc^2/P = c^2/\Phi: the higher the rate, the faster it will run out (independently of size!). If the system is expanding at speed v it will gain and use up mass at a rate M' = 4\pi\rho v^3 t^2 - M\Phi(t)/c^2; if \Phi grows faster than quadratically with time it will eventually run out of mass to use. Hence the exponential growth must eventually slow, simply because of the finite lightspeed.

The Chaisson scenario does not suggest a “sustainable” singularity. Rather, it suggests a local intense transformation involving small, dense nuclei using up local resources. However, such local “detonations” may then spread, depending on the long-term goals of involved entities.

Cases B, C and D (self-improving technology, intelligence explosion, emergence of superintelligence) have an unclear energy profile. We do not know how complex the code would become or what kind of computational search is needed to get to superintelligence. It could be that it is more a matter of smart insights, in which case the needs are modest, or a huge deep learning-like project with massive amounts of data sloshing around, requiring a lot of energy.

Case E, a prediction horizon, is separate from energy use. As this essay shows, there are some things we can say about superintelligent computational systems based on known physics that likely remain valid no matter what.

Case F, phase transition, involves a change in organisation rather than computation, for example the formation of a global brain out of previously uncoordinated people. However, this might very well have energy implications. Physical phase transitions involve discontinuities of the derivatives of the free energy. If the phases have different entropies (first order transitions) there has to be some addition or release of energy. So it might actually be possible that a societal phase transition requires a fixed (and possibly large) amount of energy to reorganize everything into the new order.

There are also second order transitions. These are continuous and have no latent heat, but show divergent susceptibilities (how much the system responds to an external forcing). These might be more like how we normally imagine an ordering process, with local fluctuations near the critical point leading to large and eventually dominant changes in how things are ordered. It is not clear to me that this kind of singularity would have any particular energy requirement.

Case G, complexity disaster, is related to superexponential growth, such as the city growth model of Bettencourt, West et al. or the work on bubbles and finite time singularities by Didier Sornette. Here the rapid growth rate leads to a crisis, or more accurately a series of crises increasingly rapidly succeeding each other until a final singularity. Beyond that the system must behave in some different manner. These models typically predict rapidly increasing resource use (indeed, this is the cause of the crisis sequence as one kind of growth runs into resource scaling problems and is replaced with another one), although as Sornette points out the post-singularity state might well be a stable non-rivalrous knowledge economy.

Case H, an inflexion point, is very vanilla. It would represent the point where our civilization is halfway from where we started to where we are going. It might correspond to “peak energy” where we shift from increasing usage to decreasing usage (for whatever reason), but it does not have to. It could just be that we figure out most physics and AI in the next decades, become a spacefaring posthuman civilization, and expand for the next few billion years, using ever more energy but not having the same intense rate of knowledge growth as during the brief early era when we went from hunter gatherers to posthumans.

Case I, infinite growth, is not normally possible in the physical universe. Information cannot, as far as we know, be stored beyond densities set by the Bekenstein bound (I \leq k_I MR, where k_I \approx 2.577\cdot 10^{43} bits per kg per metre), and we only have access to a volume 4 \pi c^3 t^3/3 with mass density \rho, so the total information must be bounded by I \leq 4 \pi k_I c^4 \rho t^4/3. It grows quickly, but still just polynomially.
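Plugging in numbers shows how tame polynomial growth is: ten times the elapsed time buys only 10^4 times the bits. A sketch, with an assumed average mass density of 10^{-26} kg/m^3 (roughly the cosmological average; not a figure from the text above):

```python
import math

K_I = 2.577e43        # Bekenstein constant, bits per kg per metre
C = 3e8               # lightspeed, m/s
RHO = 1e-26           # assumed average mass density, kg/m^3
YEAR = 3.156e7        # seconds per year

def max_bits(t_years):
    """Upper bound I <= 4*pi*k_I*c^4*rho*t^4/3 on accessible information."""
    t = t_years * YEAR
    return 4 * math.pi * K_I * C**4 * RHO * t**4 / 3

print(f"{max_bits(1e6):.1e} bits after a million years")
print(f"{max_bits(1e7) / max_bits(1e6):.0f}x more after ten million")  # 10000x
```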

The exception to the finitude of growth is if we approach the boundaries of spacetime. Frank J. Tipler’s omega point theory shows how information processing could go infinite in a finite (proper) time in the right kind of collapsing universe with the right kind of physics. It doesn’t look like we live in one, but the possibility is tantalizing: could we arrange the right kind of extreme spacetime collapse to get the right kind of boundary for a mini-omega? It would be way beyond black hole computing and never be able to send back information, but still allow infinite experience. Most likely we are stuck in finitude, but it won’t hurt poking at the limits.


Indefinite exponential growth is never possible for physical properties that have some resource limitation, whether energy, space or heat dissipation. Sooner or later they will have to shift to a slower rate of growth – polynomial for expanding organisational processes (forced to this by the dimensionality of space, finite lightspeed and heat dissipation), and declining growth rate for processes dependent on a non-renewable resource.

That does not tell us much about the energy demands of a technological singularity. We can conclude that it cannot be infinite. It might be high enough that we bump into the resource, thermal and computational limits, which may be what actually defines the singularity energy and time scale. Technological singularities may also be small, intense and localized detonations that merely use up local resources, possibly spreading and repeating. But it could also turn out that advanced thinking is very low-energy (reversible or quantum) or requires merely manipulation of high level symbols, leading to a quiet singularity.

My own guess is that life and intelligence will always expand to fill whatever niche is available, and use the available resources as intensively as possible. That leads to instabilities and depletion, but also expansion. I think we are – if we are lucky and wise – set for a global conversion of the non-living universe into life, intelligence and complexity, a vast phase transition of matter and energy where we are part of the nucleating agent. It might not be sustainable over cosmological timescales, but neither is our universe itself. I’d rather see the stars and planets filled with new and experiencing things than continue a slow dance into the twilight of entropy.

…contemplate the marvel that is existence and rejoice that you are able to do so. I feel I have the right to tell you this because, as I am inscribing these words, I am doing the same.
– Ted Chiang, Exhalation