Bayes’ Broadsword

Yesterday I gave a talk at the joint Bloomberg-London Futurist meeting “The state of the future” about the future of decisionmaking. Parts were updates on my policymaking 2.0 talk (turned into this chapter), but I added a bit more about individual decisionmaking, rationality and forecasting.

The big idea of the talk: ensemble methods really work in a lot of cases. Not always, not perfectly, but they should be among the first tools to consider when trying to make a robust forecast or decision. They are Bayes’ broadsword:

Forecasting

One of my favourite experts on forecasting is J Scott Armstrong. He has stressed the importance of evidence based forecasting, including checking how well different methods work. The general answer is: not very well, yet people keep on using them. He has been pointing this out since the 70s. It also turns out that expertise only gets you so far: expert forecasts are not very reliable either, and the accuracy levels out quickly with increasing level of expertise. One implication is that one should at least get cheap experts since they are about as good as the pricey ones. It is also known that simple models for forecasting tends to be more accurate than complex ones, especially in complex and uncertain situations (see also Haldane’s “The Dog and the Frisbee”). Another important insight is that it is often better to combine different methods than try to select the one best method.

Another classic look at prediction accuracy is Philip Tetlock’s Expert Political Judgment (2005) where he looked at policy expert predictions. They were only slightly more accurate than chance, worse than basic extrapolation algorithms, and there was a negative link to fame: high profile experts have an incentive to be interesting and dramatic, but not right. However, he noticed some difference between “hedgehogs” (people with One Big Theory) and “foxes” (people using multiple theories), with the foxes outperforming hedgehogs.

OK, so in forecasting it looks like using multiple methods, theories and data sources (including experts) is a way to get better results.

Statistical machine learning

A standard problem in machine learning is to classify something into the right category from data, given a set of training examples. For example, given medical data such as age, sex, and blood test results, diagnose what a particular disease a patient might suffer from. The key problem is that it is non-trivial to construct a classifier that works well on data different from the training data. It can work badly on new data, even if it works perfectly on the training examples. Two classifiers that perform equally well during training may perform very differently in real life, or even for different data.

The obvious solution is to combine several classifiers and average (or vote about) their decisions: ensemble based systems. This reduces the risk of making a poor choice, and can in fact improve overall performance if they can specialize for different parts of the data. This also has other advantages: very large datasets can be split into manageable chunks that are used to train different components of the ensemble, tiny datasets can be “stretched” by random resampling to make an ensemble trained on subsets, outliers can be managed by “specialists”, in data fusion different types of data can be combined, and so on. Multiple weak classifiers can be combined into a strong classifier this way.

The method benefits from having diverse classifiers that are combined: if they are too similar in their judgements, there is no advantage. Estimating the right weights to give to them is also important, otherwise a truly bad classifier may influence the output.

Iris data classified using an ensemble of classification methods. — Iris data classified using an ensemble of classification methods (LDA, NBC, various kernels, decision tree). Note how the combination of classifiers also roughly indicates the overall reliability of classifications in a region.

The iconic demonstration of the power of this approach was the Netflix Prize, where different teams competed to make algorithms that predicted user ratings of films from previous ratings. As part of the rules the algorithms were made public, spurring innovation. When the competition concluded in 2009, the leading teams all consisted of ensemble methods where component algorithms were from past teams. The two big lessons were (1) that a combination of not just the best algorithms, but also less accurate algorithms, were the key to winning, and (2) that organic organization allows the emergence of far better performance than having strictly isolated teams.

Group cognition

Condorcet’s jury theorem is perhaps the classic result in group problem solving: if a group of people hold a majority vote, and each has a probability p>1/2 of voting for the correct choice, then the probability the group will vote correctly is higher than p and will tend to approach 1 as the size of the group increases. This presupposes that votes are independent, although stronger forms of the theorem have been proven. (In reality people may have different preferences so there is no clear “right answer”)

Probability that groups of different sizes will reach the correct decision as a function of the individual probability of voting right.

By now the pattern is likely pretty obvious. Weak decision-makers (the voters) are combined through a simple procedure (the vote) into better decision-makers.

Group problem solving is known to be pretty good at smoothing out individual biases and errors. In The Wisdom of Crowds Surowiecki suggests that the ideal crowd for answering a question in a distributed fashion has diversity of opinion, independence (each member has an opinion not determined by the other’s), decentralization (members can draw conclusions based on local knowledge), and the existence of a good aggregation process turning private judgements into a collective decision or answer.

Perhaps the grandest example of group problem solving is the scientific process, where peer review, replication, cumulative arguments, and other tools make error-prone and biased scientists produce a body of findings that over time robustly (if sometimes slowly) tends towards truth. This is anything but independent: sometimes a clever structure can improve performance. However, it can also induce all sorts of nontrivial pathologies – just consider the detrimental effects status games have on accuracy or focus on the important topics in science.

Small group problem solving on the other hand is known to be great for verifiable solutions (everybody can see that a proposal solves the problem), but unfortunately suffers when dealing with “wicked problems” lacking good problem or solution formulation. Groups also have scaling issues: a team of N people need to transmit information between all N(N-1)/2 pairs, which quickly becomes cumbersome.

One way of fixing these problems is using software and formal methods.

The Good Judgement Project (partially run by Tetlock and with Armstrong on the board of advisers) participated in the IARPA ACE program to try to improve intelligence forecasts. They used volunteers and checked their forecast accuracy (not just if they got things right, but if claims that something was 75% likely actually came true 75% of the time). This led to a plethora of fascinating results. First, accuracy scores based on the first 25 questions in the tournament predicted subsequent accuracy well: some people were consistently better than others, and it tended to remain constant. Training (such a debiasing techniques) and forming teams also improved performance. Most impressively, using the top 2% “superforecasters” in teams really outperformed the other variants. The superforecasters were a diverse group, smart but by no means geniuses, updating their beliefs frequently but in small steps.

The key to this success was that a computer- and statistics-aided process found the good forecasters and harnessed them properly (plus, the forecasts were on a shorter time horizon than the policy ones Tetlock analysed in his previous book: this both enables better forecasting, plus the all-important feedback on whether they worked).

Another good example is the Galaxy Zoo, an early crowd-sourcing project in galaxy classification (which in turn led to the Zooniverse citizen science project). It is not just that participants can act as weak classifiers and combined through a majority vote to become reliable classifiers of galaxy type. Since the type of some galaxies is agreed on by domain experts they can used to test the reliability of participants, producing better weightings. But it is possible to go further, and classify the biases of participants to create combinations that maximize the benefit, for example by using overly “trigger happy” participants to find possible rare things of interest, and then check them using both conservative and neutral participants to become certain. Even better, this can be done dynamically as people slowly gain skill or change preferences.

The right kind of software and on-line “institutions” can shape people’s behavior so that they form more effective joint cognition than they ever could individually.

Conclusions

The big idea here is that it does not matter that individual experts, forecasting methods, classifiers or team members are fallible or biased, if their contributions can be combined in such a way that the overall output is robust and less biased. Ensemble methods are examples of this.

While just voting or weighing everybody equally is a decent start, performance can be significantly improved by linking it to how well the participants perform. Humans can easily be motivated by scoring (but look out for disalignment of incentives: the score must accurately reflect real performance and must not be gameable).

In any case, actual performance must be measured. If we cannot tell if some method is more accurate than something else, then either accuracy does not matter (because it cannot be distinguished or we do not really care), or we will not get the necessary feedback to improve it. It is known from the expertise literature that one of the key factors for it to be possible to become an expert on a task is feedback.

Having a flexible structure that can change is a good approach to handling a changing world. If people have disincentives to change their mind or change teams, they will not update beliefs accurately.

I got a good question after the talk: if we are supposed to keep our models simple, how can we use these complicated ensembles? The answer is of course that there is a difference between using a complex and a complicated approach. The methods that tend to be fragile are the ones with too many free parameters, too much theoretical burden: they are the complex “hedgehogs”. But stringing together a lot of methods and weighting them appropriately merely produces a complicated model, a “fox”. Component hedgehogs are fine as long as they are weighed according to how well they actually perform.

(In fact, adding together many complex things can make the whole simpler. My favourite example is the fact that the Kolmogorov complexity of integers grows boundlessly on average, yet the complexity of the set of all integers is small – and actually smaller than some integers we can easily name. The whole can be simpler than its parts.)

In the end, we are trading Occam’s razor for a more robust tool: Bayes’ Broadsword. It might require far more strength (computing power/human interaction) to wield, but it has longer reach. And it hits hard.

Appendix: individual classifiers

I used Matlab to make the illustration of the ensemble classification. Here are some of the component classifiers. They are all based on the examples in the Matlab documentation. My ensemble classifier is merely a maximum vote between the component classifiers that assign a class to each point.

Iris data classified using a naive Bayesian classifier assuming Gaussian distributions.

Iris data classified using a decision tree.

Iris data classified using Gaussian kernels.

Iris data classified using linear discriminant analysis.

Energy requirements of the singularity

After a recent lecture about the singularity I got asked about its energy requirements. It is a good question. As my inquirer pointed out, humanity uses more and more energy and it generally has an environmental cost. If it keeps on growing exponentially, something has to give. And if there is a real singularity, how do you handle infinite energy demands?

First I will look at current trends, then different models of the singularity.

I will not deal directly with environmental costs here. They are relative to some idea of a value of an environment, and there are many ways to approach that question.

Current trends

Current computers are energy hogs. Currently general purpose computing consumes about one Petawatt-hour per year, with the entire world production somewhere above 22 Pwh. While large data centres may be obvious, the vast number of low-power devices may be an even more significant factor; up to 10% of our electricity use may be due to ICT.

Together they perform on the order of $10^{20}$ operations per second, or somewhere in the zettaFLOPS range.

Koomey’s law states that the number of computations per joule of energy dissipated has been doubling approximately every 1.57 years. This might speed up as the pressure to make efficient computing for wearable devices and large data centres makes itself felt. Indeed, these days performance per watt is often more important than performance per dollar.

Meanwhile, general-purpose computing capacity has a growth rate of 58% per annum, doubling every 18 months. Since these trends cancel rather neatly, the overall energy need is not changing significantly.

The push for low-power computing may make computing greener, and it might also make other domains more efficient by moving tasks to the virtual world, making them efficient and allowing better resource allocation. On the other hand, as things become cheaper and more efficient usage tends to go up, sometimes outweighing the gain. Which trend wins out in the long run is hard to predict.

Semilog plot of global energy consumption over time. — Semilog plot of global energy (all types) consumption over time.

Looking at overall energy use trends it looks like overall energy use increases exponentially (but has stayed at roughly the same per capita level since the 1970s). In fact, plotting it on a semilog graph suggests that it is increasing faster than exponential (otherwise it would be a straight line). This is presumably due to a combination of population increase and increased energy use. The best fit exponential has a doubling time of 44.8 years.

Electricity use is also roughly exponential, with a doubling time of 19.3 years. So we might be shifting more and more to electricity, and computing might be taking over more and more of that.

Extrapolating wildly, we would need the total solar input on Earth in about 300 years and the total solar luminosity in 911 years. In about 1,613 years we would have used up the solar system’s mass energy. So, clearly, long before then these trends will break one way or another.

Physics places a firm boundary due to the Landauer principle: in order to erase on bit of information $k T \ln(2)$ joules of energy have to be dissipated. Given current efficiency trends we will reach this limit around 2048.

The principle can be circumvented using reversible computation, either classical or quantum. But as I often like to point out, it still bites in the form of the need for error correction (erasing accidentally flipped bits) and formatting new computational resources (besides the work in turning raw materials into bits). We should hence expect a radical change in computation within a few decades, even if the cost per computation and second continues to fall exponentially.

What kind of singularity?

But how many joules of energy does a technological singularity actually need? It depends on what kind of singularity. In my own list of singularity meanings we have the following kinds:

A. Accelerating change
B. Self improving technology
C. Intelligence explosion
D. Emergence of superintelligence
E. Prediction horizon
F. Phase transition
G. Complexity disaster
H. Inflexion point
I. Infinite progress

Case A, acceleration, at first seems to imply increasing energy demands, but if efficiency grows faster they could of course go down.

Eric Chaisson has argued that energy rate density, how fast and densely energy get used (watts per kilogram), might be an indicator of complexity and growing according to a universal tendency. By this account, we should expect the singularity to have an extreme energy rate density – but it does not have to be using enormous amounts of energy if it is very small and light.

He suggests energy rate density may increase as Moore’s law, at least in our current technological setting. If we assume this to be true, then we would have $\Phi(t) = \exp(kt) = P(t)/M(t)$ , where $P(t)$ is the power of the system and $M(t)$ is the mass of the system at time t. One can maintain exponential growth by reducing the mass as well as increasing the power.

However, waste heat will need to be dissipated. If we use the simplest model where a radius R system with density $\rho$ radiates it away into space, then the temperature will be $T=[\rho \Phi R/3 \sigma]^{1/4}$ , or, if we have a maximal acceptable temperature, $R < 3\sigma T^4 / \rho \Phi$ . So the system needs to become smaller as $\Phi$ increases. If we use active heat transport instead (as outlined in my previous post), covering the surface with heat pipes that can remove X watts/square meter, then $R < 3 X$ $/ \Phi \rho$ . Again, the radius will be inversely proportional to $\Phi$ . This is similar to our current computers, where the CPU is a tiny part surrounded by cooling and energy supply.

If we assume the waste heat is just due to erasing bits, the rate of computation will be $I = P/kT \ln(2) = \Phi M / kT\ln(2) = [4 \pi \rho /3 k \ln(2)] \Phi R^3 / T$ bits per second. Using the first cooling model gives us $I \propto T^{11}/ \Phi^2$ – a massive advantage for running extremely hot and dense computation. In the second cooling model $I \propto \Phi^{-2}$ : in both cases higher energy rate densities make it harder to compute when close to the thermodynamic limit. Hence there might be an upper limit to how much we may want to push $\Phi$ .

Also, a system with mass M will use up its own mass-energy in time $Mc^2/P = c^2/\Phi$ : the higher the rate, the faster it will run out (and it is independent of size!). If the system is expanding at speed v it will gain and use up mass at a rate $M'= 4\pi\rho v t^2 - M\Phi(t)/c^2$ ; if $\Phi$ grows faster than quadratic with time it will eventually run out of mass to use. Hence the exponential growth must eventually reduce simply because of the finite lightspeed.

The Chaisson scenario does not suggest a “sustainable” singularity. Rather, it suggests a local intense transformation involving small, dense nuclei using up local resources. However, such local “detonations” may then spread, depending on the long-term goals of involved entities.

Cases B, C, D (intelligence explosions, superintelligence) have an unclear energy profile. We do not know how complex code would become or what kind of computational search is needed to get to superintelligence. It could be that it is more a matter of smart insights, in which case the needs are modest, or a huge deep learning-like project involving massive amounts of data sloshing around, requiring a lot of energy.

Case E, a prediction horizon, is separate from energy use. As this essay shows, there are some things we can say about superintelligent computational systems based on known physics that likely remains valid no matter what.

Case F, phase transition, involves a change in organisation rather than computation, for example the formation of a global brain out of previously uncoordinated people. However, this might very well have energy implications. Physical phase transitions involve discontinuities of the derivatives of the free energy. If the phases have different entropies (first order transitions) there has to be some addition or release of energy. So it might actually be possible that a societal phase transition requires a fixed (and possibly large) amount of energy to reorganize everything into the new order.

There are also second order transitions. These are continuous do not have a latent heat, but show divergent susceptibilities (how much the system responds to an external forcing). These might be more like how we normally imagine an ordering process, with local fluctuations near the critical point leading to large and eventually dominant changes in how things are ordered. It is not clear to me that this kind of singularity would have any particular energy requirement.

Case G, complexity disaster, is related to superexponential growth, such as the city growth model of Bettancourt, West et al. or the work on bubbles and finite time singularities by Didier Sornette. Here the rapid growth rate leads to a crisis, or more accurately a series of crises increasingly rapidly succeeding each other until a final singularity. Beyond that the system must behave in some different manner. These models typically predict rapidly increasing resource use (indeed, this is the cause of the crisis sequence as one kind of growth runs into resource scaling problems and is replaced with another one), although as Sornette points out the post-singularity state might well be a stable non-rivalrous knowledge economy.

Case H, an inflexion point, is very vanilla. It would represent the point where our civilization is halfway from where we started to where we are going. It might correspond to “peak energy” where we shift from increasing usage to decreasing usage (for whatever reason), but it does not have to. It could just be that we figure out most physics and AI in the next decades, become a spacefaring posthuman civilization, and expand for the next few billion years, using ever more energy but not having the same intense rate of knowledge growth as during the brief early era when we went from hunter gatherers to posthumans.

Case I, infinite growth, is not normally possible in the physical universe. Information can as far as we know not be stored beyond densities set by the Bekenstein bound ( $I \leq k_I MR$ where $k_I\approx 2.577\cdot 10^{43}$ bits per kg per meter), and we only have access to a volume $4 \pi c^3 t^3/3$ with mass density $\rho$ , so the total information growth must be bounded by $I \leq 4 \pi k_I c^4 \rho t^4/3$ . It grows quickly, but still just polynomially.

The exception to the finitude of growth is if we approach the boundaries of spacetime. Frank J. Tipler’s omega point theory shows how information processing could go infinite in a finite (proper) time in the right kind of collapsing universe with the right kind of physics. It doesn’t look like we live in one, but the possibility is tantalizing: could we arrange the right kind of extreme spacetime collapse to get the right kind of boundary for a mini-omega? It would be way beyond black hole computing and never be able to send back information, but still allow infinite experience. Most likely we are stuck in finitude, but it won’t hurt poking at the limits.

Conclusions

Indefinite exponential growth is never possible for physical properties that have some resource limitation, whether energy, space or heat dissipation. Sooner or later they will have to shift to a slower rate of growth – polynomial for expanding organisational processes (forced to this by the dimensionality of space, finite lightspeed and heat dissipation), and declining growth rate for processes dependent on a non-renewable resource.

That does not tell us much about the energy demands of a technological singularity. We can conclude that it cannot be infinite. It might be high enough that we bump into the resource, thermal and computational limits, which may be what actually defines the singularity energy and time scale. Technological singularities may also be small, intense and localized detonations that merely use up local resources, possibly spreading and repeating. But it could also turn out that advanced thinking is very low-energy (reversible or quantum) or requires merely manipulation of high level symbols, leading to a quiet singularity.

My own guess is that life and intelligence will always expand to fill whatever niche is available, and use the available resources as intensively as possible. That leads to instabilities and depletion, but also expansion. I think we are – if we are lucky and wise – set for a global conversion of the non-living universe into life, intelligence and complexity, a vast phase transition of matter and energy where we are part of the nucleating agent. It might not be sustainable over cosmological timescales, but neither is our universe itself. I’d rather see the stars and planets filled with new and experiencing things than continue a slow dance into the twilight of entropy.

…contemplate the marvel that is existence and rejoice that you are able to do so. I feel I have the right to tell you this because, as I am inscribing these words, I am doing the same.
– Ted Chiang, Exhalation

Risky and rewarding robots

Yesterday I participated in recording a radio program about robotics, and I noted that the participants were approaching the issue from several very different angles:

Robots as symbols: what we project things on them, what this says about humanity, how we change ourselves in respect to them, the role of hype and humanity in our thinking about them.
Robots as practical problem: how do you make a safe and trustworthy autonomous device that hangs around people? How do we handle responsibility for complex distributed systems that can generate ‘new’ behaviour?
Automation and jobs: what kinds of jobs are threatened or changed by automation? How does it change society, and how do we steer it in desirable directions – and what are they?
Long-term risks: how do we handle the potential risks from artificial general intelligence, especially given that many people think there are absolutely no problem and others are convinced that this could be existential if we do not figure out enough before it emerges?

In many cases the discussion got absurd because we talked past each other due to our different perspectives, but there were also some nice synergies. Trying to design automation without taking the anthropological and cultural aspects into account will lead to something that either does not work well with people or forces people to behave more machinelike. Not taking past hype cycles into account when trying to estimate future impact leads to overconfidence. Assuming that just because there has been hype in the past nothing will change is equally overconfident. The problems of trustworthiness and responsibility distribution become truly important when automating many jobs: when the automation is an essential part of the organisation, there needs to be mechanisms to trust it and to avoid dissolution of responsibility. Currently robot ethics is more about how humans are impacted by robots rather than ethics for robots, but the latter will become quite essential if we get closer to AGI.

Jobs

I focused on jobs, starting from the Future of Employment paper. Maarten Goos and Alan Manning pointed out that automation seems to lead to a polarisation into “lovely and lousy jobs“: more non-routine manual jobs (lousy), more non-routine cognitive jobs (lovely). The paper strongly supports this, showing that a large chunk of occupations that rely on routine tasks might be possible to automate but things requiring hand-eye coordination, human dexterity, social ability, creativity and intelligence – especially applied flexibly – are pretty safe.

Overall, the economist’s view is relatively clear: automation that embodies skills and ability to do labour can only affect the distribution of jobs and how much certain skills are valued and paid compared with others. There is no rule that if task X can be done by a machine it will be done by a machine: handmade can still pay premium, and the law of comparative advantage might mean it is not worth using the machine to do X when it can do the even more profitable task Y. Still, being entirely dependent on doing X for your living is likely a bad situation.

Also, we often underestimate the impact of “small” parts of tasks that in formal analysis don’t seem to matter. Underwriters are on paper eminently replaceable… except that the ability to notice “Hey! Those numbers don’t make sense” or judge the reliability of risk models is quite hard to implement, and actually may constitute most of their value. We care about hard to automate things like social interaction and style. And priests, politicians, prosecutors and prostitutes are all fairly secure because their jobs might inherently require being a human or representing a human.

However, the development of AI ability is not a continuous predictable curve. We get sudden surprises like the autonomous cars (just a few years ago most people believed autonomous cars were a very hard, nearly impossible problem) or statistical translation. Confluences of technology conspire to change things radically (consider the digital revolution of printing, both big and small, in the 80s that upended the world for human printers). And since we know we are simultaneously overhyping and missing trends, this should not give us a sense of complacency at all. Just because we have always failed to automate X in the past doesn’t mean X might not suddenly turn out to be automateable tomorrow: relying on X being stably in the human domain is a risky assumption, especially when thinking about career choices.

Scaling

Robots also have another important property: we can make a lot of them if we have a reason. If there is a huge demand for humans doing X we need to retrain or have children who grow up to be Xers. That makes the price go up a lot. Robots can be manufactured relatively easily, and scaling up the manufacturing is cheaper: even if X-robots are fairly expensive, making a lot more X-robots might be cheaper than trying to get humans if X suddenly matters.

This scaling is a bit worrisome, since robots implement somebody’s action plan (maybe badly, maybe dangerously creatively): they are essentially an extension of somebody or something’s preferences. So if we could make robot soldiers, the group or side that could make the most would have a potential huge strategic advantage. Making innovations in fast manufacture becomes important, in turn leading to a situation where there is an incentive for an arms race in being able to get an army by a press of a button. This is where I think atomically precise manufacturing is potentially risky: it might enable very quick builds, and that is potentially destabilizing. But even just automatic production (remember, this is a scenario where some robotics is good enough to implement useful military action, so manufacturing robotics will be advanced too). Also, countries running mostly on export on raw materials, if they automate the production there might not be much of a need of most of the population… An economist would say the population might be used for other profitable activities, but many nasty resource-driven governments do not invest in their human capital very much. In fact, they tend to see it as a security problem.

Of course, if we ever get to the level where intellectual tasks and services close to the human scale can be done, the same might apply to more developed economies too. But at that point we are so close to automating the task of making robots and AI better that I expect an intelligence explosion to occur before any social explosions. A society where nobody needs to work might sound nice and might be very worth striving for, but in order to get there we need at the very least get close to general AI and solve its safety problems.

Andart II

Part of Anders' Exoself

Future Studies

Bayes’ Broadsword

Forecasting

Statistical machine learning

Group cognition

Conclusions

Appendix: individual classifiers

Energy requirements of the singularity

Current trends

What kind of singularity?

Conclusions

Risky and rewarding robots

Jobs

Scaling