Don’t Worry – It Can’t Happen

(Originally a Twitter thread)

When @fermatslibrary brought up this 1940 paper about why we have nothing to worry about from nuclear chain reactions, I first checked that it was real and not a modern forgery, because it seemed almost too good to be true in light of current AI safety talk.

Yes, the paper was real: Harrington, J. (1940). Don’t Worry—it Can’t Happen. Scientific American, 162(5), 268.

It gives a summary of a recent fission experiment that demonstrated a chain reaction in which neutrons released from a split atom induce other atoms to split. The article claimed this had caused widespread unease:
“Wasn’t there a dangerous possibility that the uranium would at last become explosive? That the samples being bombarded in the laboratories at Columbia University, for example, might blow up the whole of New York City? To make matters more ominous, news of fission research from Germany, plentiful in the early part of 1939, mysteriously and abruptly stopped for some months. Had government censorship been placed on what might be a secret of military importance?
The press and populace, getting wind of these possibly lethal goings-on, raised a hue and cry.”
However, physicists were unafraid of being blown up (and of blowing up the rest of the world).
“Nothing daunted, however, the physicists worked on to find out whether or not they would be blown up, and the rest of us along with them.”
Then comes a good description of the recent French experiment on making a self-sustaining chain reaction. The resulting neutrons are too fast to interact much with other atoms, making the net number dwindle despite a few initial induced fissions. And since the reaction runs out, there is no risk.
There are some caveats, but don’t worry, scientific consensus seems to be firmly on the safety side!

“With typical French – and scientific – caution, they added that this was perhaps true only for the particular conditions of their own experiment, which was carried out on a large mass of uranium under water. But most scientists agreed that it was very likely true in general.”

This article appeared two years before the Manhattan Project started, so it is unlikely to have been deliberate disinformation: it is an honest take on the state of knowledge at the time. Except, of course, that there actually was a fair bit to worry about soon…
(Foreword from apps.dtic.mil/sti/tr/pdf/ADA, commenting from the coldest part of the Cold War some decades later.)
Note that the concern in the article was merely self-sustaining fission chain reactions, not the atmospheric ignition by fusion discussed later in the Manhattan Project and dealt with in the famous (in existential risk circles) report E. J. Konopinski, C. Marvin, and E. Teller, “Ignition of the Atmosphere with Nuclear Bombs,” Los Alamos National Laboratory, LA-602, April 1946.

The idea that nuclear chain reactions could blow up the world was actually decades old by this time, a trope or motif that had emerged in the earliest years of the 20th century. Given that the energy content in atomic nuclei was known to be vast, that fissile isotopes occur throughout the Earth, and that a chain reaction was at least conceivable after Leo Szilard’s insight in 1933, this was not entirely taken out of thin air.
Human engineering can change conditions *deliberately* to slow down neutrons with a moderator (making a reactor) or use an isotope where hot neutrons cause fission (the atomic bomb). The natural state is not a reliable indicator of the technical state.
It cannot have escaped the present-day reader that this is very similar to many claims that AI will remain safe. It is not reliable enough to self-improve or to perform nefarious tasks well, so the chain reaction runs down. Surely nobody can make an AI moderator or find AI plutonium!
More generally, this seems to be a common failure mode of arguments: solid empirical evidence against something within known conditions cannot just be extrapolated reliably outside those conditions. For the argument to work, what is needed is that (1) the conditions cannot be changed, (2) the result can be smoothly extrapolated, or (3) the impossibility is relevant to the risk.
For nuclear chain reactions both (1) and (2) were wrong (moderators and plutonium). Arguments that AI will always hallucinate may be true, but that does not mean safety follows, since hallucinating humans (the results apply equally to us) are clearly potentially risky.
I think this is a relative of Arthur C. Clarke’s “failure of nerve” (not following the implications of extrapolation, often leading to overconfident impossibility claims) and “failure of imagination” (not looking outside the known domain, or not acknowledging that there could be anything out there), which he discusses in Profiles of the Future: An Inquiry into the Limits of the Possible (1982).
Also, when reading the article I thought about my discussions with Tom Moynihan about how many tropes predate the discoveries or events that enable them to become real – in the Scientific American article we already have the planet-destroying explosion and scientists “going dark” for military secrecy.

The funny thing is that this allows enlightened writers to poke fun at those naive people who merely believe in tropes, rather than the real science. The problem is that sometimes we make tropes true.

Popper vs. Macrohistory: what can we really say about the long-term future?

Talk I gave at the Oxford Karl Popper Society:

The quick summary: Physical eschatology, futures studies and macrohistory try to talk about the long-term future in different ways. Karl Popper launched a broadside against historicism, the approach to the social sciences which assumes that historical prediction is their principal aim. While the main target was the historicism supporting socialism and fascism, the critique has also scared many away from looking at the future – a serious problem for making the social sciences useful. In the talk I look at various aspects of Popper’s critique and how damaging they are. Some parts are fairly unproblematic because they demand too high a degree of precision or determinism, and can be circumvented by using a more Bayesian approach. His main point about knowledge growth making the future impossible to determine still stands and is a major restriction on what we can say – yet there are some ways to reason about the future even with this restriction. The lack of ergodicity of history may be a new problem to recognize: we should not think it would repeat if we re-ran it. That does not rule out local patterns, but the overall endpoint appears random… or perhaps selectable. Except that doing so may turn out to be very, very hard.

My main conclusions are that longtermist views like most Effective Altruism are not affected much by the indeterminacy of Popper’s critique (or the non-ergodicity issue); here the big important issue is how much we can affect the future. That seems to be an open question, well worth pursuing. Macrohistory may be set for a comeback, especially if new methodologies in experimental history, big data history, or even Popper’s own “technological social science” were developed. That one cannot reach certitude does not prevent relevant and reliable (enough) input to decisions in some domains. Knowing which domains those are is another key research issue. In futures studies the critique is largely internalized by now, but it might be worth telling other disciplines about it. To me the most intriguing conclusion is that physical eschatology needs to take the action of intelligent life into account – and that means accepting some pretty far-reaching indeterminacy and non-ergodicity on vast scales.

Thinking long-term, vast and slow

John Fowler “Long Way Down” https://www.flickr.com/photos/snowpeak/10935459325

This spring Richard Fisher at BBC Future commissioned a series of essays about long-termism, Deep Civilisation. I really like this effort (and not just because I get the last word):

“Deep history” is fascinating because it gives us a feeling of the vastness of our roots – not just the last few millennia, but a connection to our forgotten stone-age ancestors, their hominin ancestors, the biosphere evolving over hundreds of millions and billions of years, the planet, and the universe. We are standing on top of a massive sedimentary cliff of past, stretching down to an origin unimaginably deep below.

Yet the sky above, the future, is even more vast and deep. Looking down the 1,857 m into the Grand Canyon is vertiginous. Yet above us the troposphere stretches more than five times further up, followed by an even vaster stratosphere and mesosphere, in turn dwarfed by the thermosphere… and beyond the exosphere fades into the endlessness of deep space. The deep future is in many ways far more disturbing, since it is moving and indefinite.

That also means there is a fair bit of freedom in shaping it. It is not very easy to shape, but if we want to be more than just some fossils buried inside the rocks we had better do it.

What kinds of grand futures are there?

I have been working for about a year on a book on “Grand Futures” – the future of humanity – starting to sketch a picture of what we could eventually achieve were we to survive, get our act together, and reach our full potential. Part of this is an attempt to outline what we know is and isn’t physically possible to achieve; part of it is an exploration of what makes a future good.

Here are some things that appear to be physically possible (not necessarily easy, but doable):

  • Societies with very high standards of sustainable material wealth. At least as rich as (and likely far above) the current rich-nation level in terms of what objects, services, entertainment and other lifestyle goods ordinary people can access.
  • Human enhancement allowing far greater health, longevity, well-being and mental capacity, again at least up to current optimal levels and likely far, far beyond evolved limits.
  • Sustainable existence on Earth with a relatively unchanged biosphere indefinitely.
  • Expansion into space:
    • Settling habitats in the solar system, enabling populations of at least 10 trillion (and likely many orders of magnitude more)
    • Settling other stars in the Milky Way, enabling populations of at least 10^29 people
    • Settling over intergalactic distances, enabling populations of at least 10^38 people.
  • Survival of human civilisation and the species for a long time.
    • As long as other mammalian species – on the order of a million years.
    • As long as Earth’s biosphere remains – on the order of a billion years.
    • Settling the solar system – on the order of 5 billion years
    • Settling the Milky Way or elsewhere – on the order of trillions of years if dependent on sunlight
    • Using artificial energy sources – up to proton decay, somewhere beyond 10^32 years.
  • Constructing Dyson spheres around stars, gaining energy resources corresponding to the entire stellar output, habitable space millions of times Earth’s surface area, and telescope, signalling and energy projection abilities that can reach over intergalactic distances.
  • Moving matter and objects up to galactic size, using their material resources for meaningful projects.
  • Performing more than a googol (10^100) computations, likely far more thanks to reversible and quantum computing.

While this might read as a fairly overwhelming list, it is worth noticing that it does not include gaining access to an infinite amount of matter, energy, or computation, nor indefinite survival. I also think faster-than-light travel is unlikely to become possible. If we do not try to settle remote galaxies within 100 billion years, accelerating expansion will move them beyond our reach. This is a finite but very large possible future.

What kinds of really good futures may be possible? Here are some (not mutually exclusive):

  • Survival: humanity survives as long as it can, in some form.
  • “Modest futures”: humanity survives for as long as is appropriate without doing anything really weird. People have idyllic lives with meaningful social relations. This may include achieving close to perfect justice, sustainability, or other social goals.
  • Gardening: humanity maintains the biosphere of Earth (and possibly other planets), preventing them from crashing or going extinct. This might include artificially protecting them from a brightening sun and astrophysical disasters, as well as spreading life across the universe.
  • Happiness: humanity finds ways of achieving extreme states of bliss or other positive emotions. This might include local enjoyment, or actively spreading minds enjoying happiness far and wide.
  • Abolishing suffering: humanity finds ways of curing negative emotions and suffering without precluding good states. This might include merely saving humanity, or actively helping all suffering beings in the universe.
  • Posthumanity: humanity deliberately evolves or upgrades itself into forms that are better, more diverse or otherwise useful, gaining access to modes of existence currently not possible to humans but equally or more valuable.
  • Deep thought: humanity develops cognitive abilities or artificial intelligence able to pursue intellectual pursuits far beyond what we can conceive of in science, philosophy, culture, spirituality and similar but as yet uninvented domains.
  • Creativity: humanity plays creatively with the universe, making new things and changing the world for its own sake.

I have no doubt I have missed many plausible good futures.

Note that there might be moral trades, where stay-at-homes agree with expansionists to keep Earth an idyllic world for modest futures and gardening while the others go off to do other things, or where long-term-oriented groups agree to give short-term-oriented groups the universe during the stelliferous era in exchange for getting it during the cold degenerate era trillions of years in the future. Real civilisations may also have mixtures of motivations and sub-groups.

Note that the goals and the physical possibilities play out very differently: modest futures do not reach very far, while gardener civilisations may seek to engage in megascale engineering to support the biosphere but not settle space. Meanwhile the happiness-maximizers may want to race to convert as much matter as possible to hedonium, while the deep thought-maximizers may want to move galaxies together to create permanent hyperclusters filled with computation to pursue their cultural goals.

I don’t know what goals are right, but we can examine what they entail. If we see a remote civilization doing certain things we can make some inferences about what is compatible with that behaviour. And we can examine what we need to do today to have the best chance of getting onto a trajectory towards some of these goals: avoiding extinction, improving our coordination ability, and figuring out whether there is some long-run global coordination we need to agree on before spreading to the stars.

Checking my predictions for 2016

Last year I made a number of predictions for 2016 to see how well calibrated I am. Here are the results:

Prediction and stated confidence – correct (1) or not (0)?
No nuclear war: 99% 1
No terrorist attack in the USA will kill > 100 people: 95% 1 (Orlando: 50)
I will be involved in at least one published/accepted-to-publish research paper by the end of 2016: 95% 1
Vesuvius will not have a major eruption: 95% 1
I will remain at my same job through the end of 2016: 90% 1
MAX IV in Lund delivers X-rays: 90% 1
Andart II will remain active: 90% 1
Israel will not get in a large-scale war (ie >100 Israeli deaths) with any Arab state: 90% 1
US will not get involved in any new major war with death toll of > 100 US soldiers: 90% 1
New Zealand has not decided to change its current flag at the end of the year: 85% 1
No multi-country Ebola outbreak: 80% 1
Assad will remain President of Syria: 80% 1
ISIS will control less territory than it does right now: 80% 1
North Korea’s government will survive the year without large civil war/revolt: 80% 1
The US NSABB will allow gain of function funding: 80% 1 [Their report suggests review before funding; currently it is up to the White House to respond.]
US presidential election: democratic win: 75% 0
A general election will be held in Spain: 75% 1
Syria’s civil war will not end this year: 75% 1
There will be no NEO with Torino Scale >0 on 31 Dec 2016: 75% 0 (2016 XP23 showed up on the scale according to JPL, but NEODyS Risk List gives it a zero.)
The Atlantic basin ACE will be below 96.2: 70% 0 (ACE estimate on Jan 1 is 132)
Sweden does not get a seat on the UN Security Council: 70% 0
Bitcoin will end the year higher than $200: 70% 1
Another major eurozone crisis: 70% 0
Brent crude oil will end the year lower than $60 a barrel: 70% 1
I will actually apply for a UK citizenship: 65% 0
UK referendum votes to stay in EU: 65% 0
China will have a GDP growth above 5%: 65% 1
Evidence for supersymmetry: 60% 0
UK larger GDP than France: 60% 1 (although it is a close call; estimates put France at 2421.68 and UK at 2848.76 – quite possibly this might change)
France GDP growth rate less than 2%: 60% 1
I will have made significant progress (4+ chapters) on my book: 55% 0
Iran nuclear deal holding: 50% 1
Apple buys Tesla: 50% 0
The Nikkei index ends up above 20,000: 50% 0 (nearly; the Dec 20 max was 19,494)

Overall, my Brier score is 0.1521, which doesn’t feel too bad.

Plotting the results (where I bin things together into [0.5,0.55], [0.6,0.65], [0.7,0.75], [0.8,0.85], and [0.9,0.99] bins) gives this calibration plot:

Plot of average correctness of my predictions for 2016 as a function of confidence (blue). Red line is perfect calibration.

Overall, I did great on my “sure bets” and fairly weakly on my less certain bets. I did not have enough questions to make this very statistically solid (coming up with good prediction questions is hard!), but the overall shape suggests that I am a bit overconfident, which is not surprising.
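For reference, here is a minimal Python sketch of how the Brier score and the binned calibration curve can be computed from (confidence, outcome) pairs. The handful of entries listed are taken from the table above purely as illustration; the full list has 35 predictions.

```python
import numpy as np

# (stated confidence, outcome) pairs; just a few entries from the table above
# as illustration -- the full prediction list is longer.
preds = [(0.99, 1), (0.95, 1), (0.95, 1), (0.90, 1),
         (0.75, 0), (0.70, 0), (0.65, 0), (0.50, 1)]

p = np.array([c for c, _ in preds], dtype=float)
o = np.array([r for _, r in preds], dtype=float)

# Brier score: mean squared difference between stated probability and outcome.
print("Brier score:", np.mean((p - o) ** 2))

# Calibration: bin predictions by confidence and compare the mean confidence
# in each bin with the observed fraction of correct outcomes.
bins = [(0.50, 0.55), (0.60, 0.65), (0.70, 0.75), (0.80, 0.85), (0.90, 0.99)]
for lo, hi in bins:
    mask = (p >= lo) & (p <= hi)
    if mask.any():
        print(f"{lo:.2f}-{hi:.2f}: mean confidence {p[mask].mean():.2f}, "
              f"fraction correct {o[mask].mean():.2f} (n={mask.sum()})")
```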

Time to come up with good 2017 prediction questions.

A crazy futurist writes about crazy futurists

Arjen the doomsayer

Warren Ellis’ Normal is a little story about the problem of being serious about the future.

As I often point out, most people in the futures game are basically in the entertainment industry: telling wonderful or frightening stories that allow us to feel part of a bigger sweep of history, reflect a bit, and then return to the present with the reassurance that we have some foresight. Relatively little futures studies is about finding decision-relevant insights and then acting on them. Such work exists, but it is not what the bulk of future-oriented people do. Taking the future seriously might require colliding with your society as you try to tell it that it is going the wrong way. Worse, the conclusions might tell you that your own values and goals are wrong.

Normal takes place at a sanatorium for mad futurists in the wilds of Oregon. The idea is that if you spend too much time thinking too seriously about the big and horrifying things in the future, mental illness sets in. So when futurists have nervous breakdowns they get sent by their sponsors to Normal to recover. They are useful, smart, and dedicated people, but since the problems they deal with are so strange, their conditions are equally unusual. The protagonist arrives just in time to encounter a bizarre locked-room mystery – exactly the worst kind of thing for a place like Normal, with its many smart and fragile minds – driving him to investigate what is going on.

As somebody working with the future, I think the caricatures of these futurists (or rather their ideas) are spot on. There are the urbanists, the singularitarians, the neoreactionaries, the drone spooks, and the invented professional divisions. Of course, here they are mad in a way that doesn’t allow them to function in society, which softballs the views: singletons and Molochs are serious, real ideas that should make your stomach lurch.

The real people I know who take the future seriously are overall pretty sane. I remember a documentary filmmaker at a recent existential risk conference mildly complaining that people were so cheerful and well-adjusted: doubtless some darkness and despair would have made for far more compelling imagery than chummy academics trying to salvage the bioweapons convention. Even the people involved in developing the Mutually Assured Destruction doctrine seem to have been pretty healthy. People who go off the deep end tend to do it not because of The Future but because of more normal psychological fault lines. Maybe we are not taking the future seriously enough, but I suspect it is more a case of an illusion of control: we know we are at least doing something.

This book convinced me that I need to seriously start working on my own book project, the “glass is half full” book. Much of our research at FHI seems to be relentlessly gloomy: existential risk, AI risk, all sorts of unsettling changes to the human condition that might slurp us down into a valueless attractor asymptoting towards the end of time. But that is only part of it: there are potential futures so bright that we do not just need sunshades, but have trouble even managing the positive magnitude in an intellectually useful way. The reason we work on existential risk is that we (1) think there is enormous positive potential value at stake, and (2) think actions can meaningfully improve the chances. That is not pessimism, quite the opposite. I can imagine Ellis or one of his characters skeptically looking at me across the table at Normal and accusing me of solutionism and/or a manic episode. Fine. I should lay out my case in due time, with enough logos, ethos and pathos to convince them (Muhahaha!).

I think the fundamental horror at the core of Normal – and yes, I regard this more as a horror story than a techno-thriller or satire – is the belief that The Future is (1) pretty horrifying and (2) unstoppable. I think this is a great conceit for a story and a sometimes necessary intellectual tonic to consider. But it is bad advice for how to live a functioning life or actually make a saner future.

Settling Titan, Schneier’s Law, and scenario thinking

Charles Wohlforth and Amanda R. Hendrix want us to colonize Titan. The essay irritated me in an interesting manner.

Full disclosure: they interviewed me while they were writing their book Beyond Earth: Our Path to a New Home in the Planets, which I have not read yet, and I will only be basing the following on the SciAm essay. This post is not really about settling Titan either, but about something that bothers me with a lot of scenario-making.

A weak case for Titan and against Luna and Mars

Basically the essay outlines reasons why other locations in the solar system are not good: Mercury is too hot, Venus way too hot, and Mars and Luna have too much radiation. Only Titan remains, with a cold environment but not too much radiation.

A lot of course hinges on the assumptions:

We expect human nature to stay the same. Human beings of the future will have the same drives and needs we have now. Practically speaking, their home must have abundant energy, livable temperatures and protection from the rigors of space, including cosmic radiation, which new research suggests is unavoidably dangerous for biological beings like us.

I am not that confident that we will remain biological or vulnerable to radiation. But even if we decide to accept the assumptions, the case against the Moon and Mars is odd:

Practically, a Moon or Mars settlement would have to be built underground to be safe from this radiation. Underground shelter is hard to build and not flexible or easy to expand. Settlers would need enormous excavations for room to supply all their needs for food, manufacturing and daily life.

So making underground shelters is much harder than settling Titan, where buildings need to be insulated against a −179 °C atmosphere and an icy ground full of complex and quite likely toxic hydrocarbons. They suggest that there is no point in going to the moon to live in an underground shelter when you can do it on Earth, which is not too unreasonable – but is there a point in going to live inside an insulated environment on Titan either? The actual motivations would likely be less a desire for outdoor activities and more scientific exploration, reducing existential risk, and maybe industrialization.

Also, while making underground shelters in space may be hard, it does not look like an insurmountable problem. The whole concern is a bit like saying submarines are not practical because the cold of the depths of the ocean will give the crew hypothermia – true, unless you add heating.

I think this is similar to Schneier’s law:

Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can’t break.

It is not hard to find a major problem with a possible plan that you cannot see a reasonable way around. That doesn’t mean there isn’t one.

Settling for scenarios

9 of Matter: The Planet Garden

Maybe Wohlforth and Hendrix spent a lot of time thinking about lunar excavation issues and consistent motivations for settlements to reach a really solid conclusion, but I suspect that they came to it relatively lightly. It produces an interesting scenario: Titan is not the standard target when we discuss where humanity ought to go, and it is an awesome environment.

Similarly, the “humans will be humans” scenario assumptions were presumably chosen not after a careful analysis of the relative likelihood of biological and postbiological futures, but just because they are similar to the past and make an interesting scenario. Plus, human readers like reading about humans rather than robots. Altogether it makes for a good book.

Clearly I have different priors from theirs on the ease and rationality of lunar/Martian excavation and on postbiology – or even on giving us D. radiodurans genes.

In The Age of Em, Robin Hanson argues that if we get the brain emulation scenario, space settlement will be delayed until things get really weird: while postbiological astronauts are very adaptable, much of the mainstream of civilization will be turning inward towards a few dense centers (for economic and communications reasons). Eventually resource demand, curiosity or just whatever comes after the Age of Em may lead to settling the solar system. But that process will be pretty different, even if it is done by mentally human-like beings that do need energy and protection. Their ideal environments would be energy-gradient rich, with short communications lags: Mercury, slowly getting disassembled into a hot Dyson shell, might be ideal. So here the story will be no settlement, and then wildly exotic settlement that doesn’t care much about the scenery.

But even with biological humans we can imagine radically different space settlement scenarios, such as the Gerard K. O’Neill scenario where planetary surfaces are largely sidestepped in favour of asteroids and space habitats. This is Jeff Bezos’ vision rather than Elon Musk’s and Wohlforth/Hendrix’s. It also doesn’t tell the same kind of story: here our new home is not in the planets but between them.

My gripe is not against settling Titan, or even thinking it is the best target because of some reasons. It is against settling too easily for nice scenarios.

Beyond the good story

Sometimes we settle for scenarios because they tell a good story. Sometimes because they are amenable to study among other, much less analyzable possibilities. But ideally we should aim at scenarios that inform us in a useful way about options and pathways we have.

That includes making assumptions wide enough to cover relevant options, even the less glamorous or tractable ones.

That requires assuming future people will be just as capable as us (or more so) at solving problems: just because I can’t see a solution to X doesn’t mean it will not be trivially solved in the future.

(Maybe we could call it the “Manure Principle” after the canonical example of horse manure being seen as an insoluble urban planning problem at the previous turn of the century and then neatly getting resolved by unpredicted trams and cars – and just as with Schneier’s law and Stigler’s law, the reality is of course more complex than the story.)

In standard scenario literature there are often admonitions not just to select a “best case scenario”, “worst case scenario” and “business as usual scenario” – scenario planning comes into its own when you see nontrivial, mixed value possibilities. In particular, we want decision-relevant scenarios that make us change what we will do when we hear about them (rather than good stories, which entertain but do not change our actions). But scenarios on their own do not tell us how to make these decisions: they need to be built from our rationality and decision theory applied to their contents. Easy scenarios make it trivial to choose (cake or death?), but those choices would have been obvious even without the scenarios: no forethought needed except to bring up the question. Complex scenarios force us to think in new ways about relevant trade-offs.

The likelihood of complex scenarios is of course lower than simple scenarios (the conjunction fallacy makes us believe much more in rich stories). But if they are seen as tools for developing decisions rather than information about the future, then their individual probability is less of an issue.

In the end, good stories are lovely and worth having, but for thinking and deciding carefully we should not settle for just good stories or the scenarios that feel neat.


How much should we spread out across future scenarios?

Robin Hanson mentions that some people take him to task for working on one scenario (WBE) that might not be the most likely future scenario (“standard AI” is often seen as more likely); he responds by noting that there are perhaps 100 times more people working on standard AI than on WBE scenarios, yet the probability of AI is likely not a hundred times higher than that of WBE. He also notes that there is a tendency for thinkers to clump onto a few popular scenarios or issues. However:

In addition, due to diminishing returns, intellectual attention to future scenarios should probably be spread out more evenly than are probabilities. The first efforts to study each scenario can pick the low hanging fruit to make faster progress. In contrast, after many have worked on a scenario for a while there is less value to be gained from the next marginal effort on that scenario.

This is very similar to my own thinking about research effort. Should we focus on things that are likely to pan out, or explore a lot of possibilities just in case one of the less obvious cases happens? Given that early progress is quick and easy, we can often get a noticeable fraction of whatever utility the topic has by just a quick dip. The effective altruist heuristic of looking at neglected fields also is based on this intuition.

A model

But under what conditions does this actually work? Here is a simple model:

There are N possible scenarios, one of which (j) will come about. They have probability P_i. We allocate a unit budget of effort to the scenarios: \sum a_i = 1. For the scenario that comes about, we get utility \sqrt{a_j} (diminishing returns).

Here is what happens if we allocate proportional to a power of the scenario probabilities, a_i \propto P_i^\alpha. \alpha=0 corresponds to even allocation, \alpha=1 to allocation proportional to the likelihood, and \alpha>1 to favoring the most likely scenarios. In the following I will run Monte Carlo simulations where the probabilities are randomly generated in each instantiation. The outer bluish envelope represents 95% of the outcomes, the inner one ranges from the lower to the upper quartile of the utility gained, and the red line is the expected utility.

Utility of allocating effort as a power of the probability of scenarios. Red line is expected utility, deeper blue envelope is lower and upper quartiles, lighter blue 95% interval.

This is the N=2 case: we have two possible scenarios with probability p and 1-p (where p is uniformly distributed in [0,1]). Just allocating evenly gives us 1/\sqrt{2} utility on average, but if we put in more effort on the more likely case we will get up to 0.8 utility. As we focus more and more on the likely case there is a corresponding increase in variance, since we may guess wrong and lose out. But 75% of the time we will do better than if we just allocated evenly. Still, allocating nearly everything to the most likely case means that one does lose out on a bit of hedging, so the expected utility declines slowly for large \alpha.

Utility of allocating effort as a power of the probability of scenarios. Red line is expected utility, deeper blue envelope is lower and upper quartiles, lighter blue 95% interval. 100 possible scenarios, with uniform probability on the simplex.

The N=100 case (where the probabilities are drawn from a flat Dirichlet distribution) behaves similarly, but the expected utility is smaller since it is less likely that we will hit the right scenario.
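For the curious, here is a minimal Python sketch of this kind of Monte Carlo simulation. It is not the exact code behind the plots, but it implements the same model: scenario probabilities drawn from a flat Dirichlet distribution, a unit effort budget allocated as a power of the probabilities, and square-root utility for the scenario that happens.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_utility(N, alpha, trials=10_000):
    """Monte Carlo estimate of E[sqrt(a_j)] when effort a_i is proportional to P_i^alpha."""
    utils = np.empty(trials)
    for t in range(trials):
        P = rng.dirichlet(np.ones(N))      # random scenario probabilities
        w = P ** alpha
        a = w / w.sum()                    # unit effort budget: sum(a) = 1
        j = rng.choice(N, p=P)             # the scenario that actually comes about
        utils[t] = np.sqrt(a[j])           # diminishing-returns utility
    return utils.mean()

for alpha in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(f"alpha={alpha}: N=2 -> {expected_utility(2, alpha):.3f}, "
          f"N=100 -> {expected_utility(100, alpha):.3f}")
```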

What is going on?

This doesn’t seem to fit Robin’s or my intuitions at all! The best we can say about uniform allocation is that it doesn’t produce much regret: whatever happens, we will have made some allocation to the possibility. For large N this actually works out better than the directed allocation for a sizable fraction of realizations, but on average we get less utility than betting on the likely choices.

The problem with the model is of course that in it we actually know the probabilities before making the allocation. In reality, we do not know the likelihood of AI, WBE or alien invasions. We have some information, and we do have priors (like Robin’s view that P_{AI} < 100 P_{WBE}), but we are not able to allocate perfectly. A more plausible model would give us probability estimates instead of the actual probabilities.

We know nothing

Let us start by looking at the worst possible case: we do not know what the true probabilities are at all. We can draw estimates from the same distribution – it is just that they are uncorrelated with the true situation, so they are just noise.

Utility of allocating effort as a power of the probability of scenarios, but the probabilities are just random guesses. Red line is expected utility, deeper blue envelope is lower and upper quartiles, lighter blue 95% interval. 100 possible scenarios, with uniform probability on the simplex.

In this case uniform distribution of effort is optimal. Not only does it avoid regret, it has a higher expected utility than trying to focus on a few scenarios (\alpha>0). The larger N is, the less likely it is that we focus on the right scenario since we know nothing. The rationality of ignoring irrelevant information is pretty obvious.

Note that if we have to allocate a minimum effort to each investigated scenario we will be forced to effectively increase our \alpha above 0. The above result gives the somewhat optimistic conclusion that the loss of utility compared to an even spread is rather mild: in the uniform case we have a pretty low amount of effort allocated to the winning scenario, so the low chance of being right in the nonuniform case is balanced by a slightly higher effort allocation on the selected scenarios. For high \alpha there is a tail of rare big “wins” when we hit the right scenario that drags the expected utility upwards, even though in most realizations we bet on the wrong case. This is very much the hedgehog predictor story: occasionally they have analysed the scenario that comes about in great detail and get intensely lauded, despite looking at the wrong things most of the time.

We know a bit

We can imagine that knowing more should allow us to gradually interpolate between the different results: the more you know, the more you should focus on the likely scenarios.

Optimal alpha as a function of how much information we have about the true probabilities (noise due to Monte Carlo and discrete steps of alpha). N=2 (N=100 looks similar).

If we take the mean of the true probabilities and some randomly drawn probabilities (the “half random” case) the curve looks quite similar to the case where we actually know the probabilities: we get a maximum for \alpha\approx 2. In fact, we can mix in just a bit (\beta) of the true probability and get a fairly good guess of where to allocate effort (i.e. we allocate effort as a_i \propto (\beta P_i + (1-\beta)Q_i)^\alpha where the Q_i are uncorrelated noise probabilities). The optimal alpha grows roughly linearly with \beta, \alpha_{opt} \approx 4\beta in this case.
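A small extension of the sketch above illustrates this mixed-estimate allocation; again, this is illustrative Python with my own parameter choices rather than the code behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def utility_noisy(N, alpha, beta, trials=10_000):
    """Allocate a_i ~ (beta*P_i + (1-beta)*Q_i)^alpha, where Q is uncorrelated noise."""
    utils = np.empty(trials)
    for t in range(trials):
        P = rng.dirichlet(np.ones(N))             # true scenario probabilities
        Q = rng.dirichlet(np.ones(N))             # uninformative random guess
        w = (beta * P + (1 - beta) * Q) ** alpha
        a = w / w.sum()
        j = rng.choice(N, p=P)
        utils[t] = np.sqrt(a[j])
    return utils.mean()

# Rough scan for the best alpha at different information levels beta
# (noisy, since each point is itself a Monte Carlo estimate).
alphas = np.arange(0.0, 8.5, 0.5)
for beta in [0.0, 0.25, 0.5, 1.0]:
    best = max(alphas, key=lambda a: utility_noisy(2, a, beta))
    print(f"beta={beta}: best alpha around {best}")
```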

We learn

Adding a bit of realism, we can consider a learning process: after allocating some effort \gamma to estimating the probabilities of the different scenarios we get better information, and can then reallocate. A simple model may be that the standard deviation of the noise behaves as 1/\sqrt{\tilde{a}_i} where \tilde{a}_i is the effort placed in exploring the probability of scenario i. So if we begin by allocating uniformly we will have noise at reallocation on the order of 1/\sqrt{\gamma/N}. We can set \beta(\gamma)=\sqrt{\gamma/N}/C, where C is some constant denoting how tough it is to get information. Putting this together with the above result we get \alpha_{opt}(\gamma)=\sqrt{2\gamma/(NC^2)}. After this exploration, we use the remaining 1-\gamma effort to work on the actual scenarios.

Expected utility as a function of amount of probability-estimating effort (gamma) for C=1 (hard to update probabilities), C=0.1 and C=0.01 (easy to update). N=100.

This is surprisingly inefficient. The reason is that the expected utility declines as \sqrt{1-\gamma} and the gain is just the utility difference between the uniform case \alpha=0 and optimal \alpha_{opt}, which we know is pretty small. If C is small (i.e. a small amount of effort is enough to figure out the scenario probabilities) there is an optimal nonzero \gamma. This optimum \gamma decreases as C becomes smaller. If C is large, then the best approach is just to spread efforts evenly.
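To make the two-stage version concrete, here is a sketch under the same assumptions, plugging in the heuristic \beta(\gamma) and \alpha_{opt}(\gamma) expressions from the text; the cap on \beta and the parameter grid are my own additions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_stage_utility(N, gamma, C, trials=5_000):
    """Spend gamma of the budget estimating probabilities, 1-gamma on the scenarios."""
    beta = min(1.0, np.sqrt(gamma / N) / C)       # estimate quality, capped at 1
    alpha = np.sqrt(2 * gamma / (N * C**2))       # focusing exponent from the text
    utils = np.empty(trials)
    for t in range(trials):
        P = rng.dirichlet(np.ones(N))             # true probabilities
        Q = rng.dirichlet(np.ones(N))             # noise part of the estimate
        w = (beta * P + (1 - beta) * Q) ** alpha
        a = (1 - gamma) * w / w.sum()             # only 1-gamma is left for real work
        j = rng.choice(N, p=P)
        utils[t] = np.sqrt(a[j])
    return utils.mean()

gammas = np.arange(0.0, 1.0, 0.05)
for C in [1.0, 0.1, 0.01]:
    best = max(gammas, key=lambda g: two_stage_utility(100, g, C))
    print(f"C={C}: best gamma around {best:.2f}")
```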

Conclusions

So, how should we focus? These results suggest that the key issue is knowing how little we know compared to what can be known, and how much effort it would take to know significantly more.

If there is little more that can be discovered about which scenarios are likely – because our state of knowledge is pretty good, the world is very random, or improving knowledge about what will happen would be costly – then we should roll with it and distribute effort either among the likely scenarios (when we know them) or spread efforts widely (when we are in ignorance).

If we can acquire significant information about the probabilities of scenarios, then we should do it – but not overdo it. If it is very easy to get the information, we just need to expend some modest effort and then use the rest to flesh out our scenarios. If it is doable but costly, then we may spend a fair bit of our budget on it. But if it is hard, it is better to go directly to the object-level scenario analysis as above. We should not expect the improvement to be enormous.

Here I have used a square root diminishing return model. That drives some of the flatness of the optima: had I used a logarithm function things would have been even flatter, while if the returns diminish more mildly the gains of optimal effort allocation would have been more noticeable. Clearly, understanding the diminishing returns, number of alternatives, and cost of learning probabilities better matters for setting your strategy.

In the case of futures studies we know the number of scenarios is very large. We know that the returns to forecasting efforts are strongly diminishing for most kinds of forecasts. We know that extra efforts in reducing uncertainty about scenario probabilities in e.g. climate models also have strongly diminishing returns. Together this suggests that Robin is right, and that it is rational to stop clustering too hard on favorite scenarios. Insofar as we learn something useful from considering scenarios, we should explore as many as feasible.

Predictions for 2016

The ever-readable Slate Star Codex has a post checking how accurate its predictions for 2015 were; overall Scott Alexander seems pretty well calibrated. Being a born follower, I decided to make a bunch of predictions to check my own calibration in a year’s time.

Here is my list of predictions, with my confidence (some predictions obviously stolen):

  • No nuclear war: 99%
  • No terrorist attack in the USA will kill > 100 people: 95%
  • I will be involved in at least one published/accepted-to-publish research paper by the end of 2016: 95%
  • Vesuvius will not have a major eruption: 95%
  • I will remain at my same job through the end of 2016: 90%
  • MAX IV in Lund delivers X-rays: 90%
  • Andart II will remain active: 90%
  • Israel will not get in a large-scale war (ie >100 Israeli deaths) with any Arab state: 90%
  • US will not get involved in any new major war with death toll of > 100 US soldiers: 90%
  • New Zealand has not decided to change its current flag at the end of the year: 85%
  • No multi-country Ebola outbreak: 80%
  • Assad will remain President of Syria: 80%
  • ISIS will control less territory than it does right now: 80%
  • North Korea’s government will survive the year without large civil war/revolt: 80%
  • The US NSABB will allow gain of function funding: 80%
  • US presidential election: democratic win: 75%
  • A general election will be held in Spain: 75%
  • Syria’s civil war will not end this year: 75%
  • There will be no NEO with Torino Scale >0 on 31 Dec 2016: 75%
  • The Atlantic basin ACE will be below 96.2: 70%
  • Sweden does not get a seat on the UN Security Council: 70%
  • Bitcoin will end the year higher than $200: 70%
  • Another major eurozone crisis: 70%
  • Brent crude oil will end the year lower than $60 a barrel: 70%
  • I will actually apply for a UK citizenship: 65%
  • UK referendum votes to stay in EU: 65%
  • China will have a GDP growth above 5%: 65%
  • Evidence for supersymmetry: 60%
  • UK larger GDP than France: 60%
  • France GDP growth rate less than 2%: 60%
  • I will have made significant progress (4+ chapters) on my book: 55%
  • Iran nuclear deal holding: 50%
  • Apple buys Tesla: 50%
  • The Nikkei index ends up above 20,000: 50%

The point is to have enough that we can see how my calibration works.

Looking for topics leads to amusing finds like the predictions of Nostradamus for 2015. Given that language barriers remain, the dead remain dead, lifespans are less than 200 years, there has been no Big One in the western US, Vesuvius has not erupted, and taxes still remain, I think we can conclude that either he was wrong or the ability to interpret him accurately is near zero. Which of course makes his quatrains equally useless.

Bayes’ Broadsword

Yesterday I gave a talk at the joint Bloomberg-London Futurist meeting “The state of the future” about the future of decisionmaking. Parts were updates on my policymaking 2.0 talk (turned into this chapter), but I added a bit more about individual decisionmaking, rationality and forecasting.

The big idea of the talk: ensemble methods really work in a lot of cases. Not always, not perfectly, but they should be among the first tools to consider when trying to make a robust forecast or decision. They are Bayes’ broadsword:


Forecasting

One of my favourite experts on forecasting is J. Scott Armstrong. He has stressed the importance of evidence-based forecasting, including checking how well different methods work. The general answer is: not very well, yet people keep on using them. He has been pointing this out since the 70s. It also turns out that expertise only gets you so far: expert forecasts are not very reliable either, and accuracy levels out quickly with increasing levels of expertise. One implication is that one should at least get cheap experts, since they are about as good as the pricey ones. It is also known that simple models for forecasting tend to be more accurate than complex ones, especially in complex and uncertain situations (see also Haldane’s “The Dog and the Frisbee”). Another important insight is that it is often better to combine different methods than to try to select the single best method.

Another classic look at prediction accuracy is Philip Tetlock’s Expert Political Judgment (2005), where he looked at policy expert predictions. They were only slightly more accurate than chance, worse than basic extrapolation algorithms, and there was a negative link to fame: high-profile experts have an incentive to be interesting and dramatic, but not to be right. However, he noticed some difference between “hedgehogs” (people with One Big Theory) and “foxes” (people using multiple theories), with the foxes outperforming the hedgehogs.

OK, so in forecasting it looks like using multiple methods, theories and data sources (including experts) is a way to get better results.

Statistical machine learning

A standard problem in machine learning is to classify something into the right category from data, given a set of training examples. For example, given medical data such as age, sex, and blood test results, diagnose which disease a patient might suffer from. The key problem is that it is non-trivial to construct a classifier that works well on data different from the training data. It can work badly on new data, even if it works perfectly on the training examples. Two classifiers that perform equally well during training may perform very differently in real life, or even on different data.

The obvious solution is to combine several classifiers and average (or vote about) their decisions: ensemble based systems. This reduces the risk of making a poor choice, and can in fact improve overall performance if they can specialize for different parts of the data. This also has other advantages: very large datasets can be split into manageable chunks that are used to train different components of the ensemble, tiny datasets can be “stretched” by random resampling to make an ensemble trained on subsets, outliers can be managed by “specialists”, in data fusion different types of data can be combined, and so on. Multiple weak classifiers can be combined into a strong classifier this way.

The method benefits from having diverse classifiers that are combined: if they are too similar in their judgements, there is no advantage. Estimating the right weights to give to them is also important, otherwise a truly bad classifier may influence the output.

Iris data classified using an ensemble of classification methods (LDA, NBC, various kernels, decision tree). Note how the combination of classifiers also roughly indicates the overall reliability of classifications in a region.

The iconic demonstration of the power of this approach was the Netflix Prize, where different teams competed to make algorithms that predicted user ratings of films from previous ratings. As part of the rules the algorithms were made public, spurring innovation. When the competition concluded in 2009, the leading teams all used ensemble methods whose component algorithms came from past teams. The two big lessons were (1) that a combination of not just the best algorithms but also less accurate ones was the key to winning, and (2) that organic organization allows the emergence of far better performance than having strictly isolated teams.

Group cognition

Condorcet’s jury theorem is perhaps the classic result in group problem solving: if a group of people hold a majority vote, and each has a probability p>1/2 of voting for the correct choice, then the probability that the group will vote correctly is higher than p and tends to approach 1 as the size of the group increases. This presupposes that the votes are independent, although stronger forms of the theorem have been proven. (In reality people may have different preferences, so there is no clear “right answer”.)
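As a quick illustration of the theorem (a direct binomial computation, nothing more): for independent voters the group probability is just the chance that more than half of them are right.

```python
from math import comb

def majority_correct(n, p):
    """Probability that a majority of n independent voters (n odd, each correct
    with probability p) reaches the right decision."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# With p = 0.55 the group probability of being right climbs towards 1 as n grows.
for n in (1, 11, 101, 1001):
    print(n, round(majority_correct(n, 0.55), 3))
```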

Probability that groups of different sizes will reach the correct decision as a function of the individual probability of voting right.

By now the pattern is likely pretty obvious. Weak decision-makers (the voters) are combined through a simple procedure (the vote) into better decision-makers.

Group problem solving is known to be pretty good at smoothing out individual biases and errors. In The Wisdom of Crowds, Surowiecki suggests that the ideal crowd for answering a question in a distributed fashion has diversity of opinion, independence (each member has an opinion not determined by the others’), decentralization (members can draw conclusions based on local knowledge), and the existence of a good aggregation process turning private judgements into a collective decision or answer.

Perhaps the grandest example of group problem solving is the scientific process, where peer review, replication, cumulative arguments, and other tools make error-prone and biased scientists produce a body of findings that over time robustly (if sometimes slowly) tends towards truth. This is anything but independent: sometimes a clever structure can improve performance. However, it can also induce all sorts of nontrivial pathologies – just consider the detrimental effects status games have on accuracy, or on which topics get attention, in science.

Small-group problem solving, on the other hand, is known to be great for verifiable solutions (everybody can see that a proposal solves the problem), but unfortunately suffers when dealing with “wicked problems” lacking good problem or solution formulations. Groups also have scaling issues: a team of N people needs to transmit information between all N(N-1)/2 pairs, which quickly becomes cumbersome.

One way of fixing these problems is using software and formal methods.

The Good Judgement Project (partially run by Tetlock, and with Armstrong on the board of advisers) participated in the IARPA ACE program to try to improve intelligence forecasts. They used volunteers and checked their forecast accuracy (not just whether they got things right, but whether claims that something was 75% likely actually came true 75% of the time). This led to a plethora of fascinating results. First, accuracy scores based on the first 25 questions in the tournament predicted subsequent accuracy well: some people were consistently better than others, and this tended to remain constant. Training (such as debiasing techniques) and forming teams also improved performance. Most impressively, using the top 2% “superforecasters” in teams really outperformed the other variants. The superforecasters were a diverse group, smart but by no means geniuses, updating their beliefs frequently but in small steps.

The key to this success was that a computer- and statistics-aided process found the good forecasters and harnessed them properly (plus, the forecasts were on a shorter time horizon than the policy ones Tetlock analysed in his previous book: this both enables better forecasting and provides the all-important feedback on whether the forecasts worked).

Another good example is the Galaxy Zoo, an early crowd-sourcing project in galaxy classification (which in turn led to the Zooniverse citizen science project). It is not just that participants can act as weak classifiers and be combined through a majority vote into reliable classifiers of galaxy type. Since the type of some galaxies is agreed on by domain experts, these can be used to test the reliability of participants, producing better weightings. But it is possible to go further and classify the biases of participants to create combinations that maximize the benefit, for example by using overly “trigger-happy” participants to find possible rare things of interest, and then checking them using both conservative and neutral participants to become certain. Even better, this can be done dynamically as people slowly gain skill or change preferences.
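Here is a toy sketch of that weighting idea – entirely made-up data and a simple accuracy-based weighting, not the actual Galaxy Zoo pipeline: participants are scored on items whose labels the experts agree on, and those scores then weight their votes on new items.

```python
import numpy as np

rng = np.random.default_rng(2)
n_participants, n_gold, n_new = 50, 40, 200
skill = rng.uniform(0.55, 0.9, n_participants)   # hidden per-participant accuracy

# Binary "galaxy type" labels; the gold items have expert-agreed answers.
gold_truth = rng.integers(0, 2, n_gold)
new_truth = rng.integers(0, 2, n_new)

def vote(truth, skill):
    """Each participant answers each item correctly with probability equal to their skill."""
    correct = rng.random((len(skill), len(truth))) < skill[:, None]
    return np.where(correct, truth[None, :], 1 - truth[None, :])

gold_votes = vote(gold_truth, skill)
new_votes = vote(new_truth, skill)

# Estimate reliability from the gold-standard items and turn it into vote weights;
# participants at or below chance get zero weight.
est_accuracy = (gold_votes == gold_truth[None, :]).mean(axis=1)
weights = np.clip(est_accuracy - 0.5, 0, None)

# Weighted vote on the new items versus a plain majority vote.
weighted = (weights[:, None] * (2 * new_votes - 1)).sum(axis=0) > 0
plain = new_votes.mean(axis=0) > 0.5
print("weighted vote accuracy:", (weighted == new_truth.astype(bool)).mean())
print("plain majority accuracy:", (plain == new_truth.astype(bool)).mean())
```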

The right kind of software and on-line “institutions” can shape people’s behavior so that they form more effective joint cognition than they ever could individually.

Conclusions

The big idea here is that it does not matter that individual experts, forecasting methods, classifiers or team members are fallible or biased, if their contributions can be combined in such a way that the overall output is robust and less biased. Ensemble methods are examples of this.

While just voting or weighing everybody equally is a decent start, performance can be significantly improved by linking weights to how well the participants perform. Humans can easily be motivated by scoring (but look out for misalignment of incentives: the score must accurately reflect real performance and must not be gameable).

In any case, actual performance must be measured. If we cannot tell whether some method is more accurate than another, then either accuracy does not matter (because it cannot be distinguished or we do not really care), or we will not get the necessary feedback to improve it. It is known from the expertise literature that one of the key factors in becoming an expert at a task is feedback.

Having a flexible structure that can change is a good approach to handling a changing world. If people have disincentives to change their mind or change teams, they will not update beliefs accurately.

I got a good question after the talk: if we are supposed to keep our models simple, how can we use these complicated ensembles? The answer is of course that there is a difference between a complex and a complicated approach. The methods that tend to be fragile are the ones with too many free parameters and too much theoretical burden: they are the complex “hedgehogs”. But stringing together a lot of methods and weighting them appropriately merely produces a complicated model, a “fox”. Component hedgehogs are fine as long as they are weighted according to how well they actually perform.

(In fact, adding together many complex things can make the whole simpler. My favourite example is the fact that the Kolmogorov complexity of integers grows boundlessly on average, yet the complexity of the set of all integers is small – and actually smaller than some integers we can easily name. The whole can be simpler than its parts.)

In the end, we are trading Occam’s razor for a more robust tool: Bayes’ Broadsword. It might require far more strength (computing power/human interaction) to wield, but it has longer reach. And it hits hard.

Appendix: individual classifiers

I used Matlab to make the illustration of the ensemble classification. Here are some of the component classifiers; they are all based on the examples in the Matlab documentation. My ensemble classifier is merely a maximum vote between the component classifiers, each of which assigns a class to each point.
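For readers who prefer Python, a rough scikit-learn analogue of the same idea – hard majority voting over LDA, naive Bayes, an RBF-kernel SVM and a decision tree – looks like the sketch below. It is an illustrative reconstruction, not the Matlab code behind the figures.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Component classifiers roughly matching the ones above: LDA, a naive Bayesian
# classifier, a Gaussian (RBF) kernel SVM, and a decision tree.
components = [
    ("lda", LinearDiscriminantAnalysis()),
    ("nbc", GaussianNB()),
    ("rbf", SVC(kernel="rbf", gamma="scale")),
    ("tree", DecisionTreeClassifier(max_depth=3)),
]

# Hard voting: each component assigns a class and the ensemble takes the majority.
ensemble = VotingClassifier(estimators=components, voting="hard")

for name, clf in components + [("ensemble", ensemble)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:9s} accuracy: {scores.mean():.3f}")
```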

Iris data classified using a naive Bayesian classifier assuming Gaussian distributions.
Iris data classified using a decision tree.
Iris data classified using Gaussian kernels.
Iris data classified using linear discriminant analysis.