Some math for Epiphany

Analytic functions helping you out

Recently I chatted with a mathematician friend about generating functions in combinatorics. Normally they are treated as a neat symbolic trick: you have a sequence $a_n$ (typically how many there are of some kind of object of size $n$), you formally define a function $f(z)=\sum_{n=0}^\infty a_n z^n$, you derive some constraints on the function, and from this you get a formula for the $a_n$ or other useful data. Convergence does not matter, since this is purely symbolic. We used this in our paper counting tie knots. It is a delightful way of solving recurrence relations or bundling up moments of probability distributions.

I innocently wondered if the function (especially its zeroes and poles) held any interesting information. My friend told me that there was analytic combinatorics: you can actually take $f(z)$ seriously as a (complex) function and use the powerful machinery of complex analysis to calculate asymptotic behavior for the $a_n$ from the location and type of the “dominant” singularities. He pointed me at the excellent course notes from a course at Princeton linked to the textbook by Philippe Flajolet and Robert Sedgewick. They show a procedure for taking combinatorial objects, converting them symbolically into generating functions, and then getting their asymptotic behavior from the properties of the functions. This is extraordinarily neat, both in terms of efficiency and in linking different branches of math.

Plot of z/(1-z-z^2), the generating function of the Fibonacci numbers. It has poles at (-1+sqrt(5))/2 (the dominant pole, nearest the origin, giving the overall asymptotic growth of Fibonacci numbers) and (-1-sqrt(5))/2, which does not contribute much to the asymptotic behavior.
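As a rough illustration of how a dominant singularity pins down the asymptotics, here is a minimal Python sketch. The function names are my own, not anything from the course notes. The poles of $z/(1-z-z^2)$ are the roots of $1-z-z^2$, and the contribution from the pole nearest the origin alone already reproduces the Fibonacci numbers to within rounding.

```python
import math

def fibonacci(n):
    """Return the n-th Fibonacci number (F_0 = 0, F_1 = 1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Poles of z/(1 - z - z^2): the roots of z^2 + z - 1 = 0.
sqrt5 = math.sqrt(5)
poles = [(-1 + sqrt5) / 2, (-1 - sqrt5) / 2]
dominant = min(poles, key=abs)  # the pole closest to the origin

def pole_estimate(n, rho=dominant):
    """Contribution of the simple pole at rho to the coefficient of z^n.

    Near rho, f(z) ~ residue / (z - rho) with residue = rho / (-1 - 2*rho).
    Expanding 1/(z - rho) as a geometric series in z/rho gives the
    coefficient rho**(-n) / (1 + 2*rho).
    """
    return rho ** (-n) / (1 + 2 * rho)

for n in (5, 10, 20, 30):
    print(n, fibonacci(n), pole_estimate(n))
```

The estimate grows like $(1/\rho)^n$ where $\rho$ is the dominant pole, which is the golden ratio growth one expects.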

In our case, one can show nearly by inspection that the number of Fink-Mao tie knots grows with the number of moves as $\sim 2^n$, while single tuck tie knots grow as $\sim \sqrt{6}^n$.

Analytic functions behaving badly

The second piece of math I found this weekend was about random Taylor series and lacunary functions.

If $f(z)=\sum_{n=0}^\infty X_n z^n$ where $X_n$ are independent random numbers, what kind of functions do we get? Trying it with complex Gaussian $X_n$ produces a disk of convergence with some nondescript function on the inside.

Plot of function with a Gaussian Taylor series. Color corresponds to stereographic mapping of the complex plane to a sphere, with infinity being white and zeros black. The domain of convergence is the unit disk.

Replacing the complex Gaussian with a real one, or uniform random numbers, or even power-law numbers gives the same behavior. They all seem to have radius 1. This is not just a vanilla disk of convergence (where an analytic function reaches a pole or singularity somewhere on the boundary but is otherwise fine and continuable), but a natural boundary – that is, a boundary so dense with poles or singularities that continuation beyond it is not possible at all.

The locus classicus about random Taylor series is apparently Kahane, J.-P. (1985), Some Random Series of Functions. 2nd ed., Cambridge University Press, Cambridge.

A naive handwave argument is that for $|z|<1$ we have an exponentially decaying sequence of $z^n$ , so if the $X_n$ have some finite average size $E(X)$ and not too divergent variance we should expect convergence, while outside the unit circle any nonzero $E(X)$ will allow it to diverge. We can even invoke the Markov inequality $P(X>t) \leq E(X)/t$ to argue that a series $\sum X_n f(n)$ would converge if $\sum f(n)/n$ converges. However, this is not correct enough for proper mathematics. One entirely possible Gaussian outcome is $X_n=1,10,100,1000,\ldots$ or worse. We need to speak of probabilistic convergence.

Andrés E. Caicedo has a good post about how to approach it properly. The “trick” is the awesome Kolmogorov zero-one law that implies that since the radius of convergence depends on the entire series $X_n$ rather than any finite subset (and they are all independent) it will be a constant.
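The claim that all these series have radius 1 can be checked numerically with the Cauchy-Hadamard formula $1/R=\limsup_{n\to\infty}|X_n|^{1/n}$. The following is only a sketch: it approximates the limsup by the supremum over a finite tail of coefficients, and the function name, the tail limits and the Pareto exponent are my own arbitrary choices.

```python
import random

def radius_estimate(sample, n_min=2000, n_max=20000, seed=0):
    """Estimate the radius of convergence of sum X_n z^n.

    `sample` draws one coefficient from a random.Random instance.
    The limsup of |X_n|^(1/n) is approximated by the maximum over
    the finite tail n_min <= n <= n_max.
    """
    rng = random.Random(seed)
    coeffs = [sample(rng) for _ in range(n_max + 1)]
    tail_sup = max(abs(coeffs[n]) ** (1.0 / n)
                   for n in range(n_min, n_max + 1))
    return 1.0 / tail_sup

gaussian_radius = radius_estimate(lambda rng: rng.gauss(0.0, 1.0))
uniform_radius = radius_estimate(lambda rng: rng.random())
# Heavy-tailed coefficients: Pareto with shape 1.5 (an arbitrary choice).
pareto_radius = radius_estimate(lambda rng: rng.paretovariate(1.5))

print(gaussian_radius, uniform_radius, pareto_radius)
```

All three estimates should land very close to 1: even a rare huge coefficient is flattened by the $1/n$ power, which is the numerical face of the zero-one argument.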

This kind of natural boundary disk of convergence may look odd to beginning students of complex analysis: after all, none of the functions we normally encounter behave like this. Except that this is of course selection bias. If you look at the example series for lacunary functions they all look like fairly reasonable sparse Taylor series like $z+z^4+z^8+z^{16}+z^{32}+\ldots$. In calculus we are used to worrying that the coefficients in front of the z-terms of a series don’t diminish fast enough: having fewer nonzero terms seems entirely innocuous. But as Hadamard showed, it is enough that the size of the gaps grows geometrically for the function to get a natural boundary (in fact, even denser series do this – for example having just prime powers). The same is true for Fourier series. Weierstrass’ famous continuous but nowhere differentiable function is lacunary (in his 1880 paper on analytic continuation he gives the example $\sum a_n z^{b^n}$ of an uncontinuable function). In fact, as Emile Borel found and Steinhaus eventually proved in a stricter sense, in general (“almost surely”) a Taylor series isn’t continuable because of boundaries.

The function $\sum_p z^p$, where $p$ runs over the primes.

One could of course try to combine the analytic combinatorics with the lacunary stuff. In a sense a lacunary generating function is a worst case scenario for the singularity-measuring methods used in analytic combinatorics since you get an infinite number of them at a finite and equal distance, and now have to average them together somehow. Intuitively this case seems to correspond to counting something that becomes rarer at a geometric rate or faster. But the Borel-Steinhaus results suggest that even objects that do not become rare could have nasty natural boundaries – if the numbers $a_n$ were due to something close enough to random we should expect estimating asymptotics to be hard. The funniest example I can think of is the number of roots of Chaitin-style Diophantine equations where for each $n$ it is an independent arithmetic fact whether there are any: this is hardcore random, and presumably the exact asymptotic growth rate will be uncomputable both practically and theoretically.

Big picture thinking

In Michaelmas term 2015 we ran a seminar series on Big Picture Thinking at FHI. The videos of most seminars are online.

I gave a talk on observer selection effects, and here are my somewhat overdone lecture notes. Covers selection bias, anthropic reasoning, anthropic shadows, nuclear war near misses, physics disasters, the Doomsday Argument, the Fermi Paradox, the Simulation Argument, fine tuning and Boltzmann brains.

Starkiller base versus the ideal gas law

My friend Stuart explains why the Death Stars and the Starkiller Base in the Star Wars universe are inefficient ways of taking over the galaxy. I generally agree: even a super-inefficient robot army will win if you simply bury enemy planets in robots.

But thinking about the physics of absurd superweapons is fun and warms the heart.

The ideal gas law: how do you compress stars?

My biggest problem with the Starkiller Base is the ideal gas law. The weapon works by sucking up a star and then beaming its energy or plasma at remote targets. A sun-like star has a volume around 1.4*10¹⁸ cubic kilometres, while an Earthlike planet has a volume around 10¹² cubic kilometres. So if you suck up a star it will get compressed by a factor of 1.4 million. The ideal gas law states that pressure times volume equals temperature times the number of particles and some constant: $PV=nRT$

1.4 million times less volume needs to be balanced somehow: either the pressure P has to go down, the temperature T has to go up, or the number of particles n needs to go down.

Pressure reduction seems to be a non-starter, unless the Starkiller base actually contains some kind of alternate dimension where there is no pressure (or an enormous volume).

The second case implies a temperature increase by a factor of 1.4 million. Remember how hot a bike pump gets when compressing air: this is the same effect. This would heat the photosphere gas to 8.4 billion degrees and the core to 2.2*10¹³ K, 22 TeraKelvin; the average would be somewhere between, on the hotter side. We are talking about temperatures microseconds after the Big Bang, hotter than a supernova: protons and neutrons melt at 0.5–1.2 TK into a quark-gluon plasma. Excellent doomsday weapon material but now containment seems problematic. Even if we have antigravity forcefields to hold the star, the black-body radiation is beyond the supernova range. Keeping it inside a planet would be tough: the amount of neutrino radiation would likely blow up the surface like a supernova bounce does.
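The arithmetic behind these temperatures is short enough to spell out. The sketch below takes the scaling above at face value (temperature multiplied by the compression factor) and assumes round solar values of about 6000 K for the photosphere and 1.57*10⁷ K for the core; those two inputs and all the names are my assumptions, not figures stated in the text.

```python
STAR_VOLUME_KM3 = 1.4e18     # sun-like star, from the text
PLANET_VOLUME_KM3 = 1e12     # Earthlike planet, from the text

# Assumed round values for the Sun (not stated in the text).
PHOTOSPHERE_K = 6000.0
CORE_K = 1.57e7

compression = STAR_VOLUME_KM3 / PLANET_VOLUME_KM3  # about 1.4 million

def compressed_temperature(t_kelvin, factor=compression):
    """Scale a temperature by the volume compression factor."""
    return t_kelvin * factor

print(f"compression factor: {compression:.3g}")
print(f"photosphere: {compressed_temperature(PHOTOSPHERE_K):.3g} K")
print(f"core:        {compressed_temperature(CORE_K):.3g} K")
```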

Maybe the extra energy is bled off somehow? That might be a way to merely get super-hot plasma rather than something evaporating the system. Maybe those pesky neutrinos can be shunted into hyperspace, taking most of the heat with them (neutrino cooling can be surprisingly fast for very hot objects; at these absurd temperatures it is likely subsecond down to mere supernova temperatures).

Another bizarre and fun approach is to reduce the number of gas particles: simply fuse them all into a single nucleus. A neutron star is in a sense a single atomic nucleus. As a bonus, the star would now be a tiny multikilometre sphere held together by its own gravity. If n is reduced by a factor of 10⁵⁷ it could outweigh the compression temperature boost. There would be heating from all the fusion; my guesstimate is that it is about a percent of the mass energy, or 2.7*10⁴⁵ J. This would heat the initial gas to around 96 billion degrees, still manageable by the dramatic particle number reduction. This approach still would involve handling massive neutrino emissions, since the neutronium would still be pretty hot.

In this case the star would remain gravitationally bound into a small blob: convenient as a bullet. Maybe the red “beam” is actually just an accelerated neutron star, leaking mass along its trajectory. The actual colour would of course be more like blinding white with a peak in the gamma ray spectrum. Given the intense magnetic fields locked into neutron stars, moving them electromagnetically looks pretty feasible… assuming you have something on the other end of the electromagnetic field that is heavier or more robust. If a planet shoots a star-mass bullet at a high velocity, then we should expect the recoil to send the planet moving about a million times faster in the opposite direction.

Other issues

We have also ignored gravity: putting a sun-mass inside an Earth-radius means we get 333,000 times higher gravity. We can try to hand-wave this by arguing that the antigravity used to control the star eating also compensates for the extra gravity. But even a minor glitch in the field would produce an instant, dramatic squishing. Messing up the system* containing the star would not produce conveniently dramatic earthquakes and rifts, but rather near-instant compression into degenerate matter.

(* System – singular. Wow. After two disasters due to single-point catastrophic failures one would imagine designers learning their lesson. Three times is enemy action: if I were the Supreme Leader I would seriously check if the lead designer happens to be named Skywalker.)

There is also the issue of the amount of energy needed to run the base. Sucking up a star from a distance requires supplying the material with the gravitational binding energy of the star, 6.87*10⁴¹ J for the sun. Doing this over an hour or so is a pretty impressive power, about 1.9*10³⁸ W. This is about 486 billion times the solar luminosity. In fact, just beaming that power at a target using any part of the electromagnetic spectrum would fry just about anything.
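For anyone wanting to vary the assumptions, here is the power estimate as a small Python sketch. The binding energy and the one-hour duration are from the paragraph above; the solar luminosity of 3.828*10²⁶ W is the nominal value I am assuming, so the ratio comes out at roughly 5*10¹¹, the same ballpark as the figure quoted above.

```python
BINDING_ENERGY_J = 6.87e41      # gravitational binding energy of the Sun, from the text
SOLAR_LUMINOSITY_W = 3.828e26   # assumed nominal solar luminosity

def required_power(energy_joule, seconds):
    """Average power needed to supply an energy over a given time."""
    return energy_joule / seconds

power = required_power(BINDING_ENERGY_J, 3600.0)  # spread over one hour
ratio = power / SOLAR_LUMINOSITY_W

print(f"power: {power:.3g} W")
print(f"solar luminosities: {ratio:.3g}")
```

Stretching the process over a day instead of an hour only buys a factor of 24, so the conclusion does not depend much on the exact duration.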

Of course, a device that can suck up a star ought to be able to suck up planets a million times faster. So there is no real need to go for stars: just suck up the Republic. Since the base can suck up space fleets too, local defences are not much of a problem. Yes, you may have to go there with your base, but if the Death Star can move, the Starkiller can too. If nothing else, it could use its beam to propel itself.

If the First Order want me to consult on their next (undoubtedly even more ambitious) project I am open to offers. However, one iron-clad condition given recent history is that I get to work from home, as far away as possible from the superweapon. Ideally in a galaxy far, far away.

Did amphetamines help Erdős?

During my work on the Paris talk I began to wonder whether Paul Erdős (who I used as an example of a respected academic who used cognitive enhancers) could actually have been shown to have benefited from his amphetamine use, which began in 1971 according to Hill (2004). One way of investigating is his publication record: how many papers did he produce per year before or after 1971? Here is a plot, based on Jerrold Grossman’s 2010 bibliography:

Productivity of Paul Erdős over his life. Green dashed line: amphetamine use, red dashed line: death. Crosses mark named concepts.

The green dashed line is the start of amphetamine use, and the red dashed line is the date of death. Yes, there is a fairly significant posthumous tail: old mathematicians never die, they just asymptote towards zero. Overall, the later part is more productive per year than the early part (before 1971 the mean and standard deviation were 14.6±7.5, after 24.4±16.1; a Kruskal-Wallis test rejects that they are the same distribution, p=2.2e-10).

This does not prove anything. After all, his academic network was growing and he moved from topic to topic, so we cannot prove any causal effect of the amphetamine: for all we know, it might have been holding him back.

One possible argument might be that he did not do his best work on amphetamine. To check this, I took the Wikipedia article that lists things named after Erdős, and tried to find years for the discovery/conjecture. These are marked with red crosses in the diagram, slightly jittered. We can see a few clusters that may correspond to creative periods: one in 1935-41, one in 1946-51, one in 1956-60. After 1970 the distribution was more even and sparse. 76% of the most famous results were done before 1971; given that this is 60% of the entire career it does not look that unlikely to be due to chance (a binomial test gives p=0.06).

Again this does not prove anything. Maybe mathematics really is a young man’s game, and we should expect key results early. There may also have been more time to recognize and name results from the earlier career.

In the end, this is merely a statistical anecdote. It does show that one can be a productive, well-renowned (if eccentric) academic while on enhancers for a long time. But given the N=1, firm conclusions or advice are hard to draw.

Erdős’s friends worried about his drug use, and in 1979 Graham bet Erdős $500 that he couldn’t stop taking amphetamines for a month. Erdős accepted, and went cold turkey for a complete month. Erdős’s comment at the end of the month was “You’ve showed me I’m not an addict. But I didn’t get any work done. I’d get up in the morning and stare at a blank piece of paper. I’d have no ideas, just like an ordinary person. You’ve set mathematics back a month.” He then immediately started taking amphetamines again. (Hill 2004)

Limits of morphological freedom

My talk “Morphological freedom: what are the limits to transforming the body?” was essentially a continuation of my original morphological freedom talk from 2001. Now with added philosophical understanding and linking to some of the responses to the original paper. Here is a quick summary:

Enhancement and extensions

I began with a few cases: Liz Parrish self-experimenting with gene therapy to slow her ageing, Paul Erdős using drugs for cognitive enhancement, Todd Huffman exploring the realm of magnetic vision using an implanted magnet, Neil Harbisson gaining access to the realm of color using sonification, Stelarc doing body modification and extension as performance art, and Erik “The Lizardman” Sprague transforming into a lizard as an existential project.

It is worth noting that several of these are not typical enhancements amplifying an existing ability, but are about gaining access to entirely new abilities (I call it “extension”). Their value is not instrumental, but lies in the exploration or self-transformation. They are by their nature subjective and divergent. While some argue enhancements will by their nature be convergent (I disagree), extensions definitely go in all directions – and in fact gain importance from being different.

Morphological freedom and its grounding

Morphological freedom, “The right to modify one’s body (or not modify) according to one’s desires”, can be derived from fundamental rights such as the right to life and the right to pursue happiness. If you are not free to control your own body, your right to life and freedom are vulnerable and contingent: hence you need to be allowed to control your body. But I argue this includes a right to change the body: morphological freedom.

One can argue about what rights are, or if they exist. If there are such things, there is however a fair consensus that life and liberty are on the list. Similarly, morphological freedom seems to be so intrinsically tied together with personhood that it becomes inalienable: you cannot remove it from a person without removing an important aspect of what it means to be a person.

These arguments are about fundamental rights rather than civil and legal rights: while I think we should make morphological freedom legally protected, I do think there is more to it than just mutual agreement. Patrick Hopkins wrote an excellent paper analysing how morphological freedom could be grounded. He argued that there are three primary approaches: grounding it in individual autonomy, in human nature, or in human interests. Autonomy is very popular, but Hopkins thinks much of current discourse is a juvenile “I want to be allowed to do what I want” autonomy rather than the more rational or practical concepts of autonomy in deontological or consequentialist ethics. One pay-off is that these concepts do imply limits on using morphological freedom to undermine one’s own autonomy. Grounding in human nature requires a view of human nature. Transhumanists and many bioconservatives actually find themselves allies against the relativists and constructivists who deny any nature: they merely disagree on what the sacrosanct parts of that nature are (and these define limits of morphological freedom). Transhumanists think most proposed enhancements are outside these parts, while the conservatives think they cover nearly any enhancement. Finally, grounding in what makes humans truly flourish again produces some ethically relevant limits. However, the interest account has trouble with extensions: at best it can argue that we need exploration or curiosity.

One can motivate morphological freedom in many other ways. One is that we need to explore: both because there may be posthuman modes of existence of extremely high value, and because we cannot know the value of different changes without trying them – the world is too complex to be reliably predicted, and many valuable things are subjective in nature. One can also argue we have some form of duty to approach posthumanity, because this approach is intrinsically or instrumentally important (consider a transhumanist reading of Nietzsche, or cosmist ideas). This approach typically seems to require some non-person-affecting value. Another approach is to argue morphological freedom is socially constructed within different domains; we have one kind of freedom in sport, another one in academia. I am not fond of this approach since it does not explain how to handle the creation of new domains or what to do between domains. Finally, there is the virtue approach: self-transformation can be seen as a virtue. By this perspective we are not only allowed to change ourselves, we ought to since it is part of human excellence and authenticity.

Limits

Limits to morphological freedom can be roughly categorized as practical/prudential, issues of willingness to change/identity, the ethical limits, and the social limits.

Practical/prudential limits

Safety is clearly a constraint. If an enhancement is too dangerous, then the risk outweighs the benefit and it should not be done. This is tricky to evaluate for more subjective benefits. The real risk boundary might not be a risk/benefit trade-off, but whether risk is handled in a responsible manner. The difference between being a grinder and doing self-harm consists in whether one is taking precautions and regards pain and harms as problems rather than the point of the exercise.

There are also obvious technological and biological limits. I did not have the time to discuss them, but I think one can use heuristics like the evolutionary optimality challenge to make judgements about feasibility and safety.

Identity limits

Even in a world where anything could be changed with no risk, economic cost or outside influence it is likely that many traits would remain stable. We express ourselves through what we transform ourselves into, and this implies that we will not change what we consider to be prior to that. The Riis, Simmons and Goodwin study showed that surveyed students were much less willing to enhance traits that were regarded more relevant to personal identity than peripheral traits. Rather than “becoming more than you are” the surveyed students were interested in being who they are – but better at it. Morphological freedom may hence be strongly constrained by the desire to maintain a variant of the present self.

Ethical limits

Besides the limits coming from the groundings discussed above, there are the standard constraints of not harming or otherwise infringing on the rights of others, capacity (what do we do about prepersons, children or the deranged?) and informed consent. The problem here is not any disagreement about the existence of the constraints, but where they actually lie and how they actually play out.

Social limits

There are obvious practical social limits for some morphological freedom. Becoming a lizard affects your career choices and how people react to you – the fact that maybe it shouldn’t does not change the fact that it does.

There are also constraints from externalities: morphological freedom should not unduly externalize its costs on the rest of society.

My original paper has got a certain amount of flak from the direction of disability rights, since I argued morphological freedom is a negative right. You have a right to try to change yourself, but I do not need to help you – and vice versa. The criticism is that this is ableist: to be a true right there must be social support for achieving the inherent freedom. To some extent my libertarian leanings made me favour a negative right, but it was also the less radical choice: I am actually delighted that others think we need to reshape society to help people self-transform, a far more radical view. I have some misgivings about the politics of this: prioritization tends to be nasty business, it means that costs will be socially externalized, and in the literature there seem to be some odd views about who gets to say what bodies are authentic or not. But I am all in favour of a “commitment to the value, standing, and social legibility of the widest possible (and an ever-expanding) variety of desired morphologies and lifeways.”

Another interesting discourse has been about the control of the body. While in medicine there has been much work to normalize it (slowly shifting towards achieving functioning in one’s own life), in science the growth of ethics review has put more and more control in the hands of appointed experts, while in performance art almost anything goes (and attempts to control it would be censorship). As Goodall pointed out, many of the body-oriented art pieces are as much experiments in ethics as they are artistic experiments. They push the boundaries in important ways.

Touch the limits

In the end, I think this is an important realization: we do not fully know the moral limits of morphological freedom. We should not expect all of them to be knowable through prior reasoning. This is a domain where much is unknown and hard for humans to reason about. Hence we need experiments and exploration to learn them. We should support this exploration since there is much of value to be found, and because it embodies much of what humanity is about. Even when we do not know it yet.

Being reasonable

The ever readable Scott Alexander stimulated a post on Practical Ethics about defaults, status quo, and disagreements about sex. The quick of it: our culture sets defaults on who is reasonable or unreasonable when couples disagree, and these become particularly troubling when dealing with biomedical enhancements of love and sex. The defaults combine with status quo bias and our scepticism for biomedical interventions to cause biases that can block or push people towards certain interventions.

Packing my circles

One of the first fractals I ever saw was the Apollonian gasket, the shape that emerges if you draw the circle internally tangent to three other tangent circles. It is somewhat similar to the Sierpinski triangle, but has a more organic flair. I can still remember opening my copy of Mandelbrot’s The Fractal Geometry of Nature and encountering this amazing shape. There are a lot of interesting things going on here.

Here is a simple algorithm for generating related circle packings, trading recursion for flexibility:

1. Start with a domain and calculate the distance to the border for all interior points.
2. Place a circle of radius $\alpha d^*$ at the point with maximal distance $d^*=\max d(x,y)$ from the border.
3. Recalculate the distances, treating the new circle as a part of the border.
4. Repeat (2-3) until the radius becomes smaller than some tolerance.

This is easily implemented in Matlab if we discretize the domain and use an array of distances $d(x,y)$ , which is then updated $d(x,y) \leftarrow \min(d(x,y), D(x,y))$ where $D(x,y)$ is the distance to the circle. This trades exactness for some discretization error, but it can easily handle nearly arbitrary shapes.
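The steps above can be sketched in a few lines. This is not my original Matlab code but a minimal pure-Python version for the unit square; the function name `pack_circles`, the grid size and the tolerance are all illustrative choices, and the grid is kept small so the plain loops stay fast.

```python
import math

def pack_circles(n=80, alpha=1.0, tol=0.02, max_circles=200):
    """Greedy circle packing of the unit square on an n-by-n grid.

    Returns a list of (x, y, radius) tuples in order of placement.
    """
    xs = [(i + 0.5) / n for i in range(n)]  # grid point coordinates
    # Step 1: distance from each grid point to the border of the square.
    dist = [[min(x, 1 - x, y, 1 - y) for x in xs] for y in xs]
    circles = []
    while len(circles) < max_circles:
        # Step 2: find the grid point furthest from the current border.
        d_star, row, col = max(
            (dist[r][c], r, c) for r in range(n) for c in range(n))
        radius = alpha * d_star
        if radius < tol:
            break  # Step 4: stop once circles get too small.
        cx, cy = xs[col], xs[row]
        circles.append((cx, cy, radius))
        # Step 3: the new circle becomes part of the border, so
        # d(x,y) <- min(d(x,y), D(x,y)); points inside it go negative.
        for r in range(n):
            for c in range(n):
                to_circle = math.hypot(xs[c] - cx, xs[r] - cy) - radius
                dist[r][c] = min(dist[r][c], to_circle)
    return circles

circles = pack_circles()
print(len(circles), "circles; largest radius", circles[0][2])
```

Other domains only need a different initial `dist` array, and setting `alpha` below 1 gives the looser, sponge-like packings. The price is that circle centres are snapped to grid points, so tangencies are only approximate.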

It is interesting to note that the topology is Apollonian nearly everywhere: as soon as three circles form a curvilinear triangle the interior will be a standard gasket if $\alpha=1$ .

Number of circles larger than a certain radius in packing in blob shape.

In the above pictures the first circle tends to dominate. In fact, the size distribution of circles is a power law: the number of circles larger than $r$ grows as $N(r)\propto r^{-\delta}$ as $r$ approaches zero, with $\delta \approx 1.3$. This is unsurprising: given a generic curved triangle, the inscribed circle will be a fraction of the radii of the bordering circles. If one looks at integral circle packings it is possible to see that the curvatures of subsequent circles grow quadratically along each “horn”, but different “horns” have different growths. Because of the curvature the self-similarity is nontrivial: there is actually, as far as I know, still no analytic expression of the fractal dimension of the gasket. Still, one can show that the packing exponent $\delta$ is the Hausdorff dimension of the gasket.

Anyway, to make the first circle less dominant we can either place a non-optimal circle somewhere, or use lower $\alpha$ .

Apollonian packing in square with central circle of radius 1/6.

If we place a circle in the centre of a square with a radius smaller than the distance to the edge, it gets surrounded by larger circles.

If the circle is misaligned, it is no problem for the tiling: any discrepancy can be filled with sufficiently small circles. There is however room for arbitrariness: when a bow-tie-shaped region shows up there are often two possible ways of placing a maximal circle in it, and whichever gets selected breaks the symmetry, typically producing more arbitrary bow-ties. For “neat” arrangements with the right relationships between circle curvatures and positions this does not happen (they have circle chains corresponding to various integer curvature relationships), but the generic case is a mess. If we move the seed circle around, the rest of the arrangement shows both random jitter and occasional large-scale reorganizations.

When we let $\alpha<1$ we get sponge-like fractals: these are relatives of the Menger sponge and the Cantor set. The domain gets an infinity of circles punched out of itself, with a total area approaching the area of the domain, so the measure of what remains goes to zero.

That these images have an organic look is not surprising. Vascular systems likely grow by finding the locations furthest away from existing vascularization, then filling in the gaps recursively (OK, things are a bit more complex).

How small is the wiki?

Recently I encountered a specialist Wiki. I pressed “random page” a few times, and got a repeat page after 5 tries. How many pages should I expect this small wiki to have?

We can compare this to the German tank problem. Note that it is different; in the tank problem we have a maximum sample (as if the web pages on the site were numbered), while here we have the number of samples before repetition.

We can of course use Bayes’ theorem for this. If I get a repeat after $k$ random samples, the posterior distribution of $N$, the number of pages, is $P(N|k) = P(k|N)P(N)/P(k)$.

If I randomly sample from $N$ pages, the probability of getting a repeat on my second try is $1/N$, on my third try $2/N$, and so on: $P(k|N)=(k-1)/N$. Of course, there have to be at least $k-1$ pages, otherwise a repeat must have happened before step $k$, so this is valid for $k \leq N+1$. Otherwise, $P(k|N)=0$ for $k>N+1$.

The prior $P(N)$ needs to be decided. One approach is to assume that websites have a power-law distributed number of pages. The majority are tiny, and then there are huge ones like Wikipedia; the exponent is close to 1. This gives us $P(N) = N^{-\alpha}/\zeta(\alpha)$ . Note the appearance of the Riemann zeta function as a normalisation factor.

We can calculate $P(k)$ by summing over the different possible $N$ : $P(k)=\sum_{N=1}^\infty P(k|N)P(N) = \frac{k-1}{\zeta(\alpha)}\sum_{N=k-1}^\infty N^{-(\alpha+1)}$ $=\frac{k-1}{\zeta(\alpha)}(\zeta(\alpha+1)-\sum_{i=1}^{k-2}i^{-(\alpha+1)})$ .

Putting it all together we get $P(N|k)=N^{-(\alpha+1)}/(\zeta(\alpha+1) -\sum_{i=1}^{k-2}i^{-(\alpha+1)})$ for $N\geq k-1$ . The posterior distribution of number of pages is another power-law. Note that the dependency on $k$ is rather subtle: it is in the support of the distribution, and the upper limit of the partial sum.

What about the expected number of pages in the wiki? $E(N|k)=\sum_{N=1}^\infty N P(N|k) = \sum_{N=k-1}^\infty N^{-\alpha}/(\zeta(\alpha+1) -\sum_{i=1}^{k-2}i^{-(\alpha+1)})$ $=\frac{\zeta(\alpha)-\sum_{i=1}^{k-2} i^{-\alpha}}{\zeta(\alpha+1)-\sum_{i=1}^{k-2}i^{-(\alpha+1)}}$ . The expectation is the ratio of the zeta functions of $\alpha$ and $\alpha+1$ , minus the first $k-2$ terms of their series.

Distribution of $P(N|5)$ for $\alpha=1.1$.

So, what does this tell us about the wiki I started with? Assuming $\alpha=1.1$ (close to the behavior of big websites), it predicts $E(N|k)\approx 21.28$ . If one assumes a higher $\alpha=2$ the number of pages would be 7 (which was close to the size of the wiki when I looked at it last night – it has grown enough today for k to equal 13 when I tried it today).
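The expectation formula is easy to evaluate numerically. Here is a small Python sketch, with `tail_sum` and `expected_pages` being my own hypothetical helper names. Since $\zeta(\alpha)-\sum_{i=1}^{k-2} i^{-\alpha}$ is just the tail sum $\sum_{N=k-1}^\infty N^{-\alpha}$, the code sums the first terms directly and estimates the rest with an Euler-Maclaurin correction. That correction is my choice: the sums converge slowly when $\alpha$ is close to 1, so results there are sensitive to how the tail is handled.

```python
def tail_sum(s, start, cutoff=10000):
    """Approximate sum_{N=start}^infinity N^(-s) for s > 1, start >= 1."""
    head = sum(n ** -s for n in range(start, cutoff))
    # Euler-Maclaurin estimate of the remaining terms N >= cutoff:
    # integral + half the first term + first derivative correction.
    tail = (cutoff ** (1 - s) / (s - 1)
            + 0.5 * cutoff ** -s
            + s * cutoff ** (-s - 1) / 12)
    return head + tail

def expected_pages(k, alpha):
    """E(N|k) for a first repeat after k samples and prior exponent alpha."""
    return tail_sum(alpha, k - 1) / tail_sum(alpha + 1, k - 1)

# The alpha = 2 case with a repeat after k = 5 tries comes out at about 7.
print(expected_pages(5, 2.0))
```

Looping `expected_pages(k, alpha) / k` over a range of $k$ gives the kind of curve needed to judge a rule of thumb.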

Expected number of pages given k random views before a repeat.

So, can we derive a useful rule of thumb for the expected number of pages? Dividing by $k$ shows that $E(N|k)$ becomes roughly proportional to $k$, especially for larger $\alpha$:

$E(N|k)/k$ as a function of $k$.

So a good rule of thumb is that if you get $k$ pages before a repeat, expect between $2k$ and $4k$ pages on the site. However, remember that we are dealing with power-laws, so the variance can be surprisingly high.

Bayes’ Broadsword

Yesterday I gave a talk at the joint Bloomberg-London Futurist meeting “The state of the future” about the future of decisionmaking. Parts were updates on my policymaking 2.0 talk (turned into this chapter), but I added a bit more about individual decisionmaking, rationality and forecasting.

The big idea of the talk: ensemble methods really work in a lot of cases. Not always, not perfectly, but they should be among the first tools to consider when trying to make a robust forecast or decision. They are Bayes’ broadsword.

Forecasting

One of my favourite experts on forecasting is J Scott Armstrong. He has stressed the importance of evidence-based forecasting, including checking how well different methods work. The general answer is: not very well, yet people keep on using them. He has been pointing this out since the 70s. It also turns out that expertise only gets you so far: expert forecasts are not very reliable either, and the accuracy levels out quickly with increasing level of expertise. One implication is that one should at least get cheap experts since they are about as good as the pricey ones. It is also known that simple models for forecasting tend to be more accurate than complex ones, especially in complex and uncertain situations (see also Haldane’s “The Dog and the Frisbee”). Another important insight is that it is often better to combine different methods than to try to select the one best method.

Another classic look at prediction accuracy is Philip Tetlock’s Expert Political Judgment (2005) where he looked at policy expert predictions. They were only slightly more accurate than chance, worse than basic extrapolation algorithms, and there was a negative link to fame: high profile experts have an incentive to be interesting and dramatic, but not right. However, he noticed some difference between “hedgehogs” (people with One Big Theory) and “foxes” (people using multiple theories), with the foxes outperforming hedgehogs.

OK, so in forecasting it looks like using multiple methods, theories and data sources (including experts) is a way to get better results.

Statistical machine learning

A standard problem in machine learning is to classify something into the right category from data, given a set of training examples. For example, given medical data such as age, sex, and blood test results, diagnose which disease a patient might suffer from. The key problem is that it is non-trivial to construct a classifier that works well on data different from the training data. It can work badly on new data, even if it works perfectly on the training examples. Two classifiers that perform equally well during training may perform very differently in real life, or even for different data.

The obvious solution is to combine several classifiers and average (or vote about) their decisions: ensemble based systems. This reduces the risk of making a poor choice, and can in fact improve overall performance if they can specialize for different parts of the data. This also has other advantages: very large datasets can be split into manageable chunks that are used to train different components of the ensemble, tiny datasets can be “stretched” by random resampling to make an ensemble trained on subsets, outliers can be managed by “specialists”, in data fusion different types of data can be combined, and so on. Multiple weak classifiers can be combined into a strong classifier this way.

The method benefits from having diverse classifiers that are combined: if they are too similar in their judgements, there is no advantage. Estimating the right weights to give to them is also important, otherwise a truly bad classifier may influence the output.
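As a rough illustration of voting with weights, here is a minimal Python sketch. The function names are hypothetical, and weighting each classifier by its accuracy on held-out data is only one simple choice among many, not a recommendation from the ensemble literature.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Return the label with the largest total weight.

    predictions: one predicted label per component classifier.
    weights: optional non-negative weights (default: one vote each).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def accuracy_weights(classifiers, validation):
    """Assumed weighting scheme: each classifier's accuracy on held-out pairs."""
    return [
        sum(clf(x) == y for x, y in validation) / len(validation)
        for clf in classifiers
    ]


if __name__ == "__main__":
    # Three toy threshold classifiers on a single number.
    classifiers = [
        lambda x: "high" if x > 5 else "low",
        lambda x: "high" if x > 2 else "low",
        lambda x: "high" if x > 8 else "low",
    ]
    validation = [(1, "low"), (4, "low"), (6, "high"), (9, "high")]
    weights = accuracy_weights(classifiers, validation)
    votes = [clf(7) for clf in classifiers]
    print(weighted_vote(votes, weights))
```

With equal weights this reduces to a plain plurality vote; with performance-based weights a single well-tested classifier can outvote two poor ones.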

Iris data classified using an ensemble of classification methods (LDA, NBC, various kernels, decision tree). Note how the combination of classifiers also roughly indicates the overall reliability of classifications in a region.

The iconic demonstration of the power of this approach was the Netflix Prize, where different teams competed to make algorithms that predicted user ratings of films from previous ratings. As part of the rules the algorithms were made public, spurring innovation. When the competition concluded in 2009, the leading entries were all ensembles whose component algorithms came from past teams. The two big lessons were (1) that combining not just the best algorithms but also less accurate ones was the key to winning, and (2) that organic organization allows the emergence of far better performance than having strictly isolated teams.

Group cognition

Condorcet’s jury theorem is perhaps the classic result in group problem solving: if a group of people hold a majority vote, and each has a probability p>1/2 of voting for the correct choice, then the probability the group will vote correctly is higher than p and will tend to approach 1 as the size of the group increases. This presupposes that votes are independent, although stronger forms of the theorem have been proven. (In reality people may have different preferences so there is no clear “right answer”.)

Probability that groups of different sizes will reach the correct decision as a function of the individual probability of voting right.
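The group probability in Condorcet's theorem is just a binomial tail, so it is easy to compute directly. A minimal sketch, assuming independent voters and an odd group size so that ties cannot occur (the function name is my own):

```python
from math import comb


def majority_correct(n, p):
    """Probability that a strict majority of n independent voters is right.

    Each voter is correct with probability p; n is assumed to be odd.
    """
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 11, 101):
        print(n, majority_correct(n, 0.6))
```

For three voters with $p=0.6$ the group is right with probability $3 \cdot 0.6^2 \cdot 0.4 + 0.6^3 = 0.648$, already better than any single voter.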

By now the pattern is likely pretty obvious. Weak decision-makers (the voters) are combined through a simple procedure (the vote) into better decision-makers.

Group problem solving is known to be pretty good at smoothing out individual biases and errors. In The Wisdom of Crowds Surowiecki suggests that the ideal crowd for answering a question in a distributed fashion has diversity of opinion, independence (each member has an opinion not determined by the others’), decentralization (members can draw conclusions based on local knowledge), and the existence of a good aggregation process turning private judgements into a collective decision or answer.

Perhaps the grandest example of group problem solving is the scientific process, where peer review, replication, cumulative arguments, and other tools make error-prone and biased scientists produce a body of findings that over time robustly (if sometimes slowly) tends towards truth. This is anything but independent: sometimes a clever structure can improve performance. However, it can also induce all sorts of nontrivial pathologies – just consider the detrimental effects status games have on accuracy or focus on the important topics in science.

Small group problem solving on the other hand is known to be great for verifiable solutions (everybody can see that a proposal solves the problem), but unfortunately suffers when dealing with “wicked problems” lacking good problem or solution formulation. Groups also have scaling issues: a team of N people needs to transmit information between all N(N-1)/2 pairs, which quickly becomes cumbersome.

One way of fixing these problems is using software and formal methods.

The Good Judgment Project (partially run by Tetlock and with Armstrong on the board of advisers) participated in the IARPA ACE program to try to improve intelligence forecasts. They used volunteers and checked their forecast accuracy (not just if they got things right, but if claims that something was 75% likely actually came true 75% of the time). This led to a plethora of fascinating results. First, accuracy scores based on the first 25 questions in the tournament predicted subsequent accuracy well: some people were consistently better than others, and it tended to remain constant. Training (such as debiasing techniques) and forming teams also improved performance. Most impressively, using the top 2% “superforecasters” in teams really outperformed the other variants. The superforecasters were a diverse group, smart but by no means geniuses, updating their beliefs frequently but in small steps.

The key to this success was that a computer- and statistics-aided process found the good forecasters and harnessed them properly (plus, the forecasts were on a shorter time horizon than the policy ones Tetlock analysed in his previous book: this enables both better forecasting and the all-important feedback on whether they worked).
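Checking whether "75% likely" claims come true 75% of the time is a calibration check. The sketch below shows one simple way to do it, together with the Brier score, a standard scoring rule for probabilistic forecasts. The function names and the choice of ten equal-width bins are my own assumptions; this is not a description of the tournament's actual scoring pipeline.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def calibration_table(forecasts, outcomes, bins=10):
    """Map bin index -> (mean stated probability, observed frequency, count)."""
    grouped = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        index = min(int(f * bins), bins - 1)  # f == 1.0 goes in the top bin
        grouped[index].append((f, o))
    table = {}
    for index, pairs in sorted(grouped.items()):
        stated = sum(f for f, _ in pairs) / len(pairs)
        observed = sum(o for _, o in pairs) / len(pairs)
        table[index] = (stated, observed, len(pairs))
    return table


if __name__ == "__main__":
    # Four forecasts of 75%, three of which came true: well calibrated.
    forecasts = [0.75, 0.75, 0.75, 0.75]
    outcomes = [1, 1, 1, 0]
    print(brier_score(forecasts, outcomes))
    print(calibration_table(forecasts, outcomes))
```

A forecaster is well calibrated when the stated and observed values in each bin are close; a lower Brier score is better, with 0 for perfect foresight.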

Another good example is the Galaxy Zoo, an early crowd-sourcing project in galaxy classification (which in turn led to the Zooniverse citizen science project). It is not just that participants can act as weak classifiers and be combined through a majority vote to become reliable classifiers of galaxy type. Since the type of some galaxies is agreed on by domain experts they can be used to test the reliability of participants, producing better weightings. But it is possible to go further, and classify the biases of participants to create combinations that maximize the benefit, for example by using overly “trigger happy” participants to find possible rare things of interest, and then check them using both conservative and neutral participants to become certain. Even better, this can be done dynamically as people slowly gain skill or change preferences.

The right kind of software and on-line “institutions” can shape people’s behavior so that they form more effective joint cognition than they ever could individually.

Conclusions

The big idea here is that it does not matter that individual experts, forecasting methods, classifiers or team members are fallible or biased, if their contributions can be combined in such a way that the overall output is robust and less biased. Ensemble methods are examples of this.

While just voting or weighting everybody equally is a decent start, performance can be significantly improved by linking the weights to how well the participants perform. Humans can easily be motivated by scoring (but look out for misalignment of incentives: the score must accurately reflect real performance and must not be gameable).

In any case, actual performance must be measured. If we cannot tell if some method is more accurate than something else, then either accuracy does not matter (because it cannot be distinguished or we do not really care), or we will not get the necessary feedback to improve it. It is known from the expertise literature that feedback is one of the key factors that make it possible to become an expert at a task.

Having a flexible structure that can change is a good approach to handling a changing world. If people have disincentives to change their mind or change teams, they will not update beliefs accurately.

I got a good question after the talk: if we are supposed to keep our models simple, how can we use these complicated ensembles? The answer is of course that there is a difference between using a complex and a complicated approach. The methods that tend to be fragile are the ones with too many free parameters, too much theoretical burden: they are the complex “hedgehogs”. But stringing together a lot of methods and weighting them appropriately merely produces a complicated model, a “fox”. Component hedgehogs are fine as long as they are weighted according to how well they actually perform.

(In fact, adding together many complex things can make the whole simpler. My favourite example is the fact that the Kolmogorov complexity of integers grows boundlessly on average, yet the complexity of the set of all integers is small – and actually smaller than some integers we can easily name. The whole can be simpler than its parts.)

In the end, we are trading Occam’s razor for a more robust tool: Bayes’ Broadsword. It might require far more strength (computing power/human interaction) to wield, but it has longer reach. And it hits hard.

Appendix: individual classifiers

I used Matlab to make the illustration of the ensemble classification. Here are some of the component classifiers. They are all based on the examples in the Matlab documentation. My ensemble classifier is merely a maximum vote between the component classifiers that assign a class to each point.

Iris data classified using a naive Bayesian classifier assuming Gaussian distributions.

Iris data classified using a decision tree.

Iris data classified using Gaussian kernels.

Iris data classified using linear discriminant analysis.

All models are wrong, some are useful – but how can you tell?

Our whitepaper about the systemic risk of risk modelling is now out. The topic is how the risk modelling process can make things worse – and ways of improving things. Cognitive bias meets model risk and social epistemology.

The basic story is that in insurance (and many other domains) people use statistical models to estimate risk, and then use these estimates plus human insight to come up with prices and decisions. It is well known (at least in insurance) that there is a measure of model risk due to the models not being perfect images of reality; ideally the users will take this into account. However, in reality (1) people tend to be swayed by models, (2) they suffer from various individual and collective cognitive biases that make their model usage imperfect and correlate their errors, (3) the markets for models, industrial competition and regulation lead to fewer models being used than there could be. Together this creates a systemic risk: everybody makes correlated mistakes and decisions, which means that when a bad surprise happens – a big exogenous shock like a natural disaster or a burst of hyperinflation, or some endogenous trouble like a reinsurance spiral or financial bubble – the joint risk of a large chunk of the industry failing is much higher than it would have been if everybody had had independent, uncorrelated models. Cue bailouts or skyscrapers for sale.
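A toy calculation shows how much correlation matters. This is purely an illustrative sketch and not a model from the whitepaper: the function names, the mixture parameter `shared`, and the numbers are all assumptions. Each of `n` firms fails with probability `q`; with probability `shared` they all rely on the same model and fail or survive together, otherwise their errors are independent.

```python
from math import comb


def prob_at_least(n, m, q):
    """P(at least m of n independent firms fail), each with probability q."""
    return sum(
        comb(n, j) * q ** j * (1 - q) ** (n - j)
        for j in range(m, n + 1)
    )


def joint_failure(n, m, q, shared):
    """Toy mixture of the fully shared and fully independent cases.

    shared = 1.0: one common model, so all firms fail together w.p. q.
    shared = 0.0: independent models, so failures follow a binomial tail.
    """
    return shared * q + (1 - shared) * prob_at_least(n, m, q)


if __name__ == "__main__":
    # Chance that at least half of 10 firms fail, each failing 5% of the time.
    for shared in (0.0, 0.5, 1.0):
        print(shared, joint_failure(10, 5, 0.05, shared))
```

In this toy setup the chance of half the industry failing at once is orders of magnitude smaller with independent models than with a single shared one, even though each firm's individual risk is identical.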

Note that this is a generic problem. Insurance is just unusually self-aware about its limitations (a side effect of convincing everybody else that Bad Things Happen, not to mention seeing the rest of the financial industry running into major trouble). When we use models the model itself (the statistics and software) is just one part: the data fed into the model, the processes of building and tuning the model, how people use it in their everyday work, how the output leads to decisions, and how the eventual outcomes become feedback to the people involved – all of these factors are important parts in making model use useful. If there is no or too slow feedback people will not learn what behaviours are correct or not. If there are weak incentives to check errors of one type, but strong incentives for other errors, expect the system to become biased towards one side. It applies to climate models and military war-games too.

The key thing is to recognize that model usefulness is not something that is directly apparent: it requires a fair bit of expertise to evaluate, and that expertise is also not trivial to recognize or gain. We often compare models to other models rather than reality, and a successful career in predicting risk may actually be nothing more than good luck in avoiding rare but disastrous events.

What can we do about it? We suggest a scorecard as a first step: comparing oneself to some ideal modelling process is a good way of noticing where one could find room for improvement. The score does not matter as much as digging into one’s processes and seeing whether they have cruft that needs to be fixed – whether it is following standards mindlessly, employees not speaking up, basing decisions on single models rather than broader views of risk, or having regulators push one in the same direction as everybody else. Fixing it may of course be tricky: just telling people to be less biased or to do extra error checking will not work; it has to be integrated into the organisation. But recognizing that there may be a problem and getting people on board is a great start.

In the end, systemic risk is everybody’s problem.
