Shakespeare – Andart II

During a recent party I got asked the question “Since $\pi$ has an infinite decimal expansion, does that mean the collected works of Shakespeare (suitably encoded) are in it somewhere?”

My first response was to point out that an infinite decimal expansion is not enough: obviously $1/3=0.33333\ldots$ is a Shakespeare-free number (unless we have a bizarre encoding of the works in the form of all threes). What really matters is whether the number is suitably random. In mathematics this is known as the question of whether $\pi$ is a normal number.

If it is normal, then every finite digit string appears in it, so, as in the infinite monkey theorem, Shakespeare will be in the number. We actually do not know whether $\pi$ is normal, but it looks fairly likely. But that is not enough for a mathematician. A good overview of the problem can be found in a popular article by Bailey and Borwein. (Yep, one of the Borweins.)

Where are the Shakespearian numbers?

This led to a second issue: what is the distribution of the Shakespeare-containing numbers?

We can encode Shakespeare in many ways. As an ASCII text the works take up 5.3 MB. One can treat this as a sequence of 7-bit characters, making the works 37,100,000 bits, or 11,168,212 decimal digits. A simple code where each pair of digits encodes a character would give 10,600,000 digits. This allows just a 100 character alphabet rather than the 128 characters of 7-bit ASCII, but is likely OK for Shakespeare: we can use the ASCII code minus 32, for example.
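The size arithmetic can be checked with a few lines of Python. This is only a sketch: the 5.3 MB figure is taken from the text and rounded to 5,300,000 characters, and the variable names are mine.

```python
import math

SIZE_CHARS = 5_300_000  # 5.3 MB of ASCII text, rounded (assumption)

# Treat the text as one long string of 7-bit characters.
bits = SIZE_CHARS * 7

# A number with that many bits needs about bits * log10(2) decimal digits.
digits_from_bits = bits * math.log10(2)

# Alternative code: two decimal digits per character.
digits_two_per_char = SIZE_CHARS * 2

print(bits)                     # total bits
print(round(digits_from_bits))  # roughly 11.17 million digits
print(digits_two_per_char)      # digits in the two-digit code
```

The two-digit code is slightly shorter here only because it throws away part of the alphabet.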

If we denote the encoded works of Shakespeare by $[Shakespeare]$ , all numbers of the form $0.[Shakespeare]xxxxx\ldots$ are Shakespeare-containing.

They form a rather tiny interval: since the works start with ‘The’, $[Shakespeare]$ starts as “527269…” and the interval lies inside the interval $[0.527269000\ldots, 0.52727]$, a mere millionth of $[0,1]$. The actual interval is even shorter.
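Here is a minimal sketch of the "ASCII code minus 32" encoding, just to check the first six digits. The function name `encode` is hypothetical, and the sketch rejects anything outside the 100-symbol range (so a newline, ASCII 10, would need separate handling that the text does not specify).

```python
from fractions import Fraction


def encode(text: str) -> str:
    """Encode each character as two decimal digits: ASCII code minus 32."""
    out = []
    for ch in text:
        code = ord(ch) - 32
        if not 0 <= code <= 99:
            raise ValueError(f"character {ch!r} is outside the 100-symbol alphabet")
        out.append(f"{code:02d}")
    return "".join(out)


prefix = encode("The")
print(prefix)

# All numbers starting 0.<prefix>... fill an interval of this width.
width = Fraction(1, 10 ** len(prefix))
print(width)
```

Each extra encoded character shrinks the interval by another factor of 100.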

But outside that interval there are numbers of the form $0.y[Shakespeare]xxxx\ldots$ , where $y$ is a digit different from the starting digit of $[Shakespeare]$ and $x$ anything else. So there are 9 such second level intervals, each ten times thinner than the first level interval.

This pattern continues, with the intervals at each level ten times thinner but also 9 times as numerous. This is fairly similar to the Cantor set and gives rise to a fractal. But since the intervals are very tiny it is hard to see.

One way of visualizing this is to assume the weird encoding $[Shakespeare]=3$ , so all numbers containing the digit 3 in the decimal expansion are Shakespearian and the rest are Shakespeare-free.

Distribution of Shakespeare-free numbers in the unit interval, assuming Shakespeare’s collected works are encoded as the digit “3”.

The fractal dimension of this Shakespeare-free set is $\log(9)/\log(10)\approx 0.9542$. This is less than 1: most points are Shakespearian and in one of the intervals, but since they are thin compared to the line the Shakespeare-free set is nearly one dimensional. Like the Cantor set, each Shakespeare-free number is separated from any other Shakespeare-free number: there are always some Shakespearian numbers between them.
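For the toy encoding this is easy to check by brute force. The sketch below (function name mine) counts the $n$-digit strings that avoid the digit 3: there are $9^n$ of them out of $10^n$, which is where the $\log(9)/\log(10)$ comes from.

```python
import math


def count_three_free(n: int) -> int:
    """Count n-digit decimal strings (leading zeros allowed) with no digit 3."""
    return sum(1 for k in range(10 ** n) if "3" not in f"{k:0{n}d}")


for n in range(1, 5):
    free = count_three_free(n)
    # At resolution 10**-n, 9**n of the 10**n boxes still hold Shakespeare-free numbers.
    print(n, free, free / 10 ** n)

# Box-counting dimension: log(boxes needed) / log(1 / box size).
dimension = math.log(9) / math.log(10)
print(dimension)
```

The fraction of Shakespeare-free prefixes shrinks as $(9/10)^n$, even though the dimension stays close to 1.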

In the case of the full 5.3 MB $[Shakespeare]$ the interval length is around $10^{-10,600,000}$. The fractal dimension of the Shakespeare-free set is $\log(10^{10,600,000} - 1)/\log(10^{10,600,000}) \approx 1-\epsilon$, for some tiny $\epsilon \approx 10^{-10,600,000}$. It is very nearly an unbroken line… except that nearly every point actually does contain Shakespeare.
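To put a slightly sharper number on $\epsilon$: writing $N = 10,600,000$, the ratio is $1+\ln(1-10^{-N})/(N\ln 10)$, so $\epsilon\approx 10^{-N}/(N\ln 10)$. Since $10^{-N}$ underflows any floating point type, this sketch works with $\log_{10}\epsilon$ instead. It takes the dimension formula above as given; the variable names are mine.

```python
import math

N = 10_600_000  # number of digits in the two-digits-per-character encoding

# epsilon ~ 10**-N / (N * ln 10), so take log10 of both sides.
log10_eps = -N - math.log10(N * math.log(10))

print(log10_eps)
```

So $\epsilon$ comes out a few orders of magnitude below $10^{-10,600,000}$, which makes no practical difference at this scale.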

We have been looking at the unit interval. We can of course look at the entire real line too, but the pattern is similar: just magnify the unit interval pattern by 10, 100, 1000, … times. Somewhere around $10^{10,600,000}$ there are the numbers that have an integer part equal to $[Shakespeare]$ . And above them are the intervals that start with his works followed by something else, a decimal point and then any decimals. And beyond them there are the $[Shakespeare][Shakespeare]xxx\ldots$ numbers…

Shakespeare is common

One way of seeing that Shakespearian numbers are the generic case is to imagine choosing a number randomly. It has probability $S$ of being in the level 1 interval of Shakespearian numbers. If not, then it will be in one of the 9 intervals 1/10 long that don’t start with the correct first digit, where the probability of starting with Shakespeare in the second digit is $S$. If that was all there was, the total probability would be $S+(9/10)S+(9/10)^2S+\ldots = 10S<1$. But the 1/10 interval around the first Shakespearian interval also counts: a number that has the right first digit but wrong second digit can still be Shakespearian. So it will add probability.
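The level-by-level sum, with $9^k$ intervals of length $S/10^k$ at level $k$, can be checked with exact fractions. This is a sketch with a toy value of $S$ (the six-digit prefix only), since the real $S$ is far too small to be interesting numerically; `level_sum` is my own helper name.

```python
from fractions import Fraction


def level_sum(S: Fraction, levels: int) -> Fraction:
    """Total length of 9**k intervals of length S / 10**k, for k = 0 .. levels-1."""
    return sum((Fraction(9, 10) ** k * S for k in range(levels)), Fraction(0))


# Toy value (assumption): probability of starting with the six digits 527269.
S = Fraction(1, 10 ** 6)

for levels in (1, 2, 3, 50, 500):
    # The ratio to S creeps up towards 10, never beyond.
    print(levels, float(level_sum(S, levels) / S))
```

This count alone never gets past $10S$, which is why the intervals it leaves out matter.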

Another way of thinking about it is just to look at the initial digits: the probability of starting with $[Shakespeare]$ is $S$ , the probability of starting with $[Shakespeare]$ in position 2 is $(1-S)S$ (the first factor is the probability of not having Shakespeare first), and so on. So the total probability of finding Shakespeare is $S + (1-S)S + (1-S)^2S + (1-S)^3S + \ldots = S/(1-(1-S))=1$ . So nearly all numbers are Shakespearian.
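The partial sums of that series are $1-(1-S)^n$, which shows how slowly the probability climbs. A small sketch, again with a toy $S$ and treating the starting positions as independent, as the argument above does (the function name is mine):

```python
import math


def prob_found_within(S: float, n: int) -> float:
    """Partial sum S + (1-S)S + ... + (1-S)**(n-1) S, which equals 1 - (1-S)**n."""
    # log1p/expm1 keep precision when S is tiny.
    return -math.expm1(n * math.log1p(-S))


S = 1e-6  # toy value (assumption): a six-digit pattern
for n in (10 ** 3, 10 ** 6, 10 ** 7, 10 ** 8):
    print(n, prob_found_within(S, n))
```

You need on the order of $1/S$ starting positions before finding the pattern becomes likely. For the real $[Shakespeare]$ that is around $10^{10,600,000}$ digits.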

This might seem strange, since any number you are likely to mention is very likely Shakespeare-free. But this is just like the case of transcendental, normal or uncomputable numbers: they are actually the generic case in the reals, but most everyday numbers belong to the algebraic, non-normal and computable numbers.

It is also worth remembering that while all normal numbers are Shakespearian (a normal number contains every finite digit string), there are non-normal Shakespearian numbers. For example, the rational number $0.[Shakespeare]000\ldots$ is non-normal but Shakespearian. So is $0.[Shakespeare][Shakespeare][Shakespeare]\ldots$. We can throw in arbitrary finite sequences of digits between the Shakespeares, biasing numbers as close or far as we want from normality. There is a number $0.[Shakespeare]3141592\ldots$ that has the digits of $\pi$ plus Shakespeare. And there is a number that looks like $\pi$ for its first Graham’s number of digits, then has a single Shakespeare and then continues. Shakespeare can hide anywhere.

In things of great receipt with ease we prove,
Among a number one is reckoned none.
Then in the number let me pass untold,
Though in thy store’s account I one must be
-Sonnet 136