# How small is the wiki?

Recently I encountered a specialist Wiki. I pressed “random page” a few times, and got a repeat page after 5 tries. How many pages should I expect this small wiki to have?

We can compare this to the German tank problem. Note that it is different; in the tank problem we have a maximum sample (maybe like the web pages on the site were numbered), while here we have number of samples before repetition.

We can of course use Bayes theorem for this. If I get a repeat after $k$ random samples, the posterior distribution of $N$, the number of pages, is $P(N|k) = P(k|N)P(N)/P(k)$.

If I randomly sample from $N$ pages, the probability of getting a repeat on my second try is $1/N$, on my third try $2/N$, and so on: $P(k|N)=(k-1)/N$. Of course, there has to be more pages than $k-1$, otherwise a repeat must have happened before step $k$, so this is valid for $k \leq N+1$. Otherwise, $P(k|N)=0$ for $k>N+1$.

The prior $P(N)$ needs to be decided. One approach is to assume that websites have a power-law distributed number of pages. The majority are tiny, and then there are huge ones like Wikipedia; the exponent is close to 1. This gives us $P(N) = N^{-\alpha}/\zeta(\alpha)$. Note the appearance of the Riemann zeta function as a normalisation factor.

We can calculate $P(k)$ by summing over the different possible $N$: $P(k)=\sum_{N=1}^\infty P(k|N)P(N) = \frac{k-1}{\zeta(\alpha)}\sum_{N=k-1}^\infty N^{-(\alpha+1)}$ $=\frac{k-1}{\zeta(\alpha)}(\zeta(\alpha+1)-\sum_{i=1}^{k-2}i^{-(\alpha+1)})$.

Putting it all together we get $P(N|k)=N^{-(\alpha+1)}/(\zeta(\alpha+1) -\sum_{i=1}^{k-2}i^{-(\alpha+1)})$ for $N\geq k-1$. The posterior distribution of number of pages is another power-law. Note that the dependency on $k$ is rather subtle: it is in the support of the distribution, and the upper limit of the partial sum.

What about the expected number of pages in the wiki? $E(N|k)=\sum_{N=1}^\infty N P(N|k) = \sum_{N=k-1}^\infty N^{-\alpha}/(\zeta(\alpha+1) -\sum_{i=1}^{k-2}i^{-(\alpha+1)})$ $=\frac{\zeta(\alpha)-\sum_{i=1}^{k-2} i^{-\alpha}}{\zeta(\alpha+1)-\sum_{i=1}^{k-2}i^{-(\alpha+1)}}$. The expectation is the ratio of the zeta functions of $\alpha$ and $\alpha+1$, minus the first $k-2$ terms of their series.

So, what does this tell us about the wiki I started with? Assuming $\alpha=1.1$ (close to the behavior of big websites), it predicts $E(N|k)\approx 21.28$. If one assumes a higher $\alpha=2$ the number of pages would be 7 (which was close to the size of the wiki when I looked at it last night – it has grown enough today for k to equal 13 when I tried it today).

So, can we derive a useful rule of thumb for the expected number of pages? Dividing by $k$ shows that $E(N|k)$ approaches proportionality, especially for larger $\alpha$:

So a good rule of thumb is that if you get $k$ pages before a repeat, expect between $2k$ and $4k$ pages on the site. However, remember that we are dealing with power-laws, so the variance can be surprisingly high.

# Rational fractal distributions

Most of the time we encounter probability distributions over the reals, the positive reals, or integers. But one can use the rational numbers as a probability space too.

Recently I found the paper Vladimir Trifonov, Laura Pasqualucci, Riccardo Dalla-Favera & Raul Rabadan. Fractal-like Distributions over the Rational Numbers in High-throughput Biological and Clinical Data. Scientific Reports 1:191 DOI: 10.1038/srep00191. They discuss the distribution of ratios of the number of reads from the same spot of DNA that come from each chromosome in a pair: the number of reads is an integer, so the ratio is rational. They get a peaky, self-similar distribution empirically, and the paper explains why.

If you take positive independent integers from some distribution $f(n)$ and generate ratios $q=a/(a+b)$, then those ratios will have a distribution that is a convolution over the rational numbers: $g(q) = g(a/(a+b)) = \sum_{m=0}^\infty \sum_{n=0}^\infty f(m) g(n) \delta \left(\frac{a}{a+b} - \frac{m}{m+n} \right ) = \sum_{t=0}^\infty f(ta)f(tb)$

One can of course do the same for non-independent and different distributions of the integers. Oh, and by the way: this whole thing has little to do with ratio distributions (alias slash distributions), which is what happens in the real case.

The authors found closed form solutions for integers distributed as a power-law with an exponential cut-off and for the uniform distribution; unfortunately the really interesting case, the Poisson distribution, doesn’t seem to have a neat closed form solution.

In the case of a uniform distributions on the set $\{1,2,\ldots , L\}$ they get $g(a/(a+b)) = (1/L^2) \lfloor L/\max(a,b) \rfloor$.

They note that this is similar to Thomae’s function, a somewhat well-known (and multiply named) counterexample in real analysis. That function is defined as f(p/q)=1/q (where the fraction is in lowest terms). In fact, both graphs have the same fractal dimension of 1.5.

It is easy to generate other rational distributions this way. Using a power law as an input produces a sparser pattern, since the integers going into the ratio tend to be small numbers, putting more probability at simple ratios:

If we use exponential distributions the pattern is fairly similar, but we can of course change the exponent to get something that ranges over a lot of numbers, putting more probability at nonsimple ratios $p/q$ where $p+q \gg 1$:

Not everything has to be neat and symmetric. Taking the ratio of two unequal Poisson distributions can produce a rather appealing pattern:

Of course, full generality would include ratios of non-positive numbers. Taking ratios of normal variates rounded to the nearest integer produces a fairly sparse distribution since high numerators or denominators are rare.

But multiplying the variates by 10 produces a nice distribution.

This approaches the Chauchy distribution as the discretisation gets finer. But note the fun microstructure (very visible in the Poisson case above too), where each peak at a simple ratio is surrounded by a “moat” of low probability. This is reminiscent of the behaviour of roots of random polynomials with integer coefficients (see also John Baez page on the topic).

The rational numbers do tend to induce a fractal recursive structure on things, since most measures on them will tend to put more mass at simple ratios than at complex ratios, but when plotting the value of the ratio everything gets neatly folded together. The lower approximability of numbers near the simple ratios produce moats. Which also suggests a question to ponder further: what role does the über-unapproximable golden ratio have in distributions like these?