# And the pedestrians are off! Oh no, that lady is jaywalking!

In 1983 Swedish Television began an all-evening entertainment program named Razzel. It was centred around the state lottery draw, with music, sketch comedy, and television series interspersed between the blocks. Yes, this was back in the day when there were two TV channels to choose from and more or less everybody watched. The ice age had just about ended.

One returning feature consisted of camera footage of a pedestrian crossing in Stockholm. A sports commentator well-known for his coverage of horse-racing narrated the performance of the unknowing pedestrians as if they were competing in a race. In some cases I think he even showed up to deliver flowers to the “winner”. But you would get disqualified if you had a false start or went outside the stripes!

I suspect this feature noticeably improved traffic safety for a generation.

I was reminded of this childhood memory earlier today when discussing the use of face recognition in China to detect jaywalkers and display them on a billboard to shame them. The typical response in a western audience is fear of what looks like a totalitarian social engineering program. The glee with which many responded to the news that the system had been confused by a bus ad, putting a celebrity on the board of shame, is telling.

# Is there a difference?

But compare the Chinese system to the TV program. In the Chinese case the jaywalker may be publicly shamed on the billboard… but in the cheerful 80s TV program they were shamed in front of much of the nation.

The Chinese case is somewhat more personal, since the system also displays the jaywalker’s name, but no doubt friends and neighbours would recognize you if they saw you on TV (remember, this was back when we only had two television channels and a fair fraction of people watched TV on Friday evening). There may also be SMS messages involved in some versions of the system. These act differently: now it is you who gets told off when you misbehave.

A fundamental difference may be the valence of the framing. The TV show did this as happy entertainment, more of a parody of sport television than an attempt at influencing people. The Chinese system explicitly aims at discouraging misbehaviour. The TV show encouraged positive behaviour (if only accidentally).

So the dimensions here may be the extent of the social effect (locally, or nationwide), the degree the feedback is directly personal or public, and whether it is a positive or negative feedback. There is also a dimension of enforcement: is this something that happens every time you transgress the rules, or just randomly?

In terms of actually changing behaviour making the social effect broad rather than close and personal might not have much effect: we mostly care about our standing relative to our peers, so having the entire nation laugh at you is certainly worse than your friends laughing, but still not orders of magnitude more mortifying. The personal message on the other hand sends a signal that you were observed; together with an expectation of effective enforcement this likely has a fairly clear deterrence effect (it is often not the size of the punishment that deters people from crime, but their expectation of getting caught). The negative stick of acting wrong and being punished is likely stronger than the positive carrot of a hypothetical bouquet of flowers.

# Where is the rub?

From an ethical standpoint, is there a problem here? We are subject to norm enforcement from friends and strangers all the time. What is new is the application of media and automation. They scale up the stakes and add the possibility of automated enforcement. Shaming people for jaywalking is fairly minor, but some people have lost jobs or friends, or been physically assaulted, when their social transgressions have gone viral on social media. Automated enforcement makes the panopticon effect far stronger: instead of suspecting a possibility of being observed it is a near certainty. So the net effect is stronger, more pervasive norm enforcement…

…of norms that can be observed and accurately assessed. Jaywalking is transparent in a way being rude or selfish often isn’t. We may end up in a situation where we carefully obey some norms, not because they are the most important but because they can be monitored. I do not think there is anything in principle impossible about a rudeness detection neural network, but I suspect the error rates and lack of context sensitivity would make it worse than useless in preventing actual rudeness. Goodhart’s law may even make it backfire.

So, in the end, the problem is that automated systems encode a formalization of a social norm rather than the actual fluid social norm. Having a TV commenter narrate your actions is filtered through the actual norms of how to behave, while the face recognition algorithm looks for a pattern associated with transgression rather than actual transgression. The problem is that strong feedback may then lock in obedience to the hard to change formalization rather than actual good behaviour.

# What makes a watchable watchlist?

Stefan Heck managed to troll a lot of people into googling “how to join ISIS”. Very amusing, and now a lot of people think they are on an NSA watchlist.

This kind of prank is of course why naive keyword-based watchlists are total failures. One prank and the list gets overloaded. I would be shocked if any serious intelligence agency actually used them for real. Given that people’s Facebook likes give pretty good predictions of who they are (indeed, better than many friends know them), there are better methods if you happen to be a big intelligence agency.

Still, while text and other online behavior signal a lot about a person, it might not be a great tool for making proper watchlists since there is a lot of noise. For example, this paper extracts personality dimensions from online texts and looks at civilian mass murderers. They state:

> Using this ranking procedure, it was found that all of the murderers’ texts were located within the highest ranked 33 places. It means that using only two simple measures for screening these texts, we can reduce the size of the population under inquiry to 0.013% of its original size, in order to manually identify all of the murderers’ texts.

At first, this sounds great. But for the US, that means the watchlist for being a mass murderer would currently have 41,000 entries. Given that over the past 150 years there have been about 150 mass murders in the US, this suggests that the precision is not going to be that great – most of those people are just normal people. The base rate problem crops up again and again when trying to find rare, scary people.
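The base-rate arithmetic above is easy to make explicit. In this sketch the population figure (roughly 315 million, chosen to match the ~41,000 watchlist size) and the incident count are the approximate numbers from the text, not precise data:

```python
# Rough base-rate arithmetic for the screening result quoted above.
# Population and incident counts are approximate, illustrative figures.
us_population = 315_000_000            # approximate US population
screen_fraction = 0.013 / 100          # the paper's 0.013% screening rate

flagged = us_population * screen_fraction
print(f"watchlist size: {flagged:,.0f}")        # roughly 41,000 entries

true_positives = 150                   # ~150 mass murders over 150 years
precision = true_positives / flagged
print(f"best-case precision: {precision:.2%}")  # well under 1%
```

Even in the impossibly generous case where every future mass murderer is on the list, more than 99.6% of the entries are innocent people.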

The deep problem is that there are not enough positive data points (the above paper used seven people) to make a reliable algorithm. The same issue cropped up with NSA’s SKYNET program – they also had seven positive examples and hundreds of thousands of negatives, and hence had massive overfitting (suggesting the Islamabad Al Jazeera bureau chief was a prime Al Qaeda suspect).

## Rational watchlists

The rare positive data point problem strikes any method, no matter what it is based on. Yes, looking at the social network around people might give useful information, but if you only have a few examples of bad people the system will now pick up on networks like the ones they had. This is also true for human learning: if you look too much for people like the ones that in the past committed attacks, you will focus too much on people like them and not enemies that look different. I was told by an anti-terrorism expert about a particular sign for veterans of Afghan guerrilla warfare: great if and only if such veterans are the enemy, but rather useless if the enemy can recruit others. Even if such veterans are a sizable fraction of the enemy the base rate problem may make you spend your resources on innocent “noise” veterans if the enemy is a small group. Add confirmation bias, and trouble will follow.

Note that actually looking for a small set of people on the watchlist gets around the positive data point problem: the system can look for them and just them, and this can be made precise. The problem is not watching, but predicting who else should be watched.

The point of a watchlist is that it represents a subset of something (whether people or stocks) that merits closer scrutiny. It should essentially be an allocation of attention towards items that need higher level analysis or decision-making. The U.S. Government’s Consolidated Terrorist Watch List requires nomination from various agencies, who presumably decide based on reasonable criteria (modulo confirmation bias and mistakes). The key problem is that attention is a limited resource, so adding extra items has a cost: less attention can be spent on the rest.

This is why automatic watchlist generation is likely to be a bad idea, despite much research. Mining intelligence to help an analyst figure out if somebody might fit a profile or merit further scrutiny is likely more doable. Analyst time is expensive and easily overwhelmed if something floods the input folder: HUMINT is less likely to do so than SIGINT, even if the analyst is just doing the preliminary nomination for a watchlist.

## The optimal Bayesian watchlist

One can analyse this in a Bayesian framework: assume each item has a value $x_i$ distributed as $f(x_i)$. The goal of the watchlist is to spend expensive investigatory resources to figure out the true values; say the cost is 1 per item. Then a watchlist of randomly selected items will have a mean value $V=E[x]-1$. Suppose a cursory investigation costing much less gives some indication about $x_i$, so that it is now known with some error: $y_i = x_i+\epsilon$. One approach is to select all items above a threshold $\theta$, making $V=E[x_i|y_i>\theta]-1$.

If we imagine that everything is Gaussian, $x_i \sim N(\mu_x,\sigma_x^2)$ and $\epsilon \sim N(0,\sigma_\epsilon^2)$, then $V=\frac{\int_{-\infty}^\infty t\, \phi\!\left(\frac{t-\mu_x}{\sigma_x}\right) \Phi\!\left(\frac{t-\theta}{\sigma_\epsilon}\right)dt}{\sigma_x\left(1-\Phi\!\left(\frac{\theta-\mu_x}{\sqrt{\sigma_x^2+\sigma_\epsilon^2}}\right)\right)}-1$. While one can ram through this using Owen’s useful work on Gaussian integrals, here is a Monte Carlo simulation of what happens when we use $\mu_x=0, \sigma_x^2=1, \sigma_\epsilon^2=1$ (the correlation between $x$ and $y$ is 0.707, so this is not too much noise):
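A minimal sketch of such a simulation, assuming the threshold is set at the break-even point where the mean true value of selected items just covers the unit investigation cost (the parameter values are those from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(0.0, 1.0, n)          # true item values, x ~ N(0, 1)
y = x + rng.normal(0.0, 1.0, n)      # cursory noisy estimates, y = x + eps

def break_even_threshold(scores, values, cost=1.0):
    """Smallest threshold at which the mean true value of the
    selected items covers the per-item investigation cost."""
    for t in np.linspace(0.0, 2.0, 201):
        if values[scores > t].mean() >= cost:
            return t
    return float("nan")

print(break_even_threshold(x, x))   # noiseless selection: about 0.31
print(break_even_threshold(y, x))   # noisy selection: about 1.22
```

The coarse grid scan is crude (a root-finder would be more precise) but is enough to reproduce the thresholds discussed in the text.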

Note that in this case the addition of noise forces a far higher threshold than without noise (1.22 instead of 0.31). This is just 19% of all items, while in the noise-less case 37% of items would be worth investigating. As noise becomes worse the selection for a watchlist should become stricter: a really cursory inspection should not lead to insertion unless it looks really relevant.

Here we used a mild Gaussian distribution. In terms of danger, I think people or things are more likely to be lognormally distributed, since danger is a product of many relatively independent factors. Using lognormal x and y leads to a situation where there is a maximum utility for some threshold. This is likely a problematic model, but clearly the shape of the distributions matters a lot for where the threshold should be.

Note that having huge resources can be a bane: if you build your watchlist from the top priority down as long as you have budget or manpower, the lower priority (but still above threshold!) entries will be more likely to be a waste of time and effort. The average utility will decline.

## Predictive validity matters more?

In any case, a cursory and cheap decision process is going to give so many so-so evaluations that one shouldn’t build the watchlist on it. Instead one should aim for a series of filters of increasing sophistication (and cost) to wash out the relevant items from the dross.
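A toy sketch of such a cascade, with invented stage costs, noise levels, and cut fractions: a cheap but noisy screen trims the population first, and a costlier, more accurate screen runs only on the survivors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
value = rng.normal(0.0, 1.0, n)              # latent relevance of each item

# Stage 1: cheap but very noisy screen; keep only the top 10%
cheap = value + rng.normal(0.0, 2.0, n)
survivors = value[cheap > np.quantile(cheap, 0.90)]

# Stage 2: costlier, more accurate screen on survivors; keep their top 10%
accurate = survivors + rng.normal(0.0, 0.5, survivors.size)
shortlist = survivors[accurate > np.quantile(accurate, 0.90)]

# The expensive screen ran on only 10% of the items, yet the shortlist's
# mean relevance is far above the population average of roughly zero.
print(shortlist.size, shortlist.mean())
```

The design point is that the cascade buys most of the accuracy of the expensive screen while paying its cost on only a small fraction of the population.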

But even there, there are pitfalls, as this paper looking at the pharma R&D industry shows:

> We find that when searching for rare positives (e.g., candidates that will successfully complete clinical development), changes in the predictive validity of screening and disease models that many people working in drug discovery would regard as small and/or unknowable (i.e., an 0.1 absolute change in correlation coefficient between model output and clinical outcomes in man) can offset large (e.g., 10 fold, even 100 fold) changes in models’ brute-force efficiency.

Just like for drugs (an example where the watchlist is a set of candidate compounds), it might be more important for terrorist watchlists to aim for signs with predictive power of being a bad guy, rather than being correlated with being a bad guy. Otherwise anti-terrorism will suffer the same problem of declining productivity, despite ever more sophisticated algorithms.