Trustworthiness of Covid-19 data – a Benford’s Law analysis

June 29, 2021


The year was 1881. There were no calculators, so anyone who needed to compute something more complex, such as a logarithm, had to use logarithm tables: big catalogs listing precomputed logarithm values.

One day, while using such a log table, Simon Newcomb, a Canadian-American astronomer, noticed something peculiar: the first few pages of the catalog (the ones for numbers starting with 1) were much more worn out, meaning more touched and used by people, than the others. He decided to investigate the reasons behind this phenomenon.

Boy, did he have a lot of free time.

Nevertheless, something great came out of this. What he discovered is that in a large **randomly** produced set of natural numbers, the first digit is more likely to be small than large. That is, there’s a higher chance for the first digit to be 1 rather than 2, 2 rather than 3, and so on.

Pretty interesting, right?

Then, about 60 years later, in 1938, physicist Frank Benford came along, noticed the same pattern, tested it on 20 different domains, from rivers to molecular weights and the population of the US, and got credited for it. Classic.

That’s how Benford’s Law was born. And what it states is that real-world distributions that span several orders of magnitude rather uniformly (e.g., populations of cities, or stock-market prices), are likely to satisfy Benford's Law to a very high accuracy. However, a distribution that is mostly or entirely within one order of magnitude is unlikely to satisfy Benford's Law very accurately, or at all (take the heights of human adults for example, which mostly span only between 150 and 200 cm).

The exact distribution of the first digits according to Benford’s Law can be observed in Graph 1, where each bar represents a digit, and the height of the bar is the percentage of numbers that start with that digit.

And the easiest way to understand why this happens is through logarithms. Suppose the values are spread evenly on a logarithmic scale, so that the probability of a leading digit d is proportional to the space between d and d + 1 on that scale. A number x between 1 and 10 starts with digit 1 if log 1 ≤ log x < log 2, and with digit 9 if log 9 ≤ log x < log 10 (the same holds in any other order of magnitude, after shifting by a whole number). Since the interval [log 1, log 2) is much wider than the interval [log 9, log 10) (about 0.30 vs 0.05 in base 10), there is a higher chance for x to fall into the wider interval, i.e., to start with 1 rather than with 9. I don’t know about you, but I was a bit mind blown.
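To make those interval widths concrete, here is a minimal Python sketch. It relies on the fact that the width of the interval between log d and log (d + 1) in base 10 equals log10(1 + 1/d); the function name `benford_expected` is just an illustrative choice.

```python
import math

def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law.

    The share of digit d is the width of [log10(d), log10(d + 1)),
    which simplifies to log10(1 + 1/d).
    """
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_expected()
for digit, share in expected.items():
    # Print each digit with its expected share as a percentage.
    print(f"{digit}: {share:.1%}")
```

The nine shares add up to 1, because the nine intervals together cover exactly one order of magnitude.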

The keyword in everything presented so far is **randomly**. If the data weren’t random enough, or if, say, someone meddled with it (turns out humans are incapable of producing truly random numbers), then Benford’s Law wouldn’t apply.

That’s how you get an incredibly powerful anti-fraud tool out of something as basic as a first digit. As we speak, this technique is being used to detect tax evasion, electoral fraud, and the like. Based on the plausible assumption that people who fabricate figures tend to distribute their digits fairly uniformly, a simple comparison of first digit frequency distribution from the data with the expected distribution according to Benford's Law ought to show up any anomalous results.
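The comparison itself starts with counting first digits. Below is a rough sketch of that step in Python; the helper names are hypothetical, and the sample data (powers of 2, a sequence known to follow Benford's Law closely) is a stand-in for real figures, not anything taken from the analysis in this article.

```python
from collections import Counter

def first_digit(n):
    """Return the leading digit of a non-zero integer."""
    return int(str(abs(int(n)))[0])

def first_digit_shares(values):
    """Share of each leading digit 1..9 among the positive values."""
    digits = [first_digit(v) for v in values if int(v) > 0]
    if not digits:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Stand-in data: the first 200 powers of 2.
sample = [2 ** k for k in range(1, 201)]
observed = first_digit_shares(sample)
for digit, share in observed.items():
    print(f"{digit}: {share:.1%}")
```

Fabricated figures with evenly spread first digits would show shares near 11% for every digit instead of the steep drop from 1 to 9.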

Now let’s talk Covid-19 data. Since about March 2020, when the pandemic reached the entire globe, almost every government started sharing information about new and total cases of the new disease, and about new and total deaths caused by it. Normally, this kind of data wouldn’t have any reason not to obey Benford’s Law: it spans several orders of magnitude, and it’s indeed random. Or is it?

Using data from the World Health Organization, I analyzed the reporting trustworthiness of 121 countries. To keep the results reliable, I only included countries with more than 10K total cases and, for computational ease, I combined total cases and total deaths (Total) and new cases and new deaths (New), making sure to de-duplicate values and to exclude 0 values for days on which a government didn’t report. My timeframe spans from the end of February 2020 to the end of April 2021.
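The preparation steps can be sketched roughly as follows. This is not the exact procedure used for the analysis: the function names are hypothetical, the inputs are assumed to be plain lists of daily values per country, and "de-duplicate" is read here as dropping consecutive repeats in the cumulative totals (which appear when a total is carried forward on a non-reporting day). Other readings are possible.

```python
MIN_TOTAL_CASES = 10_000  # inclusion threshold mentioned in the text

def clean(series, drop_repeats=False):
    """Drop zero (or negative) values and, optionally, consecutive repeats."""
    cleaned = []
    for value in series:
        if value <= 0:
            continue  # a 0 usually means "nothing reported that day"
        if drop_repeats and cleaned and cleaned[-1] == value:
            continue  # assumption: a repeated total is a carried-forward value
        cleaned.append(value)
    return cleaned

def prepare_country(total_cases, total_deaths, new_cases, new_deaths):
    """Return the combined (Total, New) lists, or None for excluded countries."""
    if max(total_cases, default=0) <= MIN_TOTAL_CASES:
        return None
    total = clean(total_cases, drop_repeats=True) + clean(total_deaths, drop_repeats=True)
    new = clean(new_cases) + clean(new_deaths)
    return total, new

# Toy values for illustration only, not real WHO figures.
result = prepare_country(
    total_cases=[0, 5000, 5000, 12000],
    total_deaths=[0, 10, 10, 30],
    new_cases=[0, 5000, 0, 7000],
    new_deaths=[0, 10, 0, 20],
)
print(result)
```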

After calculating first digit percentages, I also employed a method called MAD (mean absolute deviation) to get a better overview of precisely how much a set of data deviates from Benford’s Law. And I got some interesting results.

By the way, MAD is computed with the following formula: MAD = (∑|O − B|) / 9, where O is the observed percentage for a given first digit, B is the percentage Benford predicts for that digit, and the sum runs over all first digits. Absolute values are necessary so that positive and negative deviations don’t cancel each other out. And 9 because there are, well, 9 possible first digits (1 through 9). You can see MAD as the average deviation from Benford’s Law: the higher the MAD, the madder the data.
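As a sketch, the MAD formula translates to a few lines of Python. The function name `benford_mad` is made up for this example, and it assumes the observed distribution is given in percentages (0 to 100) per first digit, which is consistent with the MAD values quoted in this article.

```python
import math

def benford_mad(observed_pct):
    """Mean absolute deviation from Benford's Law, in percentage points.

    observed_pct maps each first digit 1..9 to its observed percentage.
    """
    expected_pct = {d: 100 * math.log10(1 + 1 / d) for d in range(1, 10)}
    total_deviation = sum(
        abs(observed_pct.get(d, 0.0) - expected_pct[d]) for d in range(1, 10)
    )
    return total_deviation / 9

# Reference point: every first digit equally likely.
uniform = {d: 100 / 9 for d in range(1, 10)}
print(round(benford_mad(uniform), 2))
```

For scale, a perfectly uniform first-digit distribution scores a MAD of about 6 on this measure, while data that matches Benford's Law exactly scores 0.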

When taking into account aggregated results from all countries, things look pretty good. As you can see below, the graphs for Total and New are not that different from what we’ve seen above, for Benford’s Law, and MAD is only 0.5 for Total and 0.4 for New.

However, when we take each country individually, things get a little bit tricky, i.e., MAD values start varying.

Below you can see the 10 best and the 10 worst countries by MAD value for Total cases and deaths. Unexpected? A bit.

And what’s even more curious is that the lists for New cases and deaths are almost entirely different from the ones for Totals.

A better perspective is obtained by plotting data for each country on a graph, where the X axis represents the MAD value for Total cases and deaths, and the Y axis represents the MAD value for New cases and deaths.

The extremes are immediately noticeable. What’s also noticeable is that average values lie somewhere between 4 and 5 for Total and between 3 and 4 for New. We could say that this is the limit of understandable deviation from Benford’s Law.

Romania is pretty well situated, with 3.48 MAD for Total cases and deaths and 2.4 MAD for New cases and deaths.

And, to add a little bit of salt and pepper, I decided to dig a little deeper and also look at how Romania stands in terms of Covid-19 tests and vaccinations reporting. The results are similar: a 3.3 MAD for tests and a 3.9 MAD for vaccinations.

Ok, we’ll take it.

All these findings are interesting, for sure. But what they don’t tell us is *why* they are like this.

I did try building a regression model, to see whether higher MAD values are explained by the low Human Development Index value of a country, but without any luck. There is no correlation there.
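For readers who want to try a similar check, here is a minimal sketch of a simple linear regression in plain Python. It is not the model used above: the function name is hypothetical, and the HDI and MAD values are made-up placeholders, not results from this analysis.

```python
def linear_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy) if syy else 0.0
    return intercept, slope, r_squared

# Placeholder values for illustration only (not real country data).
hdi = [0.5, 0.6, 0.7, 0.8, 0.9]
mad = [4.0, 3.0, 5.0, 3.5, 4.5]
intercept, slope, r_squared = linear_fit(hdi, mad)
print(f"slope={slope:.2f}, R^2={r_squared:.2f}")
```

An R² close to 0 means the Human Development Index explains almost none of the variation in MAD, which is what "no correlation" looks like in this kind of model.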

So, if it’s not the level of development, what is it?

Maybe politics or economics? Culture?

One can only wonder.