Bad Data

One of the subjects I find difficult to convey is the rather unintuitive idea that objectively bad (or otherwise flawed) data is useful. Surprisingly, errors do not invalidate data, nor obligate us to throw it away.

To illustrate this, I like to use an example from signal and data processing: analog-to-digital conversion. Say you have a 10-bit A-to-D converter. If you don’t know what this is, it is a device that takes in an electrical signal on an input and outputs an integer value between 0 and 1023. With only a calculator, one can then scale that digital value in various ways, “cooking” it into something useful (like a discrete temperature or humidity value).
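Here is a minimal sketch of that “cooking” step in Python. The sensor characteristics (a 3.3 V reference and a linear 0–100 °C range) are hypothetical, chosen only to make the arithmetic concrete:

```python
# Toy example: scaling a raw 10-bit ADC reading into a temperature.
# The sensor characteristics here (3.3 V reference, linear 0-100 degC
# range) are hypothetical, chosen only to illustrate the arithmetic.

ADC_MAX = 1023        # largest value a 10-bit converter can output
V_REF = 3.3           # reference voltage in volts (assumed)
TEMP_RANGE = 100.0    # sensor spans 0-100 degC across 0-V_REF (assumed)

def adc_to_temperature(raw: int) -> float:
    """Scale a raw ADC count (0-1023) into degrees Celsius."""
    voltage = raw / ADC_MAX * V_REF          # counts -> volts
    return voltage / V_REF * TEMP_RANGE      # volts -> degC

print(f"{adc_to_temperature(512):.2f} degC")   # about 50.05 degC, mid-scale
```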

Now imagine that your signal is further afflicted by 60 Hz line noise bleeding over from your power source, adding error to the resulting A/D conversion. Each individual reading is objectively worse. Because of the noise, a value that should be, say, 512 might fluctuate between 510, 511, 512, 513, and so forth.

This creates a lot of uncertainty, and you might think it makes the data useless. But if you take a whole bunch of readings over time and average the results together, something rather remarkable happens. You can increase your effective resolution—the precision—from 10 bits to 11, 12, or even 13 bits. Now, if the real value is 512.3, you can represent it more closely than either of the nearest integers (512 or 513). Even with just two points, the average of 512 and 513 is 512.5, which is closer to the real value of 512.3. It gets even better the more data points you have.

This is called oversampling.
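A small simulation makes the effect visible. The “true” value of 512.3 counts and the noise level of roughly one count are made up for illustration:

```python
# A minimal simulation of oversampling: a "true" analog value of 512.3
# counts is repeatedly digitized with added random noise, and the noisy
# integer readings are averaged. Noise level and sample counts are arbitrary.

import random

TRUE_VALUE = 512.3   # the real signal, in ADC counts (hypothetical)

def read_adc() -> int:
    """One noisy 10-bit conversion: true value plus noise, rounded to an int."""
    noisy = TRUE_VALUE + random.gauss(0, 1.0)   # ~1 count of line noise
    return max(0, min(1023, round(noisy)))      # clamp to the 10-bit range

random.seed(42)
for n in (1, 4, 16, 64, 256):
    avg = sum(read_adc() for _ in range(n)) / n
    print(f"{n:4d} samples -> average {avg:8.3f}")
```

As a rule of thumb, each additional effective bit costs roughly four times as many samples. And note that the trick only works because the noise randomly dithers the signal across adjacent quantization levels: a perfectly clean signal would read 512 every time, and no amount of averaging would ever recover the missing 0.3.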

The point is that error itself, despite being bad, inadvertently conveys useful information.

Now, consider this claim by a Muslim on Twitter:

Jvnior

Christians have:

– Catholic Bible
– Protestant Bible
– Orthodox Bible
– 66 books vs 73 books vs 78 books
– Thousands of manuscript variants

Muslims have 1 Quran orally preserved.

Having just read about the engineering example, can you find the flaw in this Muslim’s thinking?

If you said, “this is just like oversampling,” then you are correct. The addition of so many flawed variants, each with numerous errors, quite often makes the end product much, much more accurate. In fact, in a number of seemingly paradoxical cases, we are more confident of the accuracy of the Bible because of the transmission errors.
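The same principle can be sketched in code. The toy below is in no way a model of real textual criticism; it merely shows how many independently corrupted copies of a text can, by a simple per-position majority vote, recover an original that few (if any) of the individual copies preserve intact:

```python
# Toy illustration of the manuscript analogy: many independently corrupted
# copies of a text, reconstructed by per-position majority vote. Real
# textual criticism is far more involved; this only shows how independent
# errors cancel out across many witnesses.

import random
from collections import Counter

ORIGINAL = "in the beginning was the word"

def corrupt(text: str, error_rate: float = 0.1) -> str:
    """Return a copy in which each character is randomly replaced at error_rate."""
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    return "".join(
        random.choice(alphabet) if random.random() < error_rate else c
        for c in text
    )

random.seed(1)
copies = [corrupt(ORIGINAL) for _ in range(25)]   # 25 flawed "manuscripts"

# Take the most common character at each position across all copies.
reconstructed = "".join(
    Counter(chars).most_common(1)[0][0] for chars in zip(*copies)
)

flawed = sum(c != ORIGINAL for c in copies)
print(f"{flawed} of {len(copies)} copies contain errors")
print("consensus matches original:", reconstructed == ORIGINAL)
```

With a 10% per-character error rate, nearly every individual copy is corrupted somewhere, yet the consensus almost always matches the original exactly. No single witness needs to be perfect for the collection to be.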

Aidan Mattis

The hilarious part is that he doesn’t seem to realize that this makes Christianity’s texts more reliable from a historic standpoint, and also contradicts the Islamic argument against the authenticity of the Gospels.

We have dozens of manuscripts which match one another in substance despite a total lack of standardization. Christians simply disagree on which books are canon.

We also know that there were Quranic manuscripts that didn’t match the current Islamic canon because they recorded the fact that Uthman burned them.

Muslims like to claim that since there are no Gospel manuscripts from the first century, they must have been passed down orally, and therefore the documents aren’t reliable; but Junior here just claimed that oral transmission makes the Quran more reliable.

Incredible degree of cognitive dissonance going on here.

I admit that this is highly non-intuitive, but so much about mathematics (and statistics in particular) is not intuitive. But what about the comment that inspired this article?

Bruce G. Charlton

I agree with your general point, but I do not have your faith in data.

As an ex-epidemiologist, I regarded most (nearly all) data wrt health and medicine to be *bad* data – and badly-interpreted (due to a combination of the poor quality – careerist – people in academia; and the perverse incentives relating to research, publication and status).

And bad data is worse than none, because it is actively misleading.

On these grounds, and given that I regard the problem of bad data overwhelmingly to be the norm, we are thrown back on personal knowledge and experience – that is, on anecdote.

The validity of anecdote depends on the honesty and competence of those providing anecdotes – we need to regard the source as a “good witness”; and that of course means that we need to be able to evaluate the person – which usually cannot be done without some degree of sustained personal interaction.

My problem with the anecdotal data on Manosphere sites is that either the people are very obviously bad witnesses (not honest, or not competent to know what they claim) or else I don’t know anything about them (often they are Anonymous or Pseudonymous!) and therefore must assume they are bad witnesses, who ought to be ignored…

Because a bad witness, like bad data, is actively misleading.

The problem is, of course, mostly with people. Data is quite often poorly interpreted by people who have perverse incentives to misinterpret and mislead, by people who are incompetent, and by people who are poor and dishonest witnesses. In a not insignificant number of cases, people are fabricating the data entirely.

The key, then, is recognizing the potential sources of error and, wherever possible, accounting for them. Consider what I wrote on the “Origin of Covid” back in January of 2021. I examined a February 2020 paper (and its various revisions) by T. Koyama, D. Platt & L. Parida: “Variant analysis of COVID-19 genomes.”

In the process, I saw exactly the kind of manipulation that Charlton warns about: “perverse incentives relating to research.” I even contacted one of the paper’s authors and got the kind of answer I would expect from someone who was being paid to come to a particular conclusion and was therefore unwilling to do anything to jeopardize it. Yet the data itself was incredibly useful. Damning, in fact.

It is astounding to me that, back in February of 2020, there was a paper that debunked many of the later narratives surrounding the origin, spread, and even the identity of SARS-CoV-2 and the incorrectly named disease COVID-19. Before most Americans had even heard of a ‘wet market’ or a ‘zoonotic crossover’—well before any lockdowns, interventions, or panic—those theories had already been discredited. And, more importantly, we knew why, despite all the erroneous information.

I could say much more and provide many different examples (e.g. here), but I think I have made my point well enough. Data may be subject to numerous problems, but it remains useful. Even completely fabricated data has a story to tell (albeit not the one you expected).

2 Comments

  1. professorGBFMtm

    One of the subjects I find difficult to convey is the rather unintuitive idea that objectively bad (or otherwise flawed) data is useful. Surprisingly, errors do not invalidate data, nor obligate us to throw it away.

    To illustrate this, I like to use an example from signal and data processing: analog-to-digital conversion. Say you have a 10-bit A-to-D converter.

    Conversion is why I’m so concerned about ”RPGenius” ”leaders” following in their Matrix idols’ footsteps, bras & panties:

    https://www.youtube.com/watch?v=bm88Dnb3vmg
    These brothers created Matrix… and now they are sisters!

    THAT puts a whole new spin on modern ”manosphere” langly trolls’ sayings & beliefs like ”wife-beating is CHRISTian”, ”@n@l is sanctifying” & ”or@l s@d@my is @merican, churchian ”good”, & r@dpilly” huh?

  2. bruce g charlton

    Your argument is essentially incorrect, because you have been misled by an example where the error is random. Random errors can be averaged – and therefore do not present much of a problem, provided the sample size is large enough.

    But in biology (and psychology and medicine) the errors in bad data are nearly always mainly systematic. In other words, the sample is biased – and typically the extent and nature of the bias is not known, and no statistical methods can eliminate it.

    This is why science requires control of interfering variables – to eliminate/minimize their effect before measurements are made – and sufficient control is mandatory – there are no statistical fudges or shortcuts.

    I make the argument here (and elsewhere) –

    https://charltonteaching.blogspot.com/2018/01/the-uses-and-abuses-of-meta-analysis.html

    https://academic.oup.com/qjmed/article-abstract/90/2/147/1612946

    I honestly do not think my point is up for debate, once it has been grasped.

    In biological sciences, as a strong generalization, bad data is useless/ misleading.
