People who analyze or manipulate data on a regular basis very quickly figure out how imperfect data can be. Scott Alexander has a great post about this. Take surveys for example. The data can be screwed up in many ways. People can misread and misinterpret questions; willfully lie in their answers, either to troll the survey-taker or because they’re embarrassed by their answer; or answer as their idealized self as opposed to their actual self. Usually, this doesn’t matter because if you collect enough information, you’ll generally be able to sort signal from noise and draw some conclusions. But, sometimes being off just a little can lead to mistaken beliefs that have dire consequences (see Brexit or Donald Trump’s election).
Now how does this relate to coronavirus? Well, take the CDC’s website for example. It tells you that on May 9, 2020, there were 1,274,036 cases of coronavirus and 77,034 deaths. The precise count leads you to think that the numbers are accurate, but that couldn’t be further from the truth. Though these numbers are prominently displayed, if you click on the “About the Data” footnote, it brings you to a section explaining that those numbers are based on “confirmed & probable cases”. Don’t let the word probable fool you. For a probable case to be reported, someone had to say, “this has not been confirmed by a laboratory test, but they displayed the symptoms for it.” This means that a large number of asymptomatic cases go uncounted, and who knows how many deaths. On a theoretical level, we know that the reported cases aren’t accurate, but we don’t really grasp how inaccurate they are. We probably think, “Okay, so obviously there aren’t exactly 1,274,036 cases and 77,034 deaths, but that means that there are probably around 1.3 million cases and 80k deaths.” But the reality of the situation is that those reported numbers can be off by a factor of two or three, or more. Instead of 1.3 million cases, there could be 2.6 million, or 3.9 million. Instead of 80k deaths, there could be 160k deaths, or 240k deaths. We just don’t know. With deaths especially, we won’t know the toll until months later, when we can compare how many people actually died against how many would normally be expected to die. And that’s the problem. You might say that those are just numbers I made up, and I would concede your point. But what’s important to remember is that the numbers reported by the CDC every day are the absolute floor. A year from now, we might discover that there were actually 160k or 240k deaths caused by coronavirus by May 9th, but we will definitely not discover that there were fewer than 77,034. Fog of war is dangerous indeed. Here, we mistake precision for accuracy.
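To make the arithmetic behind those made-up scenarios concrete, here is a minimal Python sketch. The reported figures are the CDC numbers quoted above; the "actual" figures are the same illustrative guesses from the paragraph, not estimates of anything, and `undercount_factor` is just a helper name chosen for this example.

```python
# Reported CDC figures for May 9, 2020, as quoted in the text.
reported_cases = 1_274_036
reported_deaths = 77_034

# Purely illustrative "what if" totals from the text (assumptions, not data).
hypothetical_cases = [2_600_000, 3_900_000]
hypothetical_deaths = [160_000, 240_000]


def undercount_factor(reported, actual):
    """How many times larger the actual total is than the reported one."""
    return actual / reported


for actual in hypothetical_cases:
    factor = undercount_factor(reported_cases, actual)
    print(f"cases:  {actual:>9,} would be {factor:.1f}x the reported count")

for actual in hypothetical_deaths:
    factor = undercount_factor(reported_deaths, actual)
    print(f"deaths: {actual:>9,} would be {factor:.1f}x the reported count")
```

Every digit of 1,274,036 looks authoritative, yet none of these scenarios would contradict it. The reported number only tells you the total is at least that large.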
Okay, so what’s the point of writing this article? What’s the punchline? Is it to say that the CDC’s data collection is useless? No, that’s not it at all. I’ll get to the punchline, but first I want to emphasize that it is important to collect this data, and that even though it’s inaccurate, it’s better than nothing. What I want people to understand, though, is that these numbers serve as a floor, and the true totals can be several times higher. So now the punchline.
There are two main points that I want to get across. The first is just how important it is for governments and individuals to act as soon as a single case has been reported. We already knew that coronavirus was a big problem in Wuhan, so when the first case was discovered in the U.S., we should have started taking precautions and creating contingency plans. To be clear, we shouldn’t have panicked at that time, and we shouldn’t panic now. But once the first case is officially reported, it should be clear to everyone that there are already many more cases in the U.S. Just because the CDC says that, as of a certain date, there is only one official case in the U.S. doesn’t mean that there is only one case in the U.S. I hope my previous sections make that clear. This virus, like other viruses, can remain asymptomatic for multiple days, so if we discover just one case, we have to think about not only all the people with symptoms who haven’t been identified yet, but also the asymptomatic carriers. The ship has sailed for this pandemic, but it’s a lesson we should learn for the future.
The second is to emphasize how inaccurate comparisons to the flu are. During the early stages of the coronavirus outbreak, people would compare the reported numbers to the annual flu numbers. The problem, as pointed out by an ER doctor in Scientific American, is that the flu statistics are estimates, not counted cases. He details how he has seen only one person die of the flu in all his years as an ER doctor, and how his colleagues have had similar experiences. What he ultimately realized was that the figure of 25,000 to 69,000 annual flu deaths cited by Trump is a CDC estimate, produced by multiplying reported numbers by a coefficient determined by an algorithm. That represents our best estimate for annual flu deaths, but comparing it to reported coronavirus deaths is irresponsible. Actual annual counts of flu deaths over the last six years range from 3,448 to 15,620, much lower than the estimates.
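A rough Python sketch of why the comparison is apples to oranges, using only the figures quoted in this article. Two caveats: the estimate range and the counted range are not paired year by year, so the ratios below are loose bounds at best, and the coronavirus figure is the CDC "confirmed & probable" total through May 9, 2020 rather than a full year. The variable and function names are made up for this example.

```python
# Figures quoted in this article.
flu_estimated_deaths = (25_000, 69_000)  # CDC modeled annual estimate (low, high)
flu_counted_deaths = (3_448, 15_620)     # counted annual flu deaths (low, high)
covid_reported_deaths = 77_034           # CDC reported total as of May 9, 2020


def ratio_bounds(numerator_range, denominator_range):
    """Loosest (low, high) bounds on numerator / denominator for two ranges."""
    num_low, num_high = numerator_range
    den_low, den_high = denominator_range
    return num_low / den_high, num_high / den_low


# How much larger the modeled flu estimates are than the counted flu deaths.
est_low, est_high = ratio_bounds(flu_estimated_deaths, flu_counted_deaths)
print(f"flu estimate vs. flu count: {est_low:.1f}x to {est_high:.1f}x")

# Like-for-like: counted coronavirus deaths against counted flu deaths.
covid_range = (covid_reported_deaths, covid_reported_deaths)
cmp_low, cmp_high = ratio_bounds(covid_range, flu_counted_deaths)
print(f"coronavirus count vs. flu count: {cmp_low:.1f}x to {cmp_high:.1f}x")
```

Either comparison is defensible as long as both sides are the same kind of number: count against count, or estimate against estimate. Setting a count against an estimate is what makes the flu look comparable.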
Ultimately, I believe that having data on coronavirus across the U.S. is helpful, but citizens should remember that those numbers are not a representation of reality, and they definitely should not compare those numbers, which are most likely undercounts as is, to estimates. The map is not the territory.