At a time when the amount of data available to researchers is vast, trust in science seems to be declining. Scientists created the tools now being used to discredit their work. Is there a connection between the way data is used and the erosion of trust in science? What can be done to combat disinformation? In his new book, Prof. Gary Smith shares insights and strategies on an important topic in today’s world.
You write in your latest book about threats to the credibility of science and scientists. What’s the root of the problem?
The three most important threats are disinformation, data torturing and data mining.
The first, disinformation, has been the subject of a lot of worried discussion. Distrust of elites in general and scientists in particular is spread and magnified by the internet swamp of fake stories, misleading videos and manipulated social media.
What examples of disinformation come to mind?
The anti-vaccination movement is a deadly example of disinformation spread far and wide through social media. For too many years, I have hoped that social media was a passing fad that would wither away as people came to recognize the thousands of hours they had wasted on gossip and lies and to resent being manipulated. The world seems to be moving away from that optimistic hope. ChatGPT and other large language models (LLMs) are going to amplify the firehose of falsehoods enormously. The only silver lining is that perhaps someday people will finally stop trusting the internet.
“Data torturing” sounds serious. What is it?
Researchers torture data when they do whatever it takes to obtain results that support their desired conclusions. They try different models, look at various subsets of the data and discard contradictory observations. For example, the prestigious British Medical Journal published a paper reporting that Chinese- and Japanese-Americans have abnormally high cardiac mortality on the fourth day of the month because they believe the number 4 is unlucky. To reach this preposterous conclusion, the authors looked at detailed heart-disease categories, some of which had an above-average number of fatalities on the fourth day of the month, while others had a below-average number. Overall, there was nothing special about Day 4. The authors only reported results for the categories with an above-average number of deaths.
All attempts to replicate the reported findings using fresh data have failed. This is an example of the replication crisis that is undermining the credibility of science. Findings are breathlessly reported but cannot be repeated.
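The selective-reporting step described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reconstruction of the published study: the 20 disease categories, the 2,800 deaths per category, the 28-day month and the function names are all hypothetical, and every death is assigned to a day purely at random, so there is no real Day 4 effect to find.

```python
import random

DAYS_IN_MONTH = 28  # simplifying assumption: every month has 28 days


def day4_counts(n_categories, deaths_per_category, seed=0):
    """Count Day 4 deaths per category when the day of death is pure chance."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_categories):
        days = [rng.randint(1, DAYS_IN_MONTH) for _ in range(deaths_per_category)]
        counts.append(days.count(4))
    return counts


def day4_ratio(counts, deaths_per_category):
    """Observed Day 4 deaths divided by the number expected by chance."""
    expected = len(counts) * deaths_per_category / DAYS_IN_MONTH
    return sum(counts) / expected


# Hypothetical setup: 20 disease categories, 2,800 deaths in each.
DEATHS = 2800
counts = day4_counts(n_categories=20, deaths_per_category=DEATHS)
expected_per_category = DEATHS / DAYS_IN_MONTH

# "Torture" the data: keep only the categories that happen to be above average.
cherry_picked = [c for c in counts if c > expected_per_category]

overall = day4_ratio(counts, DEATHS)
print(f"All categories:     Day 4 ratio = {overall:.3f}")
if cherry_picked:
    picked = day4_ratio(cherry_picked, DEATHS)
    print(f"Reported subset ({len(cherry_picked)}): Day 4 ratio = {picked:.3f}")
```

Across all categories the ratio stays close to 1, while the reported subset shows an excess on Day 4 by construction, even though the simulated data contain no effect at all.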
What’s the problem with “data mining”?
The scientific method begins with a research hypothesis and then uses data to test that theory (with no data torturing allowed!). Data mining reverses the process by ransacking data for a statistical pattern and then making up a theory after the pattern has been discovered. This is also called Hypothesizing After the Results are Known (HARKing). Sometimes, data miners argue that no reasons are needed: Correlation supersedes causation. The problem with data mining is that there are inevitably many statistical patterns in any reasonably sized set of data, and the vast majority of them are meaningless coincidences.
What is an example of HARKing?
A prestigious finance journal published a paper reporting correlations between bitcoin prices and 810 other variables, such as the Canadian dollar-U.S. dollar exchange rate, the price of crude oil, and stock returns in the automobile, book and beer industries. The authors reported finding that bitcoin returns were positively correlated with stock returns in the consumer goods and health care industries and negatively correlated with stock returns in the fabricated products and metal mining industries.
These correlations don’t make any sense, and the authors admitted that they had no idea why these data were correlated: “We don’t give explanations…. We just document this behavior.” A skeptic might ask: What is the point of documenting coincidental correlations? Attempts to replicate the reported findings with fresh data failed, another example of the replication crisis in science.
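A rough simulation shows why ransacking hundreds of variables is almost guaranteed to turn up "significant" correlations. The sketch below does not use the paper's data. It assumes 100 observations per series, fills a target series and 810 candidate series with independent random noise, and uses 0.197 as the approximate cutoff for a correlation to be significant at the 5 percent level with 100 observations. All names in the code are hypothetical.

```python
import math
import random

N_OBS = 100        # assumed number of observations per series
N_VARS = 810       # number of candidate variables, as in the example above
R_CRITICAL = 0.197 # approximate two-sided 5% cutoff for |r| when n = 100


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def noise(rng, n):
    """A series of independent standard normal draws (pure noise)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


rng = random.Random(42)
target = noise(rng, N_OBS)

# Data mining: correlate the target with every candidate and keep the "hits".
hits = []
for i in range(N_VARS):
    r = pearson(target, noise(rng, N_OBS))
    if abs(r) > R_CRITICAL:
        hits.append((i, r))

# Replication: redraw fresh noise for each hit and test it again.
replicated = 0
for _, r_old in hits:
    r_new = pearson(noise(rng, N_OBS), noise(rng, N_OBS))
    if abs(r_new) > R_CRITICAL and (r_new > 0) == (r_old > 0):
        replicated += 1

print(f"'Significant' correlations found in pure noise: {len(hits)} of {N_VARS}")
print(f"Still significant, same sign, with fresh data: {replicated}")
```

With a 5 percent cutoff, roughly 40 of the 810 noise series should pass the test by chance alone, and almost none of them should survive a second look with fresh data.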
How can scientists restore trust in science?
There are dozens of suggestions in my book. Here are a few:
- Battle disinformation bots by requiring that people who create internet accounts be verified through clear evidence of identity.
- Teach more (and better) courses on media and on quantitative and scientific literacy.
- Include serious discussion of data torturing and data mining in statistics courses in all disciplines.
- Encourage and reward replication studies. A replication study of an important paper might be a prerequisite for a Ph.D. or other degree in an empirical field.