Just how do we study effects?

Will this intervention enable more people to earn a living? Will fewer young offenders become repeat offenders as a result of the new approach? And what about possible negative consequences?

Medical and Social Science & Practice

The SBU newsletter presents and disseminates the results of the SBU reports, describes ongoing projects at the agency, informs about assessment projects at sister organisations, and promotes interest in scientific assessments and critical reviews of methods in health care and social services.

Illustration by Robert Nyberg

To judge whether or not an intervention is appropriate to put into practice, we would really like to know about its effects. And to obtain reliable answers, research studies must be meticulously planned and conducted. But even if this is the case, pitfalls may be encountered, including the tricks that chance plays on us.

When evaluating effects, researchers and authorities such as SBU have a penchant for experimental studies. In these, outcomes for those who receive a particular intervention are compared with outcomes for those who receive a control intervention, such as customary care, which is to say the intervention commonly provided. Ideally, participants should be divided into groups based on random selection. Drawing conclusions about effects can still be attempted even when participants have not been randomly assigned to the compared groups, but this requires careful management of potential confounders that may skew the results. Otherwise there is a risk of comparing apples to oranges.

Random effects – a problem with repeated tests

Researchers may have more or less good reasons to assume that a particular treatment or intervention really does have a particular effect before a study is even conducted. In the pharmaceutical world, it is not uncommon to randomly test a large number of candidate drugs in order to select those that are worth investigating more thoroughly in clinical trials. But running many statistical tests increases the risk of drawing incorrect conclusions as a result of chance outcomes. This problem is less of a concern in social work, where interventions are often complex and researchers need to formulate their research question based on different premises. From a statistical standpoint, however, it is not the research question itself that is tested, i.e. whether the intervention in question actually has an effect. Instead, one takes the opposite approach: the researcher makes the hypothetical assumption that the intervention has no effect (assumes that the “null hypothesis” is correct). The researcher then analyzes to what extent the observations in the experiment contradict the null hypothesis.
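One way to make this logic concrete is a permutation test, sketched below under stated assumptions. The outcome scores are invented and the function name is hypothetical; this is an illustration of the reasoning, not a method prescribed by SBU. If the intervention had no effect, the group labels would be interchangeable, so the code shuffles them repeatedly and counts how often chance alone produces a difference at least as large as the one observed.

```python
import random
import statistics


def permutation_test(treated, control, n_shuffles=10_000, seed=0):
    """Return the observed mean difference and the share of label
    shuffles that give a difference at least as large in magnitude."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = statistics.mean(treated) - statistics.mean(control)

    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0

    for _ in range(n_shuffles):
        # Under the null hypothesis the labels carry no information,
        # so any reshuffling of them is equally plausible.
        rng.shuffle(pooled)
        diff = (statistics.mean(pooled[:n_treated])
                - statistics.mean(pooled[n_treated:]))
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1

    return observed, at_least_as_extreme / n_shuffles


# Invented outcome scores, for illustration only.
treated = [12, 15, 14, 10, 13, 16, 11, 14]
control = [10, 12, 11, 9, 13, 10, 12, 11]

observed, share = permutation_test(treated, control)
print(f"observed difference: {observed}")
print(f"share of shuffles at least as extreme: {share}")
```

A small share means the observations sit uneasily with the assumption of no effect; a large share means chance alone could easily explain them.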

P-value – popular, much sought after, but also questioned

Illustration by Robert Nyberg

One result from such analyses is the “p-value.” A p-value is a measure of how unlikely the results are, given that the null hypothesis is indeed correct: the probability of obtaining results at least as extreme as those observed if the intervention had no effect. Researchers are usually overjoyed to end up with a low p-value because it may be a sign that they are hot on the trail of something important, but also that they have a good chance of getting their results published in a scientific journal. But p-values are also controversial for various reasons; for example, they hold such high value for researchers that they risk overshadowing the importance of the research question. “Trawling” research data for low p-values (sometimes called p-hacking, data dredging, or data mining) is one of the more serious sins of research. For every statistical test that is conducted, there is a small risk that chance is playing a trick. If researchers run enough variations on a test, there is a good chance of ultimately obtaining a statistically significant result, even if the intervention being tested actually has no effect.
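The arithmetic behind this can be sketched in a few lines. The example assumes that the tests are independent of one another and that each uses the conventional significance threshold of 0.05; neither assumption comes from the article, and the function name is hypothetical. If no intervention has any effect, the chance that at least one of n tests comes out "significant" is 1 - (1 - 0.05)^n.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability of at least one 'significant' result among n_tests
    independent tests when every null hypothesis is true.

    alpha is the per-test significance threshold (assumed 0.05 here).
    """
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20, 100):
    print(f"{n:>3} tests: {chance_of_false_positive(n):.2f}")
```

Under these assumptions, twenty tests already give a chance of roughly 64% of at least one spurious finding. Real analyses are rarely fully independent, so the exact figure will differ, but the direction is the same.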

Waiting to formulate a hypothesis until after the data have already been analyzed is sometimes referred to as Hypothesizing After the Results Are Known, or HARKing. In such cases, researchers conduct various analyses and massage the data until they find a result of interest, as reflected by a low p-value. Only in the aftermath of this process is the hypothesis formulated to explain the findings, exactly the opposite of good research practices.

The HARKing phenomenon does not seem to be particularly rare. When researchers in various fields were asked to do some soul-searching, an average of 43% admitted to HARKing at some point in their research careers (1).

It is difficult to control what researchers do behind closed doors, but one way to counteract trawling for low p-values is to require researchers to publish, in advance, a protocol in which they describe the research question and the statistical analyses they plan to carry out (2). Such pre-published protocols can be used for control purposes if necessary and can almost be considered a stamp of quality in themselves.

Size does matter, after all

Over the past decade, p-values, statistical significance and the black-and-white view of research findings that easily materializes in their wake have been widely discussed in the scientific literature. In part, this is because low p-values have not proved to be as reliable and replicable as expected (3), but also because the concepts are often misunderstood (4, 5).

A growing number of proponents are now advocating less emphasis on the importance of p-values and instead recommending that results be presented in such a way that the magnitude of the effect becomes apparent, including the associated margin of error, as reflected by the confidence interval. The reporting of effect results expressed as confidence intervals has become increasingly common in the research literature (6). The width of the confidence interval reflects the uncertainty concerning the magnitude of the average effect. In practice, the interval covers the effect sizes that, from a statistical standpoint, are not contradicted by the data analyzed. Should the confidence interval be extremely wide, it will fail to provide meaningful insight, since no conclusions can then be drawn as to whether the effect even exists and, if so, whether or not it is beneficial. But if the interval is narrow, it provides intuitively understandable information that becomes important when deciding whether or not to recommend an intervention.
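As a rough illustration of how the width of a confidence interval reflects uncertainty, the sketch below computes an approximate 95% interval for a difference in group means. The outcome scores are invented, the function name is hypothetical, and the normal approximation (the multiplier 1.96) is an assumption; with groups this small, a t-based interval would be somewhat wider.

```python
import math
import statistics


def mean_difference_ci(treated, control, z=1.96):
    """Approximate 95% confidence interval for the difference in means.

    Uses a normal approximation (z = 1.96), which is an assumption of
    this sketch. Returns (lower, difference, upper).
    """
    diff = statistics.mean(treated) - statistics.mean(control)
    # Standard error of the difference between two independent means.
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return diff - z * se, diff, diff + z * se


# Invented outcome scores, for illustration only.
treated = [12, 15, 14, 10, 13, 16, 11, 14]
control = [10, 12, 11, 9, 13, 10, 12, 11]

# A small study, and the same pattern of scores in a study ten times larger.
for label, t, c in [("small study", treated, control),
                    ("large study", treated * 10, control * 10)]:
    lower, diff, upper = mean_difference_ci(t, c)
    print(f"{label}: difference {diff:.2f}, "
          f"interval {lower:.2f} to {upper:.2f}")
```

Both studies show the same average difference, but the larger one yields a narrower interval, which is what makes its result more informative for a decision.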

Then the question becomes how small or large an effect should be in order to be considered relevant in practice. This question cannot be answered by statistics, and instead depends on context and the value ascribed to the effect. Only people are able to make such a judgment call. [PL]


1. Rubin, M. (2017). When Does HARKing Hurt? Identifying When Different Types of Undisclosed Post Hoc Hypothesizing Harm Scientific Progress. Review of General Psychology, 21(4), 308–320.
2. Chalmers I, Altman DG. How can medical journals help prevent poor medical research? Some opportunities presented by electronic publishing. Lancet. 1999 Feb 6;353(9151):490-3.
3. Open Science Collaboration. PSYCHOLOGY. Estimating the reproducibility of psychological science. Science. 2015 Aug 28;349(6251): aac4716.
4. Nuzzo, R. Scientific method: Statistical errors. Nature 506, 150–152 (2014).
5. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016 Apr;31(4):337-50.
6. Stang A, Deckert M, Poole C, Rothman KJ. Statistical inference in abstracts of major medical and epidemiology journals 1975-2014: a systematic review. Eur J Epidemiol. 2017 Jan;32(1):21-29.
