Friday, 20 August 2021

Data from Cadegiani et al contains unexplained patterns.

Author's note: I became aware at the very end of drafting the post below that the preprint has now been published as a paper. Unfortunately, the reference to the dataset's availability and the link to it were removed between preprint and paper, presumably due to word limits and journal requirements. However, the data are still available at the link given in the preprint.



The paper reports a slightly unusual research design. It appears to describe a classical two-arm observational study of patients stratified by treatment, with a third arm generated from "precise estimation" based on PubMed articles and government statements.

The authors, to their credit, uploaded their data to OSF here.

I laud this transparency, but the data are concerning. Certain features of the data are not consistent with having arisen from either an interventional or observational study.

When I examine studies I generally divide the data into:
  • Dichotomous data
  • Other categorical data
  • Interval (numeric) data.
In this case I am most interested in the dichotomous baseline data. Dichotomous data are often the less-loved cousin for integrity checking, but certain statistical features are generally preserved in genuinely collected study data and are less likely to arise when data are generated by other processes. One of those features is the independence of observations for a given baseline variable.

Assessing independence of observations for dichotomous measures in general (not this study):

Imagine one of the questions we ask of every participant in a theoretical study is "have you ever had a stroke?". The answer to this question will not be independent of the other answers the same patient gave. Patients with a history of hypertension or smoking will be more likely to have had a stroke. But, in most cases (and we'll get to the exception in a moment) the risk of one participant having had a stroke will not be determined by whether the patient before them had a stroke.

This leads to an interesting feature of truly random data. By knowing the frequency of a dichotomous data point we can predict how often we will see runs of two or three or four of the same result in a row. Humans, when attempting to emulate randomness, generally do not create enough clumps and tend to substitute an even distribution for a random one.

Let's take a worked example.

Imagine that 50% of our sample of 1000 participants have had a stroke. 

Out of the 999 possible consecutive pairs of patients we would expect 25% of pairs in which neither participant has had a stroke (the probability of no stroke, squared), 25% of pairs in which both participants have had a stroke (the probability of stroke, squared), and 50% of pairs in which exactly one participant has had a stroke (1 minus the probability of a congruent pair).

These expected pairs are calculable for any dichotomous outcome where you have individual patient level data.
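
As a concrete illustration, here is a minimal R sketch of the worked example above (the numbers are the hypothetical 1000-participant stroke example, not study data):

```r
# Expected counts of consecutive pairs for a dichotomous variable,
# using the hypothetical stroke example above.
n <- 1000          # participants
p <- 0.5           # proportion with a history of stroke
pairs <- n - 1     # number of consecutive pairs (999)

expected_both_stroke    <- pairs * p^2          # both members of the pair had a stroke
expected_neither_stroke <- pairs * (1 - p)^2    # neither member had a stroke
expected_discordant     <- pairs - expected_both_stroke - expected_neither_stroke

c(both = expected_both_stroke,
  neither = expected_neither_stroke,
  one_of_each = expected_discordant)
# roughly 250, 250 and 500 of the 999 pairs
```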

I have posted analyses like this previously; for example, this was one thing I checked closely in Dr Eduardo Lopez-Medina's study. Below is the expected number of consecutive positive pairs vs the observed positive pairs for Dr Lopez-Medina's trial. As you can see there aren't any concerns here: pairs appear roughly as frequently as you would expect. Some counts are slightly lower, some are slightly higher, but they don't deviate massively and there is no systematic bias towards more or fewer pairs than expected.




At large numbers of observations the frequency of pairing is very reliable with really quite tight expected ranges. For truly independent observations the data should always look something like this.

However, as I indicated earlier, there are exceptions. Higher than expected numbers of pairs indicate a positive correlation and can occur in particular where the frequency of a variable changes over the course of recruitment, or differs between undisclosed consecutive subgroups. For example, if the rate of prior covid infection in the community increased dramatically over the course of recruitment for our theoretical study, we would see more pairs than expected because the "1"s would be clumped towards one end of the data set. Similarly, if particular clinics or sites are reported in the data as consecutive blocks but not disclosed, and those sites have different rates of the baseline variable, pairs will again be higher than expected because the "1"s will be clumped together.

So there are a number of plausible situations in which positive correlation between consecutive patients could arise, i.e. where one patient having a feature makes patients after them also more likely to have that feature.

What is much harder to explain is a negative correlation resulting in fewer than expected pairs. This reflects a preference for alternation over clumping, where a variable is spread more evenly through a data set than we would expect. The most obvious cause in an observational study is that the data were synthesised rather than collected. There are other, more far-fetched situations in which this could occur in an observational study, but none apply here. There is really no obvious reason why one patient having had a stroke in our study should make the next patient significantly less likely to have had a stroke.


Now let's apply this analysis to this study.

So what do we see in the Cadegiani et al data set?

The data set reports a large number of dichotomous variables, in fact columns AG to DP are all dichotomous baseline variables, covering pre-existing conditions and prior medication usage. So, I extracted all of these for the first arm of 192 participants.

When we map the number of observed pairs vs expected pairs for all these variables in which at least one pair was expected we get a graph that looks very different:


These 22 variables for which at least 1 pair was expected have been arranged from the lowest number of expected pairs to the highest. What you can see is that, across the board, there are far fewer pairs than expected: 21 out of 22 variables have fewer observed pairs than expected, and for many of them the differences are massive.

It can be hard to appreciate how incredibly unlikely these deviations from expectation are, so let's pick one variable out and look at it a few different ways.

Focus on Physical Activity - An example

The variable "regular physical activity" was the last column we extracted in our data set and had a big juicy difference so let's choose that.

There were 56 participants (29%) who were physically active and 136 who were not. Based on this we would expect slightly more than 16 pairs of consecutive physically active participants in the data set but we only saw 2. This is a big difference!

One way to think of this is to answer two questions:

  •  "if a participant is physically active what are the chances the next participant will be physically active?"
  •  "if a participant isn't physically active what are the chances the next participant will be physically active?"
If consecutive observations are independent, as we would expect, the two answers should be roughly the same.

In this study:
  • of 56 physically active participants 2 were followed by another physically active participant
  • of 135 not physically active participants, 53 were followed by a physically active participant
So if the person above you in the table is active, you only have a 3.5% chance of also being active. On the other hand, if the person above you is not active, your chance of being active goes up more than tenfold to 39%!
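
Here is a minimal R sketch of that comparison, using only the counts reported above (the variable names are mine, for illustration):

```r
# Chance of the *next* participant being physically active, conditional on the
# current participant's status, from the counts above.
active_then_active   <- 2     # active participants followed by another active participant
active_total         <- 56    # active participants with a following participant
inactive_then_active <- 53    # inactive participants followed by an active participant
inactive_total       <- 135   # inactive participants with a following participant

p_active_after_active   <- active_then_active / active_total       # ~0.036 (3.5%)
p_active_after_inactive <- inactive_then_active / inactive_total   # ~0.393 (39%)

p_active_after_inactive / p_active_after_active   # roughly an 11-fold difference
```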

This is a negative correlation, not a positive one, so it can't easily be explained by non-random ordering of the list from concatenating study sites or by a change in frequency over the course of the study.

If we construct this as a contingency table:

                                     Next participant active    Next participant not active
  Previous participant active                   2                           54
  Previous participant not active              53                           82

If you are a participant in this trial and the participant before you was sedentary, the chance of you being active is 11 times higher!

It is very, very hard to imagine an innocent explanation for a significant negative correlation between consecutive variables in a study of this nature, let alone an effect as truly massive as this. 

But how unlikely is this? Could this just be random chance? I can think of three ways to answer this question.

Firstly, we could simply do a chi-square test for independence on the contingency table above, to test whether a participant's value is independent of the following participant's value. This gives us a p value of 0.0000006, putting the probability of a deviation of this order arising by chance, if the observations were actually independent, at roughly 6 per ten million.
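
As a sketch, the same test can be run in R on the contingency table above (note that chisq.test() applies a continuity correction to 2x2 tables unless told otherwise):

```r
# Chi-square test of independence between a participant's activity status and
# the following participant's status, using the contingency table above.
pairs_table <- matrix(c(2, 54,    # previous participant active:     next active / next not active
                        53, 82),  # previous participant not active: next active / next not active
                      nrow = 2, byrow = TRUE)
chisq.test(pairs_table, correct = FALSE)
# gives a p value of the order of 10^-7, consistent with the figure quoted above
```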

A second way to think about this is as a binomial probability problem: "for a chance of each pair being positive of ~0.08507 and 191 trials (n-1), what is the probability of 2 or fewer successes?" Plugging this into a binomial probability calculator gives a cumulative probability of 2 or fewer pairs of 0.000007.
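
Assuming the same inputs, this is roughly the calculation in R:

```r
# Cumulative binomial probability of 2 or fewer positive pairs, treating each of
# the 191 consecutive pairs as a "trial" with success probability (56/192)^2.
p_pair <- (56 / 192)^2          # ~0.08507
pbinom(2, size = 191, prob = p_pair)
# of the order of 7e-06, i.e. the 0.000007 quoted above
```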

I would argue this is an overestimate though. The binomial probability calculator answers the question of likelihood for an underlying frequency of participants who exercise, but we are really interested in the probability for a fixed observed frequency. 

So the third way is to simulate these results in R: take 56 positive and 136 negative values, shuffle them randomly, and count the number of consecutive positive pairs. I did this 5,000,000 times, and not one simulation resulted in 2 pairs or fewer (0.00014% of shuffles resulted in 3 pairs, though).
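
For anyone who wants to run it themselves, here is a minimal sketch of that kind of shuffling simulation (not the exact script used for this post, and with far fewer iterations than the 5,000,000 quoted):

```r
# Shuffle 56 "active" and 136 "not active" participants and count how often
# consecutive positive pairs occur.
set.seed(1)
activity <- c(rep(1, 56), rep(0, 136))

count_pairs <- function(x) sum(x[-length(x)] == 1 & x[-1] == 1)

simulated_pairs <- replicate(100000, count_pairs(sample(activity)))

mean(simulated_pairs)        # sits near the expected ~16 pairs
mean(simulated_pairs <= 2)   # proportion of shuffles with 2 or fewer pairs
```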

When I have more time I will do a higher number of simulations, but the probability of such an enormous deviation on this variable alone is clearly not much higher than a handful in 10 million.

Now consider that 21 out of 22 variables had fewer than expected pairs.

Quite clearly these results taken as a whole are at least at the billions to one level of unlikeliness to occur by chance.

Because the correlation is negative, that is, the data are MORE evenly spread throughout the sheet than would plausibly arise by chance, this isn't something that unrecognised subgroups, unrecognised ordering by some unincluded variable, or a changing frequency over time alone could explain.

The Authors' Take.

I have spoken to two of the authors separately by videoconference today, Dr Cadegiani and one of his American collaborators.

Both disagreed strongly with the proposition that the data were potentially fraudulent.

However, neither was able to provide any explanation for these irregularities other than:
  • that the design of the study was not randomised;
  • that the data may have been rearranged from the order of collection based on some other variable;
  • that the data may concatenate subgroups based on day or clinic that may differ in the frequency of observations.
I expressed to both authors that I did not accept any of these as an explanation, as all of them would be expected to increase the number of observed pairs and produce a less even distribution of observations, not the excessively even distribution we have here; and that, in the absence of some specific driver of negative correlation, I consider these irregularities to be unexplained.

To Dr Cadegiani's credit, he did offer that I could go to Brazil and examine the original forms. However, given the extremity of the irregularities and the lack of a reasonable explanation, even producing a complete set of paper forms would not, in my opinion, allay the concerns arising from these data.

My Take.

This paper contains a distribution of numbers that is not what we would expect from genuinely collected observational data. These deviations are so large and so incredibly unlikely that we can safely dismiss the possibility that they arose by chance. The two potential explanations offered to me (undisclosed subgroups, rearrangement of the data by some variable not included in the data sheet) could not produce these deviations.

Unless the study records reveal an explicit explanation for these irregularities (and I can't currently imagine any that would explain such consistent and extreme non-independence of consecutive measures in the form of negative correlation), the findings cannot be accepted, even if a complete set of paper forms is produced.

This is a pattern that could arise from data that were generated rather than collected.

I would like to be explicit that I am not making an allegation of fraud against any specific author or their associated entities. Even where irregularities arise in data sets with multiple authors that cannot ultimately be explained, it is not usually reasonable to draw negative inference against all the authors involved. Authors are entitled to trust their collaborators, and researchers their employees.

If no convincing reason for these extreme abnormalities can be provided this paper should be retracted, ideally by the authors themselves rather than a forced retraction by the journal.

5 comments:

  1. Like reading a work of data detective fiction - except without the fiction part. Fascinating & awesome.

  2. excellent analysis. nice example of the power of math and statistics. journals really should employ (and pay) people like you, as normal peer-review rarely catches data fraud, and the pandemic has shown how big of a problem this really is, even and especially in top journals like lancet and nejm.

  3. Regular physical activity is a dichotomous variable? No wonder that part of the data gathering is very likely useless junk.

    Minor or major imperfections in proof notwithstanding if there is a cheap, safe drug like ivermectin that has even a small potential beneficial effect, I would rather take that than sit at home with no treatment at all, waiting to see if I get sick enough to go to undergo the extreme desperate treatment in hospital. Maybe the placebo effect itself will help.
