Sunday, 19 December 2021

A published claim of data fabrication in a Molnupiravir trial is invalid.

 Recently "Trial Site News" published a claim that a published trial of molnupiravir in NEJM "contains fabricated data".

This is obviously a massive accusation. I believe that exposing fraud in research is a valid and worthwhile undertaking, but I do not believe the authors have adequately described their evidence for it, and their claims collapse under scrutiny.

The trial is here.

The article claiming the data is fabricated is here.

The accusation.

The accusations by the author are short and not well described: "The rates of virus clearance from day 10 to 29 are too similar between the Molnupiravir and Placebo arms. The differences are too small compared with the large standard deviations. The standard deviations are too similar, too."

The authors don't state their assumptions, the statistical test applied, or the results of that test, nor do they define any threshold of significance or acceptable false discovery rate (FDR).

The impugned supplementary table looks like this.  



Just eyeballing this I am unimpressed. The data don't look more similar to me than I'd expect given the large arm sizes. I wouldn't glance at this table and think "let's test that", but that shouldn't lead us to dismiss the claim out of hand.

I choose to interpret these claims as "the observed point changes in viral load (at each time point) for the two arms are more similar than would arise by chance, for the given n, assuming no underlying true effect on the change in viral load"

and "the standard deviations of the point changes in viral load (at each time point) between the groups are more similar than expected by chance, for the given n, assuming the two cohorts are randomly sampled from a single population".


Is the accusation logical?

Before getting into the nitty gritty of whether these claims are borne out, we should ask ourselves if they are well conceived. That is, if proven, would they constitute evidence that the data were not experimentally derived?

In my opinion, even that is very dicey.

Assume we tested the two cohorts and found that their standard deviations were so close that we got a p value of 0.9999999999999999; that is, fewer than 1 in 1 billion trials would be expected to generate such similar standard deviations (assume that we have adequate precision, and that the significance couldn't be changed by correction for multiple testing). Would this be consistent with the claimed experimental method (in the original NEJM paper), and/or is there a readily available innocent explanation? Quite possibly.

The original study did not use simple randomisation; instead, participants were stratified based on time since symptom onset to balance the numbers in each arm. Because of this we expect the arms to be more similar in their spread of participants by delay to treatment than would occur by chance alone. If the rate of fall in viral load is in any way even partially dependent on time from symptom onset (reasonably possible), then it's quite possible that the groups will be more similar than chance alone would dictate, both in mean change and in variance/standard deviation.

So this accusation is a bit of a non-starter.

Accusations of fraud require a deep understanding of study design, including the randomisation process used, to establish which assumptions about experimentally derived data will be valid. That does not seem to have occurred in this case.


Is the accusation accurate?

Identifying excessive similarity in the rate of change across a set of serial measurements of two cohorts, with progressive dropout between measures, from summary data alone (even if it were a valid sign of fraud, which it is not in this case) is non-trivial. My first thought is that, as a quick sanity check, you could assume the change is normally distributed, treat it as a point measure, and do a Student's t-test between groups.
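Here is a minimal sketch of that kind of check in R, using Welch's unequal-variance version of the t-test computed directly from summary statistics (mean, SD, n), which is all a table like this provides. The numbers plugged in at the bottom are placeholders for illustration, not values taken from the trial's table.

```r
# Welch's t-test from summary statistics alone (mean, SD, n per group)
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  se2 <- s1^2 / n1 + s2^2 / n2
  t   <- (m1 - m2) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom
  df  <- se2^2 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))
  2 * pt(-abs(t), df)                    # two-sided p value
}

# Placeholder values only (not the trial's actual data):
# change from baseline of -1.9 (SD 1.2, n = 340) vs -1.8 (SD 1.3, n = 350)
welch_from_summary(-1.9, 1.2, 340, -1.8, 1.3, 350)
```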

Let's focus on the second accusation, that the standard deviations of those measures are excessively similar. We can, of course, check the similarity of variance using Levene's F statistic, again treating the change from baseline at time (x) as a single point measure. Let's do that and see what we get:



There is no measure in the time frame given (for the full study or any subgroup) which is even close to suspicious. Even the most similar standard deviations of the 9 measures would be expected to turn up in about 1 in 5 comparisons where there is no underlying difference in variance/SD. I am unsure why the claim of over-similarity was made without any apparent testing of any kind.
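If you want to run a quick check of the spread yourself from summary data alone, a simple variance-ratio F test can be computed from just the SDs and group sizes. This is a stand-in for illustration (it assumes approximate normality and is not the Levene's statistic referred to above), and the inputs are placeholders rather than the trial's values.

```r
# Two-sided F test of equal variances from summary SDs and group sizes
f_test_from_summary <- function(s1, n1, s2, n2) {
  f <- s1^2 / s2^2
  p <- 2 * min(pf(f, n1 - 1, n2 - 1),
               pf(f, n1 - 1, n2 - 1, lower.tail = FALSE))
  min(p, 1)                              # two-sided p value, capped at 1
}

# Placeholder SDs and group sizes (not the trial's actual data)
f_test_from_summary(1.2, 340, 1.3, 350)
```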

Perhaps the authors performed some other test whose method and results they simply failed to describe. This seems unlikely, though, and I can't think what test that could possibly be.

Summary and Implications

This was a claim of fraud that was not well defined, that used absolute language ("contains fabricated data") in a way which was not defensible, that would be illogical and non-probative even if it were not mathematically incorrect, and that, when formally tested, doesn't give even the barest signal of suspicion. It collapses under 5 minutes of scrutiny.

Not all accusations of fraud are valid, and it was supremely unwise for this publication to publish such a thinly evidenced and badly described allegation. 

Remember, when examining data for signs of fraud, think about what assumptions can be validly inferred from the claimed experimental design, how you can test those assumptions, what statistical thresholds are required to reject those assumptions, and what explanations other than fraud might account for any such deviations. Usually for our group this takes many months and passes through at least three or four pairs of hands before we go public.

This is not a game and throwing around false accusations of such low quality as this is unprofessional. 

There's not even a case to answer here. This accusation should never have been published.

Tuesday, 26 October 2021

Silence in Science

Yesterday Liam Mannix from the SMH and the Age examined the track record of The Burnet Institute in modelling Covid cases and deaths around Australia. The biggest takeaway for me was not the assessment itself, but the near complete lack of independent expert comment. Only people directly affiliated with the institute were willing to go on the record about its work.

It was Australia’s most prominent science journalist, in the two biggest newspapers in the country, on the most topical field in science, examining the highest profile work in the field. And not one scientist outside the institute was willing to provide a single comment on the record about their work, positive or negative. It's not that journalists aren't seeking comment. The Age journalist contacted seven different academics for comment.

It was a failure by Australian science and Australian academia to discharge one of its core functions: scrutiny of scientific claims. And it wasn’t an isolated event, it fitted into a broader pattern of failure.

Convenient and simple explanations abound. The most simplistic one I see claimed is that "good" scientists are all cowed by powerful elites who will enact retribution and end careers, usually with some tenuous allusion to climate science and Peter Ridd. I don't think that's a sufficient explanation.

I'm not saying there isn't a grain of truth. The perception of career damage from challenging the work of senior people is virtually ubiquitous. When I recently criticized a mask study, I was flooded with private warnings about "ending my career". Whether such consequences actually eventuate is another matter entirely. I also received many private confirmations from prominent and very senior academics saying that they agreed with my assessment but wouldn't comment publicly.

Why they felt I needed to hear their assessments but the public did not is unclear to me.

But it doesn’t explain why nobody would comment supportively on the Burnet Institute’s output either. Perhaps no academics agree with them and they are isolated, but that seems quite unlikely to me.

I think it’s more likely that there are other reasons why academics don’t comment publicly. Covid management, and anything that feeds into that, is incredibly controversial. A coarsening of public debate has normalized and legitimized abusive public comments and attacks on any expert, regardless of what they say. Allegations of immorality, bad faith, and character assassination outnumber polite criticisms two to one. Those who support more interventions are accused of seeking the death of children through suicide, those who support less interventions are accused of seeking the death of children through Covid. It’s not a public environment that will lead to more knowledgeable experts choosing to participate.

When academics weigh up whether to weigh in on the issue of the day, there's not a lot weighing on the "pro" side either. Public commentary is not rewarded, formally or informally. While most agree public criticism might damage your career, nobody seriously thinks it can boost it. Where experts do comment on the record, this isn't recognized for the public service that it is. Every commentator on the entire spectrum from Coatsworth to OzSAGE has been accused publicly and privately of "self-promotion", as though public commentary is some unseemly lower-class trait, unbefitting the restrained and noble world of academia. The academy doesn't seem to value public comment by academics any more than the public do.

But this private weighing-up by academics occurs in two very public contexts.  

The first is that science journalism doesn't always create space for science criticism in non-sensationalized ways. The amorphous field of "scicomm" has blurred the boundaries between science journalism and science PR. To co-opt Geoffrey Blainey, we are still very much in a "three cheers" phase for how we talk about science. Excellent findings, "miracle" drugs and "breakthroughs" lead nightly TV news bulletins around Australia, but I can't remember the last retraction to feature. Where criticism is published it is often only so it can be co-opted into political attacks. I recently did one TV interview about my criticism of a mask study and the entire introduction was an attack on Daniel Andrews, who needless to say was not an author on the study.

The second context is the genuine, and not merely perceived, precariousness of academics. The extreme risk aversion doesn't develop in isolation. Almost anybody involved in academia at a university is either casualized or on a fixed-term contract. Industrial protections against recriminations are essentially meaningless if you can simply not be employed next year without going through any disciplinary process at all and without any right to review. It's not an environment that encourages people to rock the boat.

I don’t think this is a problem that can be solved by individual academics making better choices without addressing these two important contexts.

Whatever the cause or mix of causes, we have an environment where public discussion of important scientific work increasingly occurs without other scientists actually being involved. And we are all poorer for it.

Friday, 8 October 2021

Data from Niaee et al is not consistent with a genuine randomised controlled trial

What is a Randomised Controlled Trial (RCT) and why do we do them?

Imagine a local celebrity launches a new supplement called ABSORBPRO that helps with the absorption of vitamins from food, so that for the same food intake you get more vitamins. You want to know if it works. You measure the levels of a certain vitamin in 100 people who take ABSORBPRO, and in 100 people who don't. 

The average level in people taking ABSORBPRO is 20, and in people not taking it the average is 10. Is this enough to conclude that it works? Not really. There are lots of reasons why we might see these effects even if the supplement does nothing at all.

People taking the supplement might be wealthier and have a better diet. They might be more health conscious and so pick better foods. They might be less likely to be shift workers. They might be younger: if the people taking the supplement are 30 years old and the people not taking it are 70, you'd expect different levels. There might be different numbers of males and females, smokers and non-smokers; the list goes on.

To decide whether a drug works we have to see whether outcomes are different when it is used compared to when it is not used in people drawn from the same population. We can attempt to do this by deliberately balancing as many covariates (things like all the characteristics described above) as possible but this is very imperfect and there's always the possibility of some that you didn't measure.

The best way to ensure that your patients in each group are sampled from the same population is to... well... sample them from the same population. RCTs define a single group of people that might benefit from a treatment, recruit them into the study then decide which treatment they get randomly. Unlike trying to select similar patient groups in people already taking the drug, this method works equally well for important covariates you haven't thought of in advance.

Randomisation doesn't mean that the groups will be exactly equal on all covariates, but it does mean there won't be a systematic underlying bias towards one group or the other, because they are by definition sampled from the same population.

If a trial is genuine and not fraudulent this allows us to make certain testable assumptions about the groups. For any given continuous variable:

  • the groups were sampled from populations with identical means
  • the groups were sampled from populations with identical variance.

Note this isn't quite the same as saying that the groups have identical means or variances.
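As a minimal illustration of how those assumptions can be tested when individual participant data are available, here is a sketch in R on simulated data, where two arms really are drawn from the same population. The variable and values are made up; the point is just the pair of standard tests.

```r
set.seed(1)
# Two arms genuinely sampled from the same population (e.g. a baseline blood pressure)
arm_a <- rnorm(90, mean = 120, sd = 15)
arm_b <- rnorm(90, mean = 120, sd = 15)

t.test(arm_a, arm_b)     # tests "sampled from populations with identical means"
var.test(arm_a, arm_b)   # tests "sampled from populations with identical variance"
```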

The Niaee et al Trial

Dr Niaee and colleagues claim to have randomised 180 patients into six different treatment groups. I do not believe this claim is true.

The six groups have 30 patients each. Two groups did not receive ivermectin (one received a placebo and one did not). Four groups received ivermectin at different doses and frequencies.

Traditionally, in Table 1 of an RCT, authors describe certain characteristics of the patients in each group. Something immediately caught my eye: the number of participants in each arm who had not actually tested positive for the virus was wildly different:

Control Group: 40%
Placebo Group: 53%
Ivermectin Group 1: 23%
Ivermectin Group 2: 23%
Ivermectin Group 3: 3%
Ivermectin Group 4: 30%

The authors claim this had a p value of 0.421 from a Chi Square test. Just eyeballing this it seemed wildly off to me, and when calculating this I got a p value of about 8*10^-4. (The authors now accept that the actual p value should be <0.001 and state this was "a typographical error.")
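For anyone wanting to check this themselves, here is a minimal sketch in R. The counts are my reconstruction from the reported percentages of arms of 30 (i.e. 12, 16, 7, 7, 1 and 9 participants without a positive test), so treat them as approximate.

```r
# Participants without a positive test, reconstructed from the reported percentages
not_positive <- c(12, 16, 7, 7, 1, 9)
positive     <- 30 - not_positive

# 2 x 6 table: test status by trial arm
tab <- rbind(not_positive, positive)

chisq.test(tab)   # X-squared ~ 21 on 5 df, p ~ 8e-4
```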

I contacted the corresponding author at the address given in the journal article and requested the raw data but received no response. I then attempted contact through an email address I found online on an earlier preprint and also received no response. I then attempted contact through his institutional details at his university and also received no response. At that point I gave up and posted a comment to PubPeer pointing out the very unexpected imbalance in baseline data and suggested the trial should not be included in meta-analyses unless individual participant data (IPD) could be provided and reviewed.

Some time later I heard that Dr Niaee (who was not the corresponding author) had been discussing the trial with some other researchers and obtained his contact details. He provided me with the raw data set. Unfortunately, that was much more concerning than the summary data.

Individual participant data

The individual participant data contained a number of unexpected features. I'm not going to go through all of them here, mainly because I know that other researchers will be making blog posts over the coming days and that seems like pointless duplication of effort. Edit: Gid M-K's blog post is up here.

A few of the first things to jump out at me were:

- All patients with missing baseline data (6 patients) occurred in a single arm. This is extremely unlikely with less than a 1 in 10,000 chance of occurring if only random chance is at play.

- While the averages from the summary data were similar between groups, the ranges of values between arms were wildly different.

- Far fewer patients with low oxygen levels (<90) occurred in the ivermectin arms.

- Hypotensive patients appear far more frequently in some arms than others.

So I ran some statistical tests on how unlikely some of these mismatches were. The numbers below are slightly less extreme than those I presented in my first round of criticism to Dr Niaee. Dr Niaee objected to grouping arms into single-dose and multi-dose groups, and demanded I rerun the analysis with 6 independent groups of 30. I agreed.

We already knew that the chi square for whether patients had actually tested positive for coronavirus was 21, with a p value of 0.0008; this means the chance of a mismatch this extreme happening in genuinely randomised groups is less than 1 in 1,000.

The chi square for oxygen saturation less than 90 (a common criterion for going to hospital) between groups was 22.7, with a p value of 0.00038; this means the chance of a mismatch this extreme happening in genuinely randomised groups is less than 1 in 1,000.

The chi square for diastolic BP less than 75 between groups was 36, with a p value of 0.00000088; this means the chance of a mismatch this extreme happening in genuinely randomised groups is less than 1 in 1,000,000.
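Converting the reported chi square statistics to p values is a one-liner, assuming 5 degrees of freedom for a comparison across the 6 arms:

```r
# Reported chi square statistics for: positive test, SpO2 < 90, diastolic BP < 75
pchisq(c(21, 22.7, 36), df = 5, lower.tail = FALSE)
# ~ 8e-4, 4e-4 and 1e-6 respectively, in line with the p values quoted above
```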

But the biggest differences I'd noted, as I said before, were the differences in range and spread rather than average.

To test some of the differences in ranges I ran pairwise Levene's tests of equal variances between each arm. One advantage of Levene's test is that it has minimal assumptions, and does not require the results in either arm to be normally distributed. The p values are below. Some of these are extreme and suggest that certain arms (especially the high dose arms 5 and 6) simply couldn't have been randomly selected from the same population as the other arms.


For example, the p values for a test of equal variance between arm 5 (high dose ivermectin as a single dose) and the groups that did not receive ivermectin are less than 0.0000000001, and less than 0.0000000000000001 when arm 5 is tested against either low dose ivermectin group.

The latter corresponds to odds of roughly 100,000,000,000,000,000 to 1 against such differences arising by chance.

This is an extreme example of heteroskedasticity, where the variance of a measure (in this case diastolic blood pressure) is dependent on another measure. This should NOT arise in a genuine randomised controlled trial, as the arms are by definition sampled randomly from the same population.
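For anyone wanting to reproduce this kind of pairwise check on the IPD, here is a sketch using simulated placeholder data (arms of 30, as in the trial) rather than the actual data set; it relies on the leveneTest() function from the car package.

```r
library(car)   # provides leveneTest()

set.seed(1)
# Simulated placeholder data: one arm with a much tighter spread than the other
arm_1 <- rnorm(30, mean = 75, sd = 10)   # e.g. diastolic BP in one arm
arm_5 <- rnorm(30, mean = 75, sd = 3)    # an arm with artificially compressed spread

dat <- data.frame(dbp = c(arm_1, arm_5),
                  arm = factor(rep(c("arm_1", "arm_5"), each = 30)))

leveneTest(dbp ~ arm, data = dat)   # a small p value flags unequal variances
```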

Finer points and Final points

Firstly, it's tempting to dissect some of these differences and build narratives about what impacts they had on the final results. For example it is tempting to think "almost all deaths occurred in patients with oxygen saturation less than 90 at baseline, more of these patients were in the groups that didn't get ivermectin, so this favoured ivermectin". I would caution against this.

The key point here is that this trial clearly was not randomised: there are mismatches between groups so extreme they would not happen by chance if you had repeated the trial every day for the age of the universe. This means all of the assumptions based on randomisation are lost.

While you may be able to unpick the impacts of certain imbalances, you can no longer assume there is not an underlying systematic bias on unmeasured covariates. A trial which claims to be randomised and isn't is not a randomised controlled trial. This is, at best, now an observational study.

Secondly, it's important to remember that in trials with many authors, even where data are not genuine or methods are not accurately described, this is not a sufficient basis to conclude wrongdoing on the part of any individual author.

Thirdly, I've focussed on a couple of problems here that I noticed; there are far more, some of which my colleagues will put up in the next few days.

Summary

This paper claims to describe a trial in which patients were randomly allocated to treatments. This is not true. Extreme differences are seen between groups across multiple variables such as oxygen level, blood pressure, and SARS-CoV-2 test results before participants even got their first dose of medication. These differences are so extreme that in some cases the odds against them arising randomly are on the order of a hundred quadrillion (100,000,000,000,000,000) to 1.

We can reject, beyond any reasonable doubt, the claim that the study actually occurred as described.

The paper should be retracted.



Friday, 20 August 2021

Data from Cadegiani et al contains unexplained patterns.

Author's note: I became aware at the very end of drafting the post below that the preprint has now been published as a paper. Unfortunately, the reference to the availability of the dataset and the link to the dataset were removed between preprint and paper, presumably due to word limits and journal requirements. However, the data are still available at the link, which is still in the preprint.



The paper reports a slightly unusual research design. It appears to describe a classical two-arm observational study of patients stratified by treatment, with a third arm generated from "precise estimation" based on PubMed articles and government statements.

The authors, to their credit, uploaded their data to OSF here.

I laud this transparency, but the data are concerning. Certain features of the data are not consistent with having arisen from either an interventional or observational study.

When I examine studies I generally divide the data into:
  • Dichotomous data
  • Other categorical data
  • Interval (numeric) data.
In this case I am most interested in the dichotomous baseline data. Dichotomous data are often the less-loved cousin for integrity checking, but certain statistical features of data are generally preserved in true study data and less likely to arise in synthetic data from other processes. One of those is the independence of observations for a given baseline variable.

Assessing independence of observations for dichotomous measures in general (not this study):

Imagine one of the questions we ask of every participant in a theoretical study is "have you ever had a stroke?". The answer to this question will not be independent of the other answers the same patient gave. Patients with a history of hypertension or smoking will be more likely to have had a stroke. But, in most cases (and we'll get to the exception in a moment) the risk of one participant having had a stroke will not be determined by whether the patient before them had a stroke.

This leads to an interesting feature of truly random data. By knowing the frequency of a dichotomous data point we can actually predict how often we will see runs of two or three or four of the same result in a row. Humans, when attempting to emulate randomness, generally do not create enough clumps of data and substitute an even distribution for a random one.

Let's take a worked example.

Imagine that 50% of our sample of 1000 participants have had a stroke. 

Out of 999 possible consecutive pairs of patients, we would expect 25% of pairs in which neither participant has had a stroke (the probability of no stroke squared), 25% of pairs in which both participants have had a stroke (the probability of stroke squared), and 50% of pairs in which exactly one participant has had a stroke (1 minus the probability of a concordant pair).

These expected pairs are calculable for any dichotomous outcome where you have individual patient level data.
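A minimal sketch of that calculation, and a simulation check, for the stroke example above:

```r
# Expected number of consecutive "positive-positive" pairs for a dichotomous
# variable with prevalence p among n participants (n - 1 consecutive pairs)
expected_pairs <- function(n, p) (n - 1) * p^2

expected_pairs(1000, 0.5)   # ~ 250 pairs where both participants had a stroke

# Quick check by shuffling 500 ones and 500 zeroes and counting observed pairs
set.seed(1)
x <- sample(c(rep(1, 500), rep(0, 500)))
sum(head(x, -1) == 1 & tail(x, -1) == 1)   # typically close to 250
```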

I have posted analyses like this previously; for example, this was one thing I checked closely in Dr Eduardo Lopez-Medina's study. Below is the expected number of consecutive positive pairs vs the observed positive pairs for Dr Lopez-Medina's trial. As you can see there aren't any concerns here: these appear roughly as frequently as you would expect. Some are slightly lower, some are slightly higher, but they don't deviate massively and there is no systematic bias towards more or fewer pairs than expected.




At large numbers of observations the frequency of pairing is very reliable with really quite tight expected ranges. For truly independent observations the data should always look something like this.

However, as I indicated earlier, there are exceptions. Higher than expected numbers of pairs indicate a positive correlation and can occur especially where the frequency of a variable changes over the course of recruitment or differs between hidden consecutive subgroups. For example, if the rate of prior covid infection in the community increased dramatically over the course of recruitment for our theoretical study, we would see more pairs than expected because the "1"s would be clumped towards one end of the data set. If particular clinics or sites are reported in the data as consecutive blocks but not disclosed, and those sites have different rates of the baseline variable, again pairs will be higher than expected because the "1"s will be clumped together.

So there are a number of plausible situations in which positive correlation between consecutive patients could arise, i.e. where one patient having a feature makes patients after them also more likely to have that feature.

What is much harder to explain is negative correlation resulting in fewer than expected pairs. This is a preference for alternation over clumping, where a variable is spread more evenly through a data set than we would expect. The most obvious cause in an observational study is that the data were synthesised rather than collected. There are other, more far fetched situations in which this could occur in an observational study, but none apply here. There's really no obvious explanation for why one patient having had a stroke in our study should make the next patient significantly less likely to have had a stroke.


Now let's apply this analysis to this study.

So what do we see in the Cadegiani et al data set?

The data set reports a large number of dichotomous variables; in fact, columns AG to DP are all dichotomous baseline variables, covering pre-existing conditions and prior medication usage. So I extracted all of these for the first arm of 192 participants.

When we map the number of observed pairs vs expected pairs for all these variables in which at least one pair was expected we get a graph that looks very different:


These 22 variables, for which at least 1 pair was expected, have been arranged from the lowest number of expected pairs to the highest. What you can see is that, across the board, there are far fewer pairs than expected: 21 out of 22 variables have fewer pairs than expected, and for many the differences are massive.

It can be hard to appreciate how incredibly unlikely these deviations from expected are, so let's pick one out and look at it a few different ways.

Focus on Physical Activity - An example

The variable "regular physical activity" was the last column we extracted in our data set and had a big juicy difference so let's choose that.

There were 56 participants (29%) who were physically active and 136 who were not. Based on this we would expect slightly more than 16 pairs of consecutive physically active participants in the data set but we only saw 2. This is a big difference!
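That expected count is just the pair arithmetic from earlier: 191 consecutive pairs, each with a (56/192)^2 chance of both participants being active.

```r
(192 - 1) * (56 / 192)^2   # ~ 16.3 expected active-active pairs, versus 2 observed
```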

One way to think of this is to answer two questions:

  •  "if a participant is physically active what are the chances the next participant will be physically active?"
  •  "if a participant isn't physically active what are the chances the next participant will be physically active?"
If consecutive observations are independent, as we would expect, the two answers should be roughly the same.

In this study:
  • of 56 physically active participants, 2 were followed by another physically active participant
  • of 135 not physically active participants (the 136th was the final participant in the list, with nobody following them), 53 were followed by a physically active participant
So if the person above you in the table is active, you only have a 3.5% chance of also being active. On the other hand if the person above you is not active, your odds of being active go up more than 10 times to 39%! 

This is a negative correlation, not a positive one, so it can't be easily explained by non-random ordering of the list from concatenating study sites, or by a change in frequency over the course of the study.

If we construct this as a contingency table:


If you are a participant in this trial and the participant before you was sedentary, the chance of you being active is 11 times higher!

It is very, very hard to imagine an innocent explanation for a significant negative correlation between consecutive variables in a study of this nature, let alone an effect as truly massive as this. 

But how unlikely is this? Could this just be random chance? I can think of three ways to answer this question.

Firstly, we could simply do a chi square test for independence on the contingency table above, to test whether the observed value is independent of the following value. This gives us a p value of 0.0000006, putting the probability of a mismatch of this order arising by chance, if the data were actually independent, at roughly 6 per ten million.
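A sketch of that first check, with the 2 x 2 table reconstructed from the counts given above (2 of 56 active participants, and 53 of 135 inactive participants, were followed by an active participant):

```r
# Rows: current participant (active / inactive); columns: next participant
tab <- matrix(c(2, 54,
                53, 82),
              nrow = 2, byrow = TRUE,
              dimnames = list(current = c("active", "inactive"),
                              next_person = c("active", "inactive")))

chisq.test(tab, correct = FALSE)   # p on the order of 1e-7, consistent with the above
```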

A second way to think about this is as a binomial probability problem, "for a chance of each pair being positive of ~0.08507 and 191 trials (n-1), what is the probability of 2 or fewer being successes?" Plugging this into a binomial probability calculator gives a cumulative probability of 2 or fewer pairs of 0.000007.
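That second calculation is a one-liner:

```r
# P(2 or fewer active-active pairs) if each of the 191 consecutive pairs were
# independently "positive" with probability ~ (56/192)^2 ~ 0.08507
pbinom(2, size = 191, prob = (56 / 192)^2)   # ~ 7e-6
```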

I would argue this is an overestimate though. The binomial probability calculator answers the question of likelihood for an underlying frequency of participants who exercise, but we are really interested in the probability for a fixed observed frequency. 

So the third way is to simulate these results in R: take 56 positives and 136 negatives, shuffle them randomly, and count the number of pairs. I did this 5,000,000 times, and not one simulation resulted in 2 pairs or fewer (0.00014% of simulations resulted in 3 pairs, though).
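A smaller-scale sketch of that shuffle test (100,000 shuffles rather than 5,000,000) looks something like this:

```r
set.seed(1)
x <- c(rep(1, 56), rep(0, 136))   # 56 active, 136 not active

count_pairs <- function(v) sum(head(v, -1) == 1 & tail(v, -1) == 1)

sims <- replicate(1e5, count_pairs(sample(x)))
mean(sims)        # ~ 16 pairs expected under random ordering
mean(sims <= 2)   # almost certainly 0 at this number of shuffles
```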

When I have more time I will do a higher number of simulations, but the probability of such an enormous deviation on this variable alone is clearly not much higher than a handful in 10 million.

Now consider that 21 out of 22 variables had fewer than expected pairs.

Quite clearly these results, taken as a whole, are at least at the billions-to-one level of unlikelihood of occurring by chance.

Because the correlation is negative, that is, the data are MORE evenly spread throughout the sheet than we would expect by chance, this isn't something that unrecognised subgroups, unrecognised ordering by some unincluded variable, or a changing frequency over time alone could explain.

The Authors' Take.

I have spoken to two of the authors separately by videoconference today, Dr Cadegiani and one of his American collaborators.

Both disagreed strongly with the proposition that the data were potentially fraudulent.

However, neither was able to provide any explanation for these irregularities other than:
  • that the design of the study was not randomised;
  • that the data may have been rearranged from the order of collection based on some other variable;
  • that the data may concatenate subgroups based on day or clinic that may be dissimilar in frequency of observations.
I expressed to both authors that I did not accept any of these as an explanation, as all would be expected to increase the number of observed pairs, and result in less even distribution of observations, not an excessively even distribution as we have here; and that in the absence of some specific driver of negative correlation I consider these irregularities to be unexplained.

To Dr Cadegiani's credit, he did offer that I could go to Brazil and examine the original forms; however, given the extremity of the irregularities and the lack of reasonable explanation, even producing a complete set of paper forms would not allay the concerns arising from this data, in my opinion.

My Take.

This paper contains a distribution of numbers that is not what we would expect from genuinely collected observational data. These deviations are so large and so incredibly unlikely that we can safely dismiss the possibility that they arose by chance. The two potential explanations offered to me (unindicated subgroups, rearrangement of data by some variable not included in the data sheet) could not result in these deviations.

Unless study records reveal an explicit explanation for these irregularities (and I can't currently imagine any that would explain such consistent and extreme non-independence of consecutive measures in the form of negative correlation), the findings cannot be accepted even if a complete set of paper forms is produced.

This is a pattern that could arise from data that were generated rather than collected.

I would like to be explicit that I am not making an allegation of fraud against any specific author or their associated entities. Even where irregularities arise in data sets with multiple authors that cannot ultimately be explained, it is not usually reasonable to draw negative inference against all the authors involved. Authors are entitled to trust their collaborators, and researchers their employees.

If no convincing reason for these extreme abnormalities can be provided this paper should be retracted, ideally by the authors themselves rather than a forced retraction by the journal.

Friday, 6 August 2021

Russia appears to have faked the Sputnik Vaccine trial published in the Lancet.

Over the last 48 hours I've been percolating about the massive (30,000+ patient) RCT that has underpinned the Sputnik V vaccine and led to millions of doses being sold by the Russian Government's RDIF.

I spent most of that time tossing up whether this really was just luck or fraud. I've now found what I believe to be the smoking gun showing that the trial is faked.

Part one: Long odds and long log odds.

I was recently reading the Phase III trial results for the Sputnik V vaccine for Covid-19 in the Lancet, and the results are impressive: a 92% vaccine efficacy is nothing to sneeze at. In fact it is very effective in all age groups, and essentially identically so. But then I noticed something odd.

Looking at the actual numbers involved the authors got incredibly lucky. The results for efficacy are improbably uniform across ages. Every single result is 90-93% and every single one can be pushed outside this by adding or subtracting a single case. The actual numbers involved are tiny.

For instance, in the first age group only a single person in the vaccine group got covid. And it's lucky they did. If they didn't, or if a single extra person got covid, the results would be way off: with 1 fewer case the result would be 100% effective, with one more it would drop to about 85%. In fact this gives an idea how fragile these results are. Achieving such a tight band of results on such tiny numbers is very lucky. Achieving it five times in a row is even more unlikely.
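To see how fragile efficacy estimates are at these case counts, here is a purely illustrative sketch. The group sizes and case counts are hypothetical (a 3:1 vaccine:placebo split), not the trial's actual figures.

```r
# Vaccine efficacy as 1 minus the relative risk
ve <- function(cases_vax, n_vax, cases_plc, n_plc)
  1 - (cases_vax / n_vax) / (cases_plc / n_plc)

ve(1, 3000, 4, 1000)   # ~ 0.92
ve(0, 3000, 4, 1000)   # 1.00 with one fewer vaccine-arm case
ve(2, 3000, 4, 1000)   # ~ 0.83 with one more
```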

But how unlikely?



 

There are numerous possible ways to compute this.

On my twitter I've done a rough hack-and-slash approximation with a binomial probability calculator, estimating it's somewhere between a 1 in 1,000 and a 1 in 10,000 chance. While it's quick and good enough for twitter, there are some problems with it (especially in how it assigns cumulative probability based on deviating by a single case).

Another way would be to treat each age group as an independent trial. This is valid because, even if the underlying effect IS identical, the groups are still independent.

So I did that, and whacked it into Stata (which I bought a license for today, because apparently my uni thinks they shouldn’t provide a license for the medical faculty, only business).

So what did we get?



 

Well, the first thing you'll note is how incredibly lucky the authors got, visually. The square for each age group is the actual result; the further to the left it is, the more protection from the vaccine. But the confidence intervals are incredibly wide. We can only say with 95% confidence that the result fell somewhere between 50% effective and >99% effective.

Despite these wide confidence intervals they got bang on the money every time.

But how unlikely is this?



Well, we would have expected about 996.4 out of 1,000 repeats to be less even between groups. So the researchers got about 3-in-1,000 lucky by this estimate.

There are lots of ways to come at this question. I doubt you'll find anybody who thinks this is more likely than something on the order of x per 1,000, or less likely than something on the order of x per 100,000. Either way it's a very rare level of luck to get such very, very similar results with such a tiny number of events.

But luck happens.

There’s no statistical threshold that is impossible. A 1 in 100 result happens in 1 in 100 trials. A 1 in 1000 result happens in 1 in 1000 trials, and so on.

So did the authors get lucky or is it fraud?

Part 2: This isn’t the only vaccine.

I had a number of good conversations on twitter about this trial when I posted my concerns. Two in particular were helpful, one encouraging and one cautionary. Talking to David Manheim (https://twitter.com/davidmanheim) was helpfully cautionary: the inference only works if you assume the trial was faked outright, not just bumped up to make the trial more impressive. And it's a massive undertaking by thousands, state sponsored and with massive coverage, so the pre-test probability of fabricated data has to be low. He emphasized that there are lots of trials, and rare events in any single trial do occur eventually.

Talking to Florian Naudet (https://twitter.com/NaudetFlorian) was reassuring: he had similar concerns and had published a letter in the Lancet, but was running into the same wall I was, namely that there's no level of improbability alone that shows something to be impossible. Rare events happen rarely, but they do happen.

So I decided to sanity check by looking for comparators.

After all, Moderna, Pfizer, AZ and J+J all have vaccines that went through phase 3 trials; how homogeneous were their results by decade? This led to hours of frustration. I searched high, I searched low, but not one other phase 3 trial by any manufacturer broke down the data into such fine age gradations.

Then it hit me.

Of course they didn’t.

Nobody would plan a trial to do an analysis that only works less than 1% of the time. Why would any statistician make that decision? These are trials of 20-50 THOUSAND patients, costing tens of millions of dollars. These aren't exactly undergraduate student projects.

It would be crazy to plan this analysis up front.

But to be fair we don’t know that’s what the authors did. (Yet, keep reading.)

So we’re really left with two possibilities:

-        The authors could somehow guarantee IN ADVANCE they would get far more homogeneous data than was statistically likely (i.e. fraud).

-        The authors got very, very unexpectedly homogeneous data between groups, so decided to add on an analysis that only makes sense in this setting AFTER they had already seen their very lucky data (i.e. the "if you've got it, flaunt it" principle).

Part 3: A Question of Timelines.

One of the great things about big journals is that they check things. One of the things top tier journals (NEJM, JAMA, Lancet) are absolute on is that they won't accept a trial that isn't pre-registered. You can't come to a journal after the fact and say "so we did this"; you have to publicly put on a register what you are testing and how you define success.

So I went back to the Lancet paper. The Sputnik V trial was registered, and on a register that has absolute transparency of version history.

https://clinicaltrials.gov/ct2/history/NCT04530396?V_1=View#StudyPageTop

 



 

On the 27th of August 2020, the researchers submitted their plan for a decade-by-decade analysis to the registry. The month BEFORE the trial even started.

The researchers planned this analysis, which no other vaccine manufacturer did, which had a tiny chance of working as it was wildly under-powered, and which relied on the wildly improbably lucky data they went on to collect. The only real explanation here is that they knew what their results were before they injected the first patient. It's a smoking gun for fraud.

Part 4: Summing Up

I can accept that people get lucky. 1 in 100 chances happen 1 time in 100, 1 in 1000 chances 1 time in 1000.

But I can't accept knowing you're going to get that lucky months in advance. These researchers knew they would by August 2020, before the first patient was enrolled.

That can only be fraud.

This trial should be treated as fake. The Lancet article should be retracted immediately. The EMA and other regulators should not move ahead with trusting this trial, even if the missing paperwork is found. 

Part 5: Pre-empting the responses from patriots and troll farms

One argument will be that this isn't an underpowered study, and that it was a reasonable analysis because the trial was planned to be larger and run for longer; the luck is just that the homogeneity came so early. That's rubbish and here's why.

It ignores the LAYERS of luck and the MAGNITUDE of luck that went into this result: not only was the result more homogeneous than anybody could have (legitimately) known before it started, the total numbers are higher too. The placebo group had an infection rate more than 1,000% higher than the rate in Russia at the time the trial was planned. Even if we assume they thought the rate would go up a bit, the number of events observed is still almost triple the number expected even if every patient-day on the trial had been exposed at Russia's highest infection rate ever recorded.

If we calculate the number of infections in the first age group with TWICE as many patients (so more than were planned) followed up for the full six months, AND generously assume it's 6 months + the 21 days, using Russia's infection rate at the time the trial was approved, and assume an efficacy of 75%, we get roughly:

6 cases in 1000 controls

4 cases in 3000 vaccinated

The 95% CI for that OR ranges from 0.06 to 0.79. From the most effective Covid vaccine on earth to clinically useless.
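That interval can be reproduced with the standard Woolf (log odds ratio) method:

```r
# Hypothetical scenario from above: 6 cases in 1,000 controls, 4 in 3,000 vaccinated
cases_vax <- 4;  noncases_vax <- 3000 - 4
cases_plc <- 6;  noncases_plc <- 1000 - 6

or     <- (cases_vax / noncases_vax) / (cases_plc / noncases_plc)
se_log <- sqrt(1/cases_vax + 1/noncases_vax + 1/cases_plc + 1/noncases_plc)
round(exp(log(or) + c(-1.96, 1.96) * se_log), 2)   # 0.06 0.79
```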

NOBODY plans a trial this way, with numbers so small that even a highly effective intervention is quite likely to look like a dud by random chance. At least, nobody who has to rely on an experiment to generate their results does.