Tuesday, 22 March 2022

Evidence of Fabricated Data in a Vitamin C trial by Paul E Marik et al in CHEST

 Below is an email I have sent to Sentara Norfolk General Hospital, the editor of CHEST, and Prof Paul E Marik. As always, an allegation against a multi-author paper is not an allegation against any individual author.


To:

The Editor-in-Chief of "CHEST", Peter J. Mazzone

cc: The Board of Sentara Norfolk General Hospital
cc: The Compliance officer of Sentara Norfolk General Hospital
cc: Prof Paul E Marik

Dear Editor

Tonight on Twitter, a paper allegedly describing a 2017 study by a "Paul Marik" and team at Sentara Norfolk General Hospital, reporting a large survival benefit from vitamin C, was brought to my attention as a medical research finding that was not replicated in further studies and was later reversed. The user stated that, to their knowledge, no evidence of fraud in the conduct of this study had been identified. The study is available at https://doi.org/10.1016/j.chest.2016.11.036

Unfortunately, within about five minutes of reading the study it became overwhelmingly clear that it is indeed research fraud and that the data are fabricated.

While I would usually use cautious language about "unusual" or "unexpected" patterns in the data and describe "irregularities" and "concern", no such caution is warranted in this case. This is frankly audacious fraud. I have not requested access to the raw data or contacted the authors for an explanation because the fraud is so audacious that no other explanation is apparent.

Allow me to explain.

This paper allegedly describes a before-and-after study examining the effect of a new vitamin C-based treatment regimen on mortality in sepsis, claiming a roughly ten-fold reduction in death. Each cohort had exactly 47 patients, and the patients were not matched. We know this not just because matching was not mentioned, but because the authors specify that these were two cohorts of "consecutive" patients, which precludes patient matching by definition.

Based on this, we would expect that, in the absence of systematic bias, the p values for differences in dichotomous baseline characteristics (gender, demographics, comorbidities, diagnoses, etc.) would in most circumstances centre on 0.5. If systematic differences existed, the groups might be less similar and the p values might tend towards numbers below 0.5; that would not be suspicious in a non-randomised study. Systematic biases towards p values greater than 0.5 are not usually possible without matching (or some very rare pseudo-block designs not relevant here), except in the setting of fraud.
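
As a purely illustrative check of this expectation, the following R sketch simulates one baseline variable for two unmatched cohorts of 47 patients sharing a common event probability (the 0.3 event rate and the simulation size are arbitrary choices, not values from the paper):

# Null distribution of the Fisher exact p value for one baseline variable
# when two unmatched cohorts of 47 patients share the same event rate.
set.seed(1)
N <- 47
p_event <- 0.3                      # arbitrary illustrative event probability
sim_p <- replicate(10000, {
  a <- rbinom(1, N, p_event)        # events in the "before" cohort
  b <- rbinom(1, N, p_event)        # events in the "after" cohort
  fisher.test(matrix(c(a, N - a, b, N - b), nrow = 2))$p.value
})
hist(sim_p, breaks = 20, main = "Fisher exact p values with no systematic difference")
summary(sim_p)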

Unfortunately, every single primary diagnosis (Pneumonia, Urosepsis, Primary Bacteremia, GI/Biliary, "Other") is matched perfectly, with a Fisher exact test p value of 1.

Indeed, of the 22 dichotomous variables reported in Table 1, the majority have a p value of 1.

The following variables had a p value of 1, i.e. were distributed perfectly evenly across the two time periods (or as evenly as an odd total allows); a quick check of this with Fisher's exact test is sketched after the list.

Acute Kidney Injury: 61 patients
Vasopressors: 44 patients

Primary Diagnosis - Pneumonia: 37 patients

Heart Failure: 31 patients
Positive blood cultures: 26 patients
Primary Diagnosis - Urosepsis: 21 patients
COPD: 15 patients
Chronic Renal Failure: 15 patients
Primary Diagnosis - Primary Bacteremia: 14 patients
Primary Diagnosis - GI/Biliary: 12 patients
Drug addiction: 10 patients
Primary Diagnosis - Other: 10 patients
Lack of comorbidities: 3 patients
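
As a quick illustration (using counts from the annex below), even the most even possible split of an odd total, such as acute kidney injury (31 vs 30) or pneumonia (18 vs 19), yields a Fisher exact p value of exactly 1:

# Fisher exact test for the most even possible splits (47 patients per group)
fisher.test(matrix(c(31, 47 - 31, 30, 47 - 30), nrow = 2))$p.value   # AKI, 31 vs 30: p = 1
fisher.test(matrix(c(18, 47 - 18, 19, 47 - 19), nrow = 2))$p.value   # pneumonia, 18 vs 19: p = 1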

A more complete table of p values is annexed to the end of this complaint; the figure below shows the plotted p values for all 22 dichotomous variables in Table 1. Note that instead of an even spread between 0 and 1 (or clustering towards low values if the cohorts were dissimilar), most variables have a p value of 1, 21 of 22 variables have p values over 0.5, and no p value below 0.4 is seen for any of the 22 variables.

[Figure: plotted Fisher exact p values for the 22 dichotomous baseline variables in Table 1]

This actually presents a slight problem in estimating how unlikely these results are. The most common approach in this situation would be a Stouffer-type combination of the Fisher exact p values, but that would declare these results infinitely unlikely (as the majority of variables have a p value of exactly 1), when in reality the odds are probably no more extreme than trillions or quadrillions to one.
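
To illustrate the degeneracy (a sketch only, using the 22 Fisher exact p values listed in Annex 1): converting each p value to a z score and combining them Stouffer-style immediately blows up, because qnorm(1) is infinite.

# The 22 Fisher exact p values from Annex 1
pvals <- c(0.5354, 1, 0.5247, 0.409, 1, 0.7586, 1, 0.4856, 0.5516, 1, 0.773,
           0.7398, 1, 1, 1, 1, 1, 1, 0.5362, 1, 1, 1)
z <- qnorm(pvals)              # qnorm(1) is Inf
sum(z) / sqrt(length(pvals))   # combined Stouffer-type statistic: Inf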

A "quick and dirty" way to assess this would be to consider that the probability of any one variable having a p value over 0.4 is 60%, the binomial probability of 22 such measures having no observed values under 0.4 is close to (0.6)^22 or around 1 in 100,000, this assumes independence which is perhaps a little unfair, but likely underestimates the improbability of such a finding as the results are not evenly distributed between 0.4 and 1.

There is simply no apparent explanation for this other than fraud. The methods are described clearly, and these are explicitly two cohorts of consecutive patients, so the only real innocent explanation (undisclosed matching on baseline characteristics) did not occur. In any case, matching patients this closely would not be possible even with thousands of such patients to draw from.

While I understand your need to act fairly, there is overwhelming and irrefutable evidence, from the data in the paper alone and without any further material, that the data presented cannot have come from the method described and can only have been fabricated.

I urge you to retract this paper, or at least issue an expression of concern as soon as possible.

I have CC'd the institution's integrity officer in case they wish to institute integrity investigation proceedings and disciplinary action.

Yours Sincerely
Dr Kyle Sheldrick
Sydney Australia

Annex 1: p values for every dichotomous variable in Table 1

Variable                                Treated, with   Control, with   Treated, without   Control, without   p (Fisher exact)
Male                                    27              23              20                 24                 0.5354
No Comorbidity                          2               1               45                 46                 1
Diabetes                                16              20              31                 27                 0.5247
Hypertension                            20              25              27                 22                 0.409
Heart failure                           15              16              32                 31                 1
Malignancy                              5               7               42                 40                 0.7586
COPD                                    8               7               39                 40                 1
Cirrhosis                               6               3               41                 44                 0.4856
CVA                                     8               5               39                 42                 0.5516
CRF                                     7               8               40                 39                 1
Morbid Obesity                          6               8               41                 39                 0.773
Immunocompromised                       6               4               41                 43                 0.7398
Drug Addiction                          5               5               42                 42                 1
Primary Diagnosis: Pneumonia            18              19              29                 28                 1
Primary Diagnosis: Urosepsis            11              10              36                 37                 1
Primary Diagnosis: Primary Bacteremia   7               7               40                 40                 1
Primary Diagnosis: GI/Biliary           6               6               41                 41                 1
Primary Diagnosis: Other                5               5               42                 42                 1
Ventilation                             22              26              25                 21                 0.5362
Vasopressors                            22              22              25                 25                 1
AKI                                     31              30              16                 17                 1
Positive blood cultures                 13              13              34                 34                 1
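
These p values can be reproduced directly from the Table 1 counts (a short sketch; the exact values may differ slightly depending on the two-sided convention used):

# Counts transcribed from Table 1 of the paper (47 patients per group)
treated <- c(27, 2, 16, 20, 15, 5, 8, 6, 8, 7, 6, 6, 5, 18, 11, 7, 6, 5, 22, 22, 31, 13)
control <- c(23, 1, 20, 25, 16, 7, 7, 3, 5, 8, 8, 4, 5, 19, 10, 7, 6, 5, 26, 22, 30, 13)
N <- 47
annex_p <- mapply(function(a, b)
  fisher.test(matrix(c(a, N - a, b, N - b), nrow = 2))$p.value,
  treated, control)
round(annex_p, 4)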


______

19 comments:

  1. Edited: there are 22, not 23, dichotomous variables. (In converting the table, "Primary Diagnosis" became a row but was actually a subheading.)

  2. If Dr Sheldrick's conclusion is correct, this article exposes the weak underbelly of much of today's research: the greed and dishonesty of researchers, and how easily deception can be missed by gatekeepers in the dissemination of error when it originates with 'experts'. It is very difficult to quantify the harm done to patients.

    1. Thank you Dr. Sheldrick, for "double checking the maths," and for speaking out about scientific misconduct when you notice it. According to Google Scholar, this article has been cited some 780 times... and as a result, the "Marik Vitamin C ICU Treatment Protocol" is apparently in use today in hospitals all over the world. And sadly (but unsurprisingly, given your findings), Dr. Marik has been promoting Vitamin C as a treatment for COVID-19. Here's a link to one of those articles, which has now been officially RETRACTED, one suspects due to your statistical acumen: https://pubmed.ncbi.nlm.nih.gov/33317385/. However, the impact of Dr. Marik's choices does NOT end there. The next big challenge for medicine is, of course, "Long COVID" aka "PASC." Well, in this just-published overview of the current state of clinical trials of treatments for Long COVID... guess which supplement is most frequently mentioned in the Tables? That's right: Vitamin C. See: https://pubmed.ncbi.nlm.nih.gov/35282780/. So thanks very much for your work. It turns out that one man CAN make a difference, even if the other man has n=780 Google Scholar citations.

  3. The paper does not appear retracted as of today. Did you receive a response from the journal, or Dr. Marik himself in defense of the paper?

  4. I agree that the baseline data of this study is very suspiciously overbalanced, but your analysis is problematic. The p-value of Fisher's "exact" test is in general not exactly uniformly distributed in [0,1] under the null hypothesis, and the deviations from uniformity are substantial when samples and numbers of events are as small as they are in this study. A better option would be to use the mid-p version of Fisher's p-value, whose distribution is closer to uniform and whose expected value is exactly 1/2 (under the null hypothesis).

    Using the mid-p Fisher's p-value for count variables and a simple (approximate) z-test for continuous variables (excluding Procalcitonin, for which mean and standard deviation are not supplied), and running a million simulations (see R code below), I got that the probability of a mean p-value as high as the one observed in the study for count variables is of the order of 1/800, and that the probability of a mean p-value as high as the one observed in the study for all variables (both count and continuous, excluding Procalcitonin) is of the order of 1/290. In both cases I'm assuming that p-values for different variables are independent, which is a little unfair, as you mentioned.

    So, to summarize, the baseline data is indeed suspiciously overbalanced (and the fact that the study is not randomized makes it more suspicious --- unless there was some active attempt to find patients that had similar baseline data), but the oddity of this baseline data is not of the order of "trillions to quadrillions to one" or of 1 in one hundred thousand, but more like something of the order of 1 in a few hundred, depending on what statistics you use for the test. Also, one has to be careful with post hoc analysis of data, as if you dig enough and run too many tests you will always find stuff that "looks suspiciously improbable".

    fishermidp = function(k,s,n){
    # contingency table
    # k s-k
    # n-k n-(s-k)
    k = min(k,s-k)
    return(2*phyper(k-1,s,2*n-s,n)+ifelse(2*k==s,dhyper(k,s,2*n-s,n)/2,dhyper(k,s,2*n-s,n)))
    }

    # data for count variables
    treated = c(27,2,16,20,15,5,8,6,8,7,6,6,5,18,11,7,6,5,22,22,31,13)
    control = c(23,1,20,25,16,7,7,3,5,8,8,4,5,19,10,7,6,5,26,22,30,13)

    # data for continuous variables
    med1 = c(58.3,20.6,2.7,1.9,8.3,22.1,79.5,39.7)
    sd1 = c(14.1,13.5,1.5,1.4,2.8,6.3,16.4,16.7)
    med2 = c(62.2,17.1,3.1,1.9,8.7,22.6,82,41.6)
    sd2 = c(14.3,13.4,2.8,1.1,3.7,5.7,27.4,24.2)

    n = length(treated) # number of count variables
    ncont = length(med1) # number of continuous variables
    N = 47 # number of patients per group

    # p-values for continuous variables - z test (approximate)
    zval = abs(med1-med2)/sqrt((sd1^2+sd2^2)/N)
    pvalcont = 2*pnorm(zval,lower.tail=FALSE)

    pval = numeric(n) # Fisher mid p-values for count variables
    p = numeric(n) # estimated probability of event for count variables
    for(i in 1:n)
    {
    pval[i] = fishermidp(treated[i],treated[i]+control[i],N)
    p[i] = (treated[i]+control[i])/(2*N)
    }
    meanpvaldiscrete = mean(pval)
    meanpvalall = mean(c(pval,pvalcont))

    repetitions = 10^6
    meansdiscrete = numeric(repetitions) # simulated means of p-values for count variables
    meansall = numeric(repetitions) # simulated means for all p-values
    pvalsimul = numeric(n)
    for(i in 1:repetitions)
    {
    for(j in 1:n)
    {
    tr = rbinom(1,N,p[j])
    ct = rbinom(1,N,p[j])
    pvalsimul[j] = fishermidp(tr,tr+ct,N)
    }
    meansdiscrete[i] = mean(pvalsimul)
    meansall[i] = mean(c(pvalsimul,runif(ncont))) # assuming uniform distribution for p-values of continuous variables
    }

    print(paste("probability for count variables: 1 in ",repetitions/sum(meansdiscrete >= meanpvaldiscrete)))
    print(paste("probability for all variables: 1 in ",repetitions/sum(meansall >= meanpvalall)))

    1. I apologize for not signing the post, I meant to sign it using gmail when I posted it, but for some reason it didn't work. I'm Daniel Tausk from the Mathematics Department of University of São Paulo, Brazil, you can reach me through tausk@ime.usp.br

  5. As an interested mathematician and amateur statistician, I have posted a re-analysis of the evidence for fraud in this study: https://rpubs.com/shearerp/sheldrick-fraud-allegation-2022-04-11.

    As the commenters above note, Fisher's exact test may not be a good choice for this analysis. My post develops a significance test from first principles to address this situation. All derivations, code, and results are included. My results are different and the evidence of fraud is less compelling.

    1. Thank you for taking the time to put together such a thoughtful response; at first glance your code and results look very interesting to me. I will have a good read in a couple of weeks' time and hopefully leave a comment!

    2. I am genuinely unsure how things will shake out. We seem to be in an uncanny valley where the data definitely looks odd but it's unclear whether it's *extraordinarily* unlikely. I may take another pass to try and reach a more definite conclusion. Feel free to reach out privately, my name links to my LinkedIn.

    3. Update: I used a Monte Carlo method to assess the overall unlikelihood of the data and came up with about 1 in 3,000. Concerning, but perhaps not enough for a final judgment without additional follow-up.

      Details here: https://rpubs.com/shearerp/sheldrick-allegation-p2-2022-04-12

    4. One can use different models and tests to get a sense of how odd the data is, and different models/tests will yield different results. There is plenty of room for debate on which model/test is best, and I can't see a reason for electing one specific model/test as THE RIGHT ONE. On the other hand, expecting Fisher's exact p-value to follow a uniform distribution in this case is just wrong, and this is a simple verifiable mathematical fact. The expected value of Fisher's exact p-value is greater than 1/2, and in fact substantially greater than 1/2 in this case.

      Running the code below on R you can get a sense of how much. It plots the expected value of Fisher's exact p-value against the probability of event on each arm, assuming 47 patients per arm; it highlights in red the points corresponding to the probabilities of event (estimated from the data) for the 22 count variables in the study, and it computes the mean value of the 22 expected values of Fisher's exact p-values. The result is 0.647 (rounding to 3 decimal places). You can check these numbers using Monte Carlo simulation as well.

      Best,
      Daniel Tausk

      fisher = function(k,s,n){
      # Fisher exact p-value
      # contingency table
      # k s-k
      # n-k n-(s-k)
      k = min(k,s-k)
      return(ifelse(2*k==s,1,2*phyper(k,s,2*n-s,n)))
      }

      expfisher = function(n,p){
      # expected value of Fisher exact p-value
      # first line of contingency table sampled from two independent Binom(n;p)
      # both columns of contingency table have sum equal to n
      total = 0
      for(i in 0:n)
      {
      total = total + fisher(i,2*i,n)*dbinom(i,n,p)^2
      }
      for(i in 1:n)
      {
      for(j in 0:(i-1))
      {
      total = total + 2*fisher(i,i+j,n)*dbinom(i,n,p)*dbinom(j,n,p)
      }
      }
      return(total)
      }

      expfisherMC = function(n,p,repetition){
      # expected value of Fisher exact p-value by Monte Carlo simulation
      k1 = rbinom(repetition,n,p)
      k2 = rbinom(repetition,n,p)
      pval = numeric(repetition)
      for(i in 1:repetition)
      {
      pval[i] = fisher(k1[i],k1[i]+k2[i],n)
      }
      return(mean(pval))
      }

      N = 47 # number of patients per arm
      p = seq(from=0,to=1,by=0.001)
      expval = sapply(p,function(x) expfisher(N,x))
      plot(p,expval,type="l",xlab="probability of event on each arm",ylab="expected value of Fisher exact p-value",ylim=c(0.5,1))

      # data for count variables
      treated = c(27,2,16,20,15,5,8,6,8,7,6,6,5,18,11,7,6,5,22,22,31,13)
      control = c(23,1,20,25,16,7,7,3,5,8,8,4,5,19,10,7,6,5,26,22,30,13)
      n = length(treated) # number of count variables

      prob = numeric(n)
      for(i in 1:n)
      {
      prob[i] = (treated[i]+control[i])/(2*N) # estimated probability for event on i-th count variable
      }
      expvalprob = sapply(prob,function(x) expfisher(N,x))
      points(prob,expvalprob,col="red")

      print(mean(expvalprob))

    5. Update: I implemented Paul Shearer's approach using my own code (see below) and with 1 million simulations I obtained a probability of the order of 1 in 4500, which is similar to what he obtained (recalling that we are both using Monte Carlo approximations, so we shouldn't expect exact agreement).

      Best,
      Daniel Tausk

      R Code:

      testbinom = function(k,n){
      # probability of closed interval with extremities k and n-k with respect to Binom(n;1/2)
      k = min(k,n-k)
      return(1-2*pbinom(k-1,n,1/2))
      }

      # data for count variables
      treated = c(27,2,16,20,15,5,8,6,8,7,6,6,5,18,11,7,6,5,22,22,31,13)
      control = c(23,1,20,25,16,7,7,3,5,8,8,4,5,19,10,7,6,5,26,22,30,13)

      n = length(treated) # number of count variables
      num = treated+control # number of events per count variable

      pval = numeric(n)
      for(i in 1:n)
      {
      pval[i] = testbinom(treated[i],num[i])
      }
      sumlog = sum(log(pval))

      repetition = 10^6
      sumlogsimul = numeric(repetition)
      pvalsimul = numeric(n)
      for(i in 1:repetition)
      {
      for(j in 1:n)
      {
      tr = rbinom(1,num[j],1/2)
      pvalsimul[j] = testbinom(tr,num[j])
      }
      sumlogsimul[i] = sum(log(pvalsimul))
      }

      print(paste("1 in ",repetition/sum(sumlogsimul <= sumlog)))

    6. Now, my comments on Paul Shearer's approach: I think what he did is just fine, but I tried something a little different as a sensitivity analysis and got (surprisingly) quite different results, namely a probability of the order of 1 in 490. Paul Shearer's model is based on considering the total number s of "yes" patients for a given count variable as fixed and the number of "yes" patients for that variable in the treatment group as randomly drawn from Binom(s;1/2).

      While this is apparently fine, normally when comparing two groups of N patients one would consider the number of "yes" patients in the treatment group as drawn from Binom(N;p) and the number of "yes" patients in the control group as drawn independently from Binom(N;p), where p is the common probability of "yes" for both groups (p depends on the count variable, of course). Now this model introduces an unknown nuisance parameter p, which makes testing harder to handle. A dirty approximate approach would be to replace p with a point estimate obtained from the data. An alternative exact approach (in the spirit of Fisher's test) would be to condition on s, which eliminates the dependence on p: after conditioning on s the number of "yes" treated patients follows a hypergeometric distribution (number of white balls in N draws without replacement from an urn containing 2N balls, of which s are white; equivalently, number of white balls in s draws without replacement from an urn containing 2N balls, of which N are white). This hypergeometric distribution sort of approximates Binom(s;1/2) (with better approximation for s << 2N), which is the distribution used by Paul Shearer.

      Now, if I do exactly what Paul Shearer did, but replacing Binom(s;1/2) with the corresponding hypergeometric distribution, I get a probability of the order of 1 in 490 rather than 1 in 4500 (a big difference, surprisingly). R code is given below.

      On a side note, in my original approach (first comment in the post, in which I used as test statistic the mean of the 22 mid-p Fisher exact p-values) I employed the dirty approximate approach of replacing the nuisance parameter p with its point estimate; I tried later to replace the dirty approach with the exact approach (based on conditioning on s and using the hypergeometric distribution) and got similar results to the dirty approximate approach.

      To summarize, my conclusion after all these analyses is that reasonable approaches to this problem should yield either probabilities of the order of 1 in a few hundred or of 1 in a few thousand.

      Best,
      Daniel Tausk

      R code:

      testhyper = function(k,s,n){
      # probability of closed interval with extremities k and s-k with respect to
      # hypergeometric distribution corresponding to n balls drawn from an urn with 2*n balls and s "success" balls
      k = min(k,s-k)
      return(1-2*phyper(k-1,s,2*n-s,n))
      }

      # data for count variables
      treated = c(27,2,16,20,15,5,8,6,8,7,6,6,5,18,11,7,6,5,22,22,31,13)
      control = c(23,1,20,25,16,7,7,3,5,8,8,4,5,19,10,7,6,5,26,22,30,13)

      n = length(treated) # number of count variables
      N = 47 # number of patients per group
      num = treated+control # number of events per count variable

      pval = numeric(n)
      for(i in 1:n)
      {
      pval[i] = testhyper(treated[i],num[i],N)
      }
      sumlog = sum(log(pval))

      repetition = 10^6
      sumlogsimul = numeric(repetition)
      pvalsimul = numeric(n)
      for(i in 1:repetition)
      {
      for(j in 1:n)
      {
      tr = rhyper(1,num[j],2*N-num[j],N)
      pvalsimul[j] = testhyper(tr,num[j],N)
      }
      sumlogsimul[i] = sum(log(pvalsimul))
      }

      print(paste("1 in ",repetition/sum(sumlogsimul <= sumlog)))

  6. This is nice. After I posted, I was wondering if we could derive the distribution of the yeses between the treatment and control groups. Using a symmetry argument we can be sure that the mean in each group is half the total yeses, but is it binomial? I assumed it was, and it seems like a good approximation at least, but it would be interesting to have a theoretical result one way or the other.

    1. Ah, shoot. It's definitely not binomial. You can see this by thinking about the case where the number of yeses exceeds 47. In this case, not all values 0 through 47 can occur in a single cell, so it's definitely not binomial.

      In lieu of a better approach I would defer to Daniel's above, with the 1 in 490 result.

    2. At any rate, it's looking like the fraud evidence here isn't so strong. The p value isn't incredibly low and it could be sensitive to violations of our assumptions like independence and distribution approximations.

    3. Good point concerning the hypothetical case in which the total number of yeses exceeds 47. In that case the binomial distribution is clearly wrong. Also, note that in the binomial approach the number 47 is never even used in the analysis, which is weird.

      Concerning a theoretical result, well, one has to assume some stochastic model of how the data was generated in order to do statistical testing, so we can't prove a theorem that yields a distribution out of nothing. But, if we assume that the number k1 of yes patients in the treatment group is sampled from Binom(N1;p) and that the number k2 of yes patients in the control group is independently sampled from Binom(N2;p) (with N1=N2=47 in this example and p unknown), then it is well-known and easy to prove that the conditional distribution of k1 given k1+k2=s is hypergeometric (just note that k1+k2 is Binom(N1+N2;p) and compute the relevant quotient). Assuming that k1 and k2 are independently sampled from binomials with the same parameter p would be the standard approach in the case of a randomized comparative study (of course this one isn't randomized).

      Finally, I observe that one could use other test statistics instead of the product of the p-values. The product is a natural choice and it is the basis for Fisher's method of combining p-values, but there are other popular methods of combining p-values. I haven't tried other methods and I'm not sure if the results would be similar.
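
      A quick numerical check of that conditional-distribution fact (an illustrative sketch; the values of p and s below are arbitrary):

      # With k1 ~ Binom(N1,p) and k2 ~ Binom(N2,p) independent, the conditional
      # distribution of k1 given k1 + k2 = s is hypergeometric and does not depend on p
      N1 = 47; N2 = 47; p = 0.3; s = 26
      k = max(0, s - N2):min(N1, s)
      cond = dbinom(k, N1, p) * dbinom(s - k, N2, p)
      cond = cond / sum(cond)       # condition on the total k1 + k2 = s
      hyper = dhyper(k, N1, N2, s)  # s draws from an urn with N1 white and N2 black balls
      all.equal(cond, hyper)        # TRUE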

      Daniel Tausk

    4. This is great. I think the hypergeometric approach must be about the best possible approach to this problem. I added a link to your post in my Part 2 notebook, along with some disclaimers that the binomial approach is an approximation.

      I don't think 1 in 490 is quite enough to declare fraud, especially given the other uncertainties involved in the analysis, but it is enough for some healthy skepticism and perhaps even challenging the authors to produce the raw data.
