How to calculate the standard error of disease prevalence in serologic surveys: graphical models to the rescue

April 22, 2020

I can’t believe I am writing a follow up post about the analysis of COVID19 seroprevalence data, but it seems we now have a cottage industry of poorly executed and analyzed studies that make wild claims about the prevalence of asymptomatic COVID19 infections. You can read about the Santa Clara study here, which was widely criticized not only methodologically, but also left many prominent commentators wondering about the actual motives of the authors.

The same group is now reporting high seroprevalence rates in Los Angeles, but with an interesting twist: the press release came before the preprint, the data and the methodology were not made available. Hence, I decided I will not bother with these claims of the more recent study, until everything has been disclosed.


I decided to revisit the Santa Clara paper, because of Twitter commentary showing that the authors made numerous mistakes in the calculation of the uncertainty of their estimates:

We could probably work out the issues with the delta method, but I have had way too many zoom sessions for the day to think clearly. In any case, you can go back in time and read about how to compute variances of ratios in this very blog.

However, it makes more sense to show how to compute the prevalence (both point estimate and its uncertainty) with graphical (causal) models. This way, you will never have to use approximations based on Taylor series, unless of course you are trying to keep your math skills sharp.

I will discuss two different graphical models for these calculations: both of them can be given a Bayesian interpretation, but the first model allows feedback of the serologic survey data to influence the estimates of the assay sensitivity and specificity, while the other does not. The second model, is in fact rather rudimentary to reinterpret as a non Bayesian likelihood model, so that even people who don’t dig Bayes (“what is wrong with you?”) may feel comfortable using. In case you have not heard the term “graphical models” before you should not despair: graphical models are extremely clever ways to organize and keep track of probabilistic calculations, while also serving as communication devices that show how the different quantities under consideration influence each other.


The data
Similar to our previous post we will assume that we have access to three different datasets, shown below.

DatasetDataDeterminants of +ve result
Gold Standard
COVID19 +ve
TP, NPSensitivity
Gold Standard
COVID19 -ve
TN, NNSpecificity
Serological SurveyP, NPrevalence, Sensitivity, Specificity
Datasets needed to calculate prevalence in a serological survey

The first dataset is a gold standard dataset of people who are known to have had the disease. This dataset is used to determine sensitivity as a ratio of the True Positive assay results (TP) over all the disease +ve (NP) participants. The second dataset is one of people who are known never to had had the disease. This dataset is used to determine specificity as the ratio of the True Negative (TN) individuals over all individuals without the disease (NN). Such gold standard datasets could be provided by the manufacturer, or could even be determined locally. We assume that these data are available to us, and we will not engage in debate about the disease determinations were made in the gold standard data. In any case, these datasets provide information that is needed to estimate the prevalence in the third dataset, i.e. the serological survey itself. The ratio of positive tests (P) in the serological survey over all participants in the survey (N) , i.e. the survey positivity rate (p) is a mixed bag of true positives and false positives. This rate is related to prevalence (\pi) , sensitivity and specificity through a very simple formula:


p = \frac{P}{N} = \pi\times Sens +(1-\pi) \times (1-Spec)

Knowing the sensitivity, specificity and the positive rate in the survey would allow one to solve the equation above for \pi and thus estimate the prevalence. Since the sensitivity and specificity are almost never known exactly, e.g. they are estimated from samples, one would like to take into account the uncertainty about these determinants of assay performance when calculating the prevalence. This is precisely why we need statistical analyses for!

A full Bayesian solution

The first probabilistic model (which was used in my previous post) relates the observables (second column of the previous table) to the parameters (unobserved quantities of interest) through the following Directed Acyclical Graphical model.

When executed as a Bayesian calculation, this model will take in the data in the table and produce estimates of the prevalence, sensitivity and specificity that are dependent on both the gold standard and survey data. It should be stressed that if this model is used, then the survey data are going to feedback into the estimates of sensitivity and specificity, which will no longer be determined solely by the gold standard data. STAN code to fit this DAG is shown below. Once the program executes, point and uncertainty estimates may directly obtained from Markov Chain Monte Carlo (MCMC) output e.g. by looking at the mean and the standard deviation.

model0.R<-'
functions {
}
data {

  int casesSurvey;          // Cases in survey
  int NSurvey;              // ppl sampled in survey
  
  int TPGoldStandard;       // true positive in Gold Standard Sample
  int NPGoldStandard;       // Total number of positive cases in Gold Standard Sample
  
  int TNGoldStandard;       // true negative in Gold Standard Sample
  int NNGoldStandard;       // Total number of negative cases in Gold Standard Sample


}
transformed data {

}
parameters {
  real<lower=0.0,upper=1.0> Sens;
  real<lower=0.0,upper=1.0> Spec;
  real<lower=0.0,upper=1.0> Prev;

}
transformed parameters {


}
model {

  casesSurvey~binomial(NSurvey,Prev*Sens+(1-Prev)*(1-Spec));
  TPGoldStandard~binomial(NPGoldStandard,Sens);
  TNGoldStandard~binomial(NNGoldStandard,Spec);
  
  Prev~uniform(0.0,1.0);
  Spec~uniform(0.0,1.0);
  Spec~uniform(0.0,1.0);
}
generated quantities {

}'

Compartmentalizing the calculations

Some people may object to a particular feature of the previous model: the feedback of the serological survey data back to the estimates of the sensitivity and specificity. This is an unavoidable feature of the parameterization employed and the logical consistency of Bayesian calculations on graphs. However, there is a way to reorganize the calculation and avoid the feedback from taking place. The relevant model is shown below:

The three datasets are now analyzed separately, and each dataset only influences the binomial probability of the corresponding model: sensitivity, specificity and survey positivity rate. This analysis can be run either in a Bayesian mode, or in frequentist mode (e.g. by running three logistic regressions, one for each dataset). After the analyses have concluded we sample the distributions for the three rates, and compute the prevalence according to the equation:

\pi = \frac{p+Spec-1}{Sens+Spec-1}

This is a simple Monte Carlo calculation, which can be applied to either a Bayesian posterior probability or a sampling distribution (e.g. by sampling from the normal distribution with mean equal to the parameter estimate and standard deviation equal to the standard error of the model fit). Once we have simulated many values for \pi, we discard those that are negative or greater than one, and form means and standard deviations to generate point summaries and uncertainty estimates. The following code, shows how this model can be implemented in STAN. The generated quantities section samples the values of \pi, and the values that are not compatible with the range constraints are discarded in R.

## uses simulation
model1.R<-'
functions {
}
data {
  
  int casesSurvey;          // Cases in survey
  int NSurvey;              // ppl sampled in survey
  
  int TPGoldStandard;       // true positive in Gold Standard Sample
  int NPGoldStandard;       // Total number of positive cases in Gold Standard Sample
  
  int TNGoldStandard;       // true negative in Gold Standard Sample
  int NNGoldStandard;       // Total number of negative cases in Gold Standard Sample


}
transformed data {

}
parameters {
  real<lower=0.0,upper=1.0> Sens;
  real<lower=0.0,upper=1.0> Spec;
  real<lower=0.0,upper=1.0> totposrate;

}
transformed parameters {


}
model {

  casesSurvey~binomial(NSurvey,totposrate);
  TPGoldStandard~binomial(NPGoldStandard,Sens);
  TNGoldStandard~binomial(NNGoldStandard,Spec);
  
  totposrate~uniform(0.0,1.0);
  Spec~uniform(0.0,1.0);
  Spec~uniform(0.0,1.0);
}
generated quantities {
real Prev;

Prev = (totposrate+Spec-1)/(Sens+Spec-1);

}'

Comparison between the two approaches

How different are these two approaches? I used R and STAN to rerun the analysis of the Santa Clara survey (for simplicity, the gold standard datasets from the Stanford and manufacturer were combined together). The code is available as a word document below.

Parameter Full Bayesian ApproachCompartmentalized Approach
Prevalence0.010 (0.005)0.011 (0.005)
Sensitivity0.838 (0.033)0.838 (0.033)
Specificity0.993 (0.003)0.993 (0.004)
Point Estimates and Standard Errors (in parentheses)

The point and uncertainty estimates of all quantities are virtually indistinguishable, showing that in this particular case, feedback from the survey data does not affect the estimates of sensitivity and specificity.

Take home point: leave the delta method behind, or at least use higher order asymptotics when computing variances of ratio variables. Or if you are lazy like me, just fire your Monte Carlo samplers.

Is the COVID-19 Antibody Seroprevalence in Santa Clara County, really 50-85 fold higher than the number of confirmed cases?

April 20, 2020

I am writing this blog post (the first after nearly two years!) in lockdown mode because of the rapidly spreading SARSCoV2 virus, the causative agent of the COVID19 disease (a poor choice of a name, since the disease itself is really SARS on steroids).

One interesting feature of this disease, is that a large number of patients will manifest minimal or no symptoms (“asymptomatic” infections), a state which must clearly be distinguished from the presymptomatic phase of the infection. In the latter, many patients who will eventually go on to develop the more serious forms of the disease have minimal symptoms. This is contrast to asymptomatic patients who will never develop anything more bothersome than mild symptoms (“sniffles”), for which they will never seek medical attention. Ever since the early phases of the COVID19 pandemic, a prominent narrative postulated that asymptomatic infections are much more common than symptomatic ones. Therefore, calculations such as the Case Fatality Rate (CFR = deaths over all symptomatic cases) mislead about the Infection Fatality Rate (IFR = deaths over all cases). Subthreads of this narrative go on to postulate that the lockdowns which have been implemented widely around the world, are an overkill because COVID19 is no more lethal than the flu, when lethality is calculated over ALL infections.

Whereas, the politicization of the lockdown argument is of no interest to the owner of this blog (after all the virus does not care whether its victim is rich or poor, white or non white, Westerner or Asian), estimating the prevalence of individuals exposed to the virus but never developed symptoms is important for public health, epidemiological and medical care reasons. Since these patients do not seek medical evaluation, they will not detected by acute care tests (viral loads in PCR based assays). However such patients, may be detected after the fact by looking for evidence of past infection, in the form of circulating antibodies in the patients’ serum. I was thus very excited to read about the release of a preprint describing a seroprevalence study in Santa Clara County, California. This preprint described the results of a cross – sectional examination of the residents in the county in Santa Clara, with a lateral flow immunoassay (similar to a home pregnancy kit) for the presence of antibodies against the SARSCoV2 virus. The presence of antibodies signifies that the patient was not only exposed at some point to the virus, but this exposure led to an actual infection to which the immune system responded by forming antibodies. These resulting antibodies persist for far longer than the actual infection and thus provide an indirect record of who was infected . More importantly, such antibodies may be the only way to detect asymptomatic infections, because these patients will not manifest any symptoms that will make them seek medical attention, when they were actively infected. Hence, the premise of the Santa Clara study is a solid one and in fact we need many more of these studies. But did the study actual deliver? Let’s take a deep dive in the preprint.


What did the preprint claim to show?
The authors major conclusions are :

  1. The population prevalence of SARS-CoV-2 antibodies in Santa Clara County implies that the infection is much more widespread than indicated by the number of confirmed cases.
  2. Population prevalence estimates can now be used to calibrate epidemic and mortality projections.

Both conclusions rest upon a calculation that claims :

These prevalence estimates represent a range between 48,000 and 81,000 people infected in Santa Clara County by early April, 50-85-fold more than the number of confirmed cases.

https://www.medrxiv.org/content/10.1101/2020.04.14.20062463v1

If these numbers were true, they would constitute a tectonic shift in our understanding of this disease. For starters, this would imply that the virus is substantially more contagious than what people think (though recent analyses of both Chinese and European data also show this point quite nicely). Secondly, if the number of asymptomatic infections is very high, then perhaps we are close to achieving herd immunity and thus perhaps we can relax the lockdowns. Finally, if the asymptomatic infections are numerous, then the disease is not that lethal and thus the lockdowns were an over-exaggeration , which killed the economy for the “sniffles” or the “flu”. Since the author’s argument rests upon a calculation, it is important to ensure that the calculation was done correctly. If we find deficiencies in the calculation, then the author’s conclusions become a collapsing house of cards. Statistical calculations of this sort can mislead as a result of poor data AND/OR poor calculation. While this blog post focuses on the calculation itself, there are certain data deficiencies that will be pointed out along the way.

How did the authors carry out their research?

The high level description of the approach the authors took, may be found in their abstract:

 We measured the seroprevalence of antibodies to SARS-CoV-2 in Santa Clara County. Methods On 4/3-4/4, 2020, we tested county residents for antibodies to SARS-CoV-2 using a lateral flow immunoassay. Participants were recruited using Facebook ads targeting a representative sample of the county by demographic and geographic characteristics. We report the prevalence of antibodies to SARS-CoV-2 in a sample of 3,330 people, adjusting for zip code, sex, and race/ethnicity. We also adjust for test performance characteristics using 3 different estimates: (i) the test manufacturer’s data, (ii) a sample of 37 positive and 30 negative controls tested at Stanford, and (iii) a combination of both

https://www.medrxiv.org/content/10.1101/2020.04.14.20062463v1

First and foremost, we note the substantial degree of selection bias inherent in the use of social media for recruitment of participants in the study. This resulted in a cohort that was quite non-representative of the residents of Santa Clara county and in fact the authors had to rely on post-stratification weighting to come up with a representative cohort. These methods can produce reasonable results if the weighting scheme is selected carefully. Andrew Gelman analyzed the scheme and strategy adopted by the study authors. His blog post, is an excellent read about the many design issues of the Santa Clara study and you should read it, if you have not read it already. There are many other sources of selection bias identified by other commentators:

When taken together these comments suggest some serious deficiencies with the source data. Notwithstanding these issues, a major question still remains: namely whether the authors properly accounted for the assay inaccuracy in their calculations. Gelman put it succinctly as follows:

This is the big one. If X% of the population have the antibodies and the test has an error rate that’s not a lot lower than X%, you’re in big trouble. This doesn’t mean you shouldn’t do testing, but it does mean you need to interpret the results carefully.

https://statmodeling.stat.columbia.edu/2020/04/19/fatal-flaws-in-stanford-study-of-coronavirus-prevalence/

The shorter version , of this argument would thus characterize the positive tests in this “bombshell” preprint as :

A related issue concerns the uncertainty estimates reported by the authors, which is a byproduct of the calculations . Again, this issue was highlighted by Gelman:

3. Uncertainty intervals. So what’s going on here? If the specificity data in the paper are consistent with all the tests being false positives—not that we believe all the tests are false positives, but this suggests we can’t then estimate the true positive rate with any precision—then how do they get a confidence nonzero estimate of the true positive rate in the population?

who similarly to Rupert Beale goes on to conclude:

First, if the specificity were less than 97.9%, you’d expect more than 70 positive cases out of 3330 tests. But they only saw 50 positives, so I don’t think that 1% rate makes sense. Second, the bit about the sensitivity is a red herring here. The uncertainty here is pretty much entirely driven by the uncertainty in the specificity.

To summarize, the authors of the preprint conclude that there many many more infections in Santa Clara than those captured by the current testing, while others looking at the same data and the characteristics of the test employed think that the asymptomatic infections are not as frequent as the Santa Clara preprint claims to be. In the next paragraphs, I will reanalyze the summary data reported by the preprint authors and show they are most definitely wrong: while there are asymptomatic COVID19 presentations, their number is no where close to being 50-80 fold higher than the symptomatic ones as the authors claim.

A Bayesian Re-Analysis of the Santa Clara Seroprevalence Study

There are three pieces of data in the preprint that are relevant in answering the preprint question:

  • The number of positives (y=50) out of (N=3,330) tests performed in Santa Clara
  • The characteristics of the test used as reported by the manufacturer. This is conveniently given as a two by two table cross classifying the readout of the assay (positive: +ve or negative: -ve) in disease +ve and -ve gold standard samples. In these samples, we assume that the disease status of the individuals assayed is known perfectly. Hence a less than perfect alignment of the assay results to the ground truth, points towards assay imperfections or inaccuracies. For example, a perfect assay would classify all COVID19 +ve patients as positive and all COVID19 -ve as assay negative.
Assay +veAssay -ve
COVID19 +ve787
COVID19 -ve2369
Premier Biotech Lateral Flow COVID19 Assay, Manufacturer Data
  • The characteristics of the test in a local, Stanford cohort of patients. This may also be given as 2 x 2 table:
Assay +veAssay -ve
COVID19 +ve2512
COVID19 -ve030
Premier Biotech Lateral Flow COVID19 Assay, Stanford Data

We will build the analysis of these three pieces of data in stages starting with the analysis of the assay performance in the gold standard samples.

Characterizing test performance

Eyeballing the test characteristics in the two distinct gold standard sample collections, immediately shows that the test behaved differently in the gold standard samples used by the manufacturer and by the Stanford team. In particular, sensitivity (aka true positive rate is much higher in the manufacturer samples v.s. the Stanford samples, i.e. 78/85 vs 25/37 respectively. Conversely, the Specificity (or true negative rate) is smaller in the manufacturer dataset than the Stanford gold standard data: 369/371 is certainly smaller than 30/30. Perhaps the reason for these differences may be found in how Stanford constructed their gold standard samples:

Test Kit Performance

The manufacturer’s performance characteristics were available prior to the study (using 85 confirmed positive and 371 confirmed negative samples). We conducted additional testing to assess the kit performance using local specimens. We tested the kits using sera from 37 RT-PCR-positive patients at Stanford Hospital that were also IgG and/or IgM-positive on a locally developed ELISA assay. We also tested the kits on 30 pre-COVID samples from Stanford Hospital to derive an independent measure of specificity.

Perhaps, the ELISA assay developed in house by the Stanford scientists is more sensitive than the lateral flow assay used in the study. Or the Stanford team used the lateral flow assay in a way that decreased its sensitivity. While lateral flow immunoassays are conceptually, “yes” or “no” assays, there is still room for suspect or intermediate results (as shown below for the pregnancy test)

Operation of a lateral flow immunoassay

While no information is given about how the intermediate results were handled (or whether the assay itself generates more than one grades of intermediate results), it is conceivable that these were handled differently by the investigators . For example it is possible that intermediate results the manufacturer labelled as positive in their validation, were labeled as negative by the investigators. Even more likely, the participant samples were handled differently by the investigators, so that the antibody levels required to give a positive result differed in the Santa Clara study relative to the tests run by the manufacturer. In either case, the Receiver Operating Characteristic (ROC) curve in the field differs from the one of the manufacturer. Furthermore, we have very little evidence that it will not differ again in future applications. Finally, there is the possibility of good old fashioned sampling variability in the performance of the assay in the gold standard samples, without any actual substantive differences in the assay performance. Hence, the two main possibilities we considered are:

  • Common Test Characteristic Model: We assume that random variation in the performance of the test in the two gold standards is at play, and we estimate a common sensitivity and specificity over the two gold standard sets. In probabilistic terms, we will assume the following model for the true positive and negative rates in the two gold standard datasets :
    • TP_i \sim binomial(TP_i+FN_i,Sensitivity) and
    • TN_i \sim binomial(TN_i+FP_i,Specificity) .
      In these expressions, the TP, FN, TN, FP stand for the True Positive, False Negative, True Negative and False Positive respectively, while the subscript, i, indexes the two gold standard datasets.
  • Shifted along (the )ROC Model: In this model, we assume a bona fide shift in the characteristics of the assay when tested in the gold standard tests. This variation in characteristics, is best conceptualized as a shift along the straight line in the binormal, normal deviate plot . This plot transforms the standard ROC plots of sensitivity v.s. 1-specificity so that the familiar ROC bent curves look like straight lines. Operationally, we assume that there will be run specific variation in the assay characteristics when it is applied to future samples. The performance of the assay in the two gold standard sets provide a measure of this variability, which may resurface in subsequent use of the assay. When analyzing the data from the Santa Clara survey, we will acknowledge this variability by assuming that the assay performance (sensitivity/specificity) in this third application (red) interpolates the performance noted in the two gold standards (black).

The shift-along the ROC model is specified by fitting separate specificities and sensitivities for the two gold standard datasets, which are then transformed into the binormal plot scale , averaged and then transformed back to normal probability scale to calculate an average sensitivity and specificity. For computational reasons, we parameterize the models for the sensitivity and specificity in logit space. This parameterization also allows us to explore the sensitivity of the model to the choice of the model priors for the Bayesian analysis.

Model for the observed positive rates

The model for the observed positive counts is a binomial law , relating the positive tests in the Santa Clara sample to the total number of tests:
y \sim binomial(N,Sensitivity*Prevalence+(1-Specificity)*(1-Prevalence)

Priors
For the common test characteristic model, we will assume standard, uniform, non-informative priors for the Sensitivity, Specificity and Prevalence. The model may be fit with the following STAN program:

functions {
}
data {

  int casesSantaClara;   // Cases in Santa Clara
  int NSantaClara;      // ppl sampled in Santa Clara
  
  int TPStanford;       // true positive in Stanford
  int NPosStanford;     // Total number of positive cases in Stanford dataset
  
  int TNStanford;       // true negative in Stanford
  int NNegStanford;     // Total number of negative cases in Stanford dataset
  
  int TPManufacturer;       // true positive in Manufacturer dataset
  int NPosManufacturer;     // Total number of positive cases in Manufacturer dataset
  
  int TNManufacturer;       // true negative in Manufacturer dataset
  int NNegManufacturer;     // Total number of negative cases in Manufacturer dataset

}
transformed data {

}
parameters {
  real<lower=0.0,upper=1.0> Sens;
  real<lower=0.0,upper=1.0> Spec;
  real<lower=0.0,upper=1.0> Prev;

}
transformed parameters {


}
model {

  casesSantaClara~binomial(NSantaClara,Prev*Sens+(1-Prev)*(1-Spec));
  TPStanford~binomial(NPosStanford,Sens);
  TPManufacturer~binomial(NPosManufacturer,Sens);
  TNStanford~binomial(NNegStanford,Spec);
  TNManufacturer~binomial(NNegManufacturer,Spec);

  Prev~uniform(0.0,1.0);
  Spec~uniform(0.0,1.0);
  Spec~uniform(0.0,1.0);
}
generated quantities {
  real totposrate;
  real rateratio;
  totposrate = (Prev*Sens+(1-Prev)*(1-Spec));
  rateratio = Prev/totposrate;

For the shifted-along-ROC model, we explored two different settings of noninformative priors for the logit of sensitivity and specificity:

  • a logistic(0,1) prior for the logit of the sensitivities and specificities that is mathematically equivalent to a uniform prior for the sensitivity and specificity in the probability scale.
  • a normal(0,1.6) prior for the logits. While these two densities are very similar, the normal (black) has thinner tails, shrinking the probability estimates more than the logistic one (red) .
Comparison of logistic(0,1) vs normal (0,1.6) prior densities

The shifted-ROC model is fit by the following STAN code (only the normal prior model is inlined below)

functions {
}
data {

int casesSantaClara;   // Cases in Santa Clara
int NSantaClara;      // ppl sampled in Santa Clara

int TPStanford;       // true positive in Stanford
int NPosStanford;     // Total number of positive cases in Stanford dataset

int TNStanford;       // true negative in Stanford
int NNegStanford;     // Total number of negative cases in Stanford dataset

int TPManufacturer;       // true positive in Manufacturer dataset
int NPosManufacturer;     // Total number of positive cases in Manufacturer dataset

int TNManufacturer;       // true negative in Manufacturer dataset
int NNegManufacturer;     // Total number of negative cases in Manufacturer dataset

}
transformed data {

}
parameters {
  real<lower=0.0,upper=1.0> Prev;
  real logitSens[2];
  real logitSpec[2];

}
transformed parameters {
  real<lower=0.0,upper=1.0> Sens;
  real<lower=0.0,upper=1.0> Spec;

  // Use the sensitivity and specificity by the mid point between the two kits 
  // mid point is found in inv_Phi space of the graph relating Sensitivity to 1-Specificity
  // note that we break down the calculations involved in going from logit space to inv_Phi
  // space in the generated quantities sessions for clarity

  Sens=Phi(0.5*(inv_Phi(inv_logit(logitSens[1]))+inv_Phi(inv_logit(logitSens[2]))));
  Spec=Phi(0.5*(inv_Phi(inv_logit(logitSpec[1]))+inv_Phi(inv_logit(logitSpec[2]))));
}
model {

  casesSantaClara~binomial(NSantaClara,Prev*Sens+(1-Prev)*(1-Spec));
  TPStanford~binomial_logit(NPosStanford,logitSens[1]);
  TPManufacturer~binomial_logit(NPosManufacturer,logitSens[2]);

  TNStanford~binomial_logit(NNegStanford,logitSpec[1]);
  TNManufacturer~binomial_logit(NNegManufacturer,logitSpec[2]);

  logitSens~logistic(0.0,1.0);  
  logitSpec~logistic(0.0,1.0); 
  

  Prev~uniform(0.0,1.0);
}
generated quantities {
// various post sampling estimates
  real totposrate;        // Total positive rate
  real kitSens[2];        // Sensitivity of the kits   
  real kitSpec[2];        // Spec of the kits
  real rateratio;         // ratio of prevalence to total positive rate

totposrate = (Prev*Sens+(1-Prev)*(1-Spec));
kitSens=inv_logit(logitSens);
kitSpec=inv_logit(logitSpec);
rateratio = Prev/totposrate;
}

Estimation of the Prevalence of COVID19 Seropositivity in Santa Clara County

Preprint author analyses and interpretation

Table 2 of the preprint summarizes the estimates of the prevalence of seropositivity in Santa Clara. This table is reproduced below for ease of reference

ApproachPoint Estimate95% Confidence Interval
Unadjusted1.501.11 – 1.97
Population-adjusted2.812.24-4.16
Population and test performance adjustedManufacturer’s data2.491.80 – 3.17
Stanford data4.162.58 – 5.70
Manufacturer + Stanford Data2.752.01 – 3.49
Estimated Prevalence of COVID19 Seropositivity Bendavid et al

The unadjusted estimate is computed on the basis of the observed positive counts (50/3,330 tests) and the population adjusted estimate is obtained by accounting for the non-representativeness of the sample using post-stratification weights. Finally, the last 3 rows give the estimates of the prevalence while accounting for the test characteristics. The basis of the comment that the prevalence of COVID19 infection is 50-85 fold higher is based on the estimated point prevalence of 2.49 and 4.16 respectively. Computation of these estimates, require the sensitivity and specificity for the immunoassay which the authors quote as:

Our estimates of sensitivity based on the manufacturer’s and locally tested data were 91.8% (using the lower estimate based on IgM, 95 CI 83.8-96.6%) and 67.6% (95 CI 50.2-82.0%), respectively. Similarly, our estimates of specificity are 99.5% (95 CI 98.1-99.9%) and 100% (95 CI 90.5-100%). A combination of both data sources provides us with a combined sensitivity of 80.3% (95 CI 72.1-87.0%) and a specificity of 99.5% (95 CI 98.3-99.9%).

The results obtained are crucially dependent upon the values of the sensitivity and specificity, something that the authors explicitly state in the discussion:

For example, if new estimates indicate test specificity to be less than 97.9%, our SARS-CoV-2 prevalence estimate would change from 2.8% toless than 1%, and the lower uncertainty bound of our estimate would include zero. On the other hand, lower sensitivity, which has been raised as a concern with point-of-care test kits, would imply that the population prevalence would be even higher. New information on test kit performance and population should be incorporated as more testing is done and we plan to revise our estimates accordingly.

While the authors claim they have accounted for these test characteristics, the method of adjustment is not spelled out in detail in the main text. Only by looking into the statistical appendix do we find the following disclaimer about the approximate formula used:

There is one important caveat to this formula: it only holds as long as (one minus) the specificity of the test is higher than the sample prevalence. If it is lower, all the observed positives in the sample could be due to false-positive test results, and we cannot exclude zero prevalence as a possibility. As long as the specificity is high relative to the sample prevalence, this expression allows us to recover population prevalence from sample prevalence, despite using a noisy test.

In other words, the authors are making use of an approximation, that relies on a assumption about the study results that may or may not be true. But if one is using such an approximation, then one has already decided what the results and the test characteristics should look like. We will not make such assumption in our own calculations.

Bayesian Analysis

The Bayesian analyses we report here roughly corresponds to a combination that Bendavid et al did not consider, namely an analysis using only the test characteristics, but without the post-stratification weights. Even though a Bayesian analysis that used such weights is possible, the authors do not report the number of positives and negatives in each strata, along with the relevant weights. If this information had been made available, then we could repeat the same calculations and see what we obtain. While, I can’t reproduce the exact analysis examination of Table 2 in Bendavid et al suggests that weighting should increase the prevalence by a “fudge” factor. For the purpose of this section we take the fudge factor to be equal to the ratio of the prevalence after population adjustment over the unadjusted prevalence: 2.81/1.50 ~ 1.87 .
For fitting the Bayesian analyses, I used the No U Turn Sampler implemented in STAN; Isimulated five chains, using 5,000 warm-up iterations and 1,000 post-warm samples to compute summaries. Rhat for all parameters was <1.01 for all models (R and STAN code for all analyses are provided here to ensure reproducibility).


Sensitivities and specificities for he three Bayesian analyses: common-parameters, shifted-along-ROC (logistic prior), shifted-along-ROC (normal prior) are shown below:


SensitivityModelMeanMedian95% Credible Interval
Common Parameters0.8360.8380.768 – 0.897
Shifted-along-ROC (logistic)0.8110.8130.733 – 0.882
Shifted-along-ROC (normal)0.8100.8130.734 – 0.878
Specificity
Common Parameters0.9930.9940.986 – 0.998
Shifted-along-ROC (logistic)0.9910.9910.983 – 0.998
Shifted-along-ROC (normal)0.9890.9880.982 – 0.995
Bayesian Analysis of Assay Characteristics

The three models seem to yield similar, yet not identical estimates of the assay characteristic and one may even expect that the estimated prevalences would be somewhat similar to the analyses by the preprint authors.
The estimated prevalence by the three approaches (without application of the fudge factor) are shown below:

Model MeanMedian95% Credible Interval
Common Parameters1.031.050.16-1.88
Shifted-along-ROC (logistic)0.810.810.50 – 1.84
Shifted-along-ROC (normal)0.560.500.02 – 1.46
Model Estimated Prevalence (%)

Application of the fudge factors would increase the point estimates to 1.93%, 1.52% and 1.05% which more than 50% smaller than the prevalence estimates computed by Bendavid et al. To conclude our Bayesian analyses, we compared the marginal likelihoods of the three models by means of the bridge sampler. This computation allows us to roughly see which of the three model/prior combinations is supported by the data. This analysis provided overwhelming support for the Shifted-along-ROC model with the normal (shrinkage) prior:

ModelPosterior Probability
Common Parameters0.032
Shifted-along-ROC (logistic)0.030
Shifted-along-ROC (normal)0.938

Conclusions

In this rather long post, I undertook a Bayesian analysis of the seroprevalence survey in Santa Clara county. Contrary to the author’s assertions, this formal Bayesian analysis suggests a much lower seroprevalence of COVID19 infections. The estimated fold increase is only 20 times higher (point estimate), rather than 50-80 fold higher and with a 95% credible interval 0.75 -55 fold for the most probable model.

The major driver of the prevalence, is the specificity of the assay (not the sensitivity), so that particular choices for this parameter will have a huge impact on the estimated prevalence of COVID19 infection based on seroprevalence data. Whereas the common parameter and the shifted-along-ROC models yield similar estimates for the same prior (the uniform in probability scale is equivalent to standard logistic in the logit scale), a minor change to the priors used by the shifted-along-ROC model, leads to results that are qualitatively and quantitatively different. The sensitivitity of the estimated prevalence to the prior, suggests that the assay performance data are not sufficient to determine either the sensitivity or the specificity with precision. Even the minor shrinkage implied by changing the prior from the standard logistic to the standard normal , provides a small, yet crucial protection against overfitting and leads to a model with extremely sharp marginal posterior likelihood.

Are the Bayesian analysis results reasonable? I still can’t trust the quality and the biases inherent in the source data, but at least the calculations were done lege artis . Interestingly enough, the estimated prevalence of COVID19 infection (1-2%) by the Bayesian methodology described here, is rather similar to the prevalence (<2%) reported in Telluride, Colorado.

The analysis reported herein could be made a lot more comparable to the analyses reported by Bendavid et al, if the authors had provided the strata data (positive and negative test results and sample weights). In the absence of this information, the use of fudge factors is the best equivalent approximation to the analyses reported in the preprint.

Is the Santa Clara study useful for our understanding of the COVID19 epidemic?

I feel that the seroprevalence study in Santa Clara is not the “bombshell” study that I initially felt it would be (so apologies to all my Twitter followers whom I may have misled). There are many methodological issues starting with the study design and extending to the author’s analyses of test performance. Having spent hours troubleshooting the code (and writing the blog post!), my feelings align perfectly with Andrew Gelman’s:

I think the authors of the above-linked paper owe us all an apology. We wasted time and effort discussing this paper whose main selling point was some numbers that were essentially the product of a statistical error.

I’m serious about the apology. Everyone makes mistakes. I don’t think they authors need to apologize just because they screwed up. I think they need to apologize because these were avoidable screw-ups. They’re the kind of screw-ups that happen if you want to leap out with an exciting finding and you don’t look too carefully at what you might have done wrong.

COVID19 is a serious disease, which is far worse than the flu in terms of contagiousness, severity and lethality of the severe forms and level of medical intensive care support needed to avert a lethal outcome. Knowing the prevalence will provide one of the many missing pieces of the puzzle we are asked to solve. This piece of information will allow us to mount an effective public health response and reopen our society and economy without risking countless lives.

Disclaimer: I provided the data for my analyses with the hope that someone can correct if I am wrong and corroborate me if I am right. Open Science is the best defense against Bad science and making the code available is a key step in the process.

Sequential Fitting Strategies For Models of short RNA Sequencing Data

June 18, 2017

After a (really long!) hiatus I am reactivating my statistical blog. The first article  concerns the clarification of a point made in the manual of our recently published statistical model for short RNA sequencing data.
The background for this post, in case one wants to skip reading the manuscript (please do read it !), centers around the limitations of existing methods for the analysis of data for this very promising class of biomarkers. To overcome these limitations our group comprised from investigators from Division of Nephrology, University of New Mexico and the Galas Lab at Pacific Northwest Research Institute introduced a novel method for the analysis of short RNA sequencing (sRNAseq) data. This method (RNAseqGAMLSS), which was derived from first principles modeling of the short RNAseq process, was shown to have a number of desirable properties in an analysis of nearly 200 public and internal datasets:

  1. It can quantify the intrinsic, sequencing specific bias of sRNAseq from calibration, synthetic equimolar mixes of the target short RNAs (e.g. microRNAs)
  2.  It can use such estimates to correct for the bias present in experimental samples of different composition and input than the calibration runs. This in turns opens the way for the use of short RNAseq measurements in personalized medicine applications (as explained here)
  3. Adapted to the problem of differential expression analysis, our method exhibited  greater accuracy, higher sensitivity and specificity than six existing algorithms (DESeq2, edgeR, EBSeq, limma, DSS, voom) for the analysis of short RNA-seq data.
  4. In contrast to these popular methods which force the expression profiles to have a certain form of symmetry (equal number and magnitude of over-expressed and under-expressed sequences), our method can be used to discover global, directional changes in expression profiles which are missed by the aforementioned methods. Accounting for such a possibility may be appropriate in certain instances, in which the disease process leads to loss or gain in the number of cells of origin of the affected organ/tissue.

The proposed methodology which is based on Generalized Additive Models for Location, Scale and Shape (GAMLSS) involves the fitting of simultaneous regression models for the location (mean) and the scale (dispersion) of sequence counts using either the Negative Binomial or a particular parameterization of the Normal distribution. However there is price to pay for the advantages of RNASeqGAMLSS over alternatives: this comes in the form of a small (but not infinitesimal) probability[1] that the fitting algorithm will execute successfully. In the manual of our method (Section 6.4) we explain that a numerically more stable way of fitting these complex models exists and should be adapted if one encounters numerical optimization errors with the default approach used in the Nucleic Acids Research (NAR) manuscript. The three steps of this sequential fitting strategy are as follows:

  1. One fits a Poisson count mixed model to the RNAseq data, to obtain estimates of the relative mean expression for the different short RNA species in the expression profile
  2. These estimates are used to fix the values of the mean parameter model of the RNASeqGAMLSS model while estimating the values of the dispersion parameters.
  3. Finally, one uses the values of the mean (Step 1) and dispersion (Step 2) parameters to fit the general RNASeqGAMLSS model

In essence one ignores the overdispersion (additional variability) of the short RNAseq data (Step 1) to guide the algorithm into estimates of the dispersion parameters (Step 2). Finally one uses the separate estimates of the mean (Step 1) and dispersion (Step 2) parameters as an initial point for the simultaneous estimation of both (Step 3). The reason that this approach works is because the dispersion parameters do not affect the mean parameters, so that the Poisson distribution of Step 1 has the same mean as the RNASeqGAMLSS model. Hence the estimates produced by this Step are identical (to the limit of numerical precision) to those that would have been produced by a successful execution of the RNASeqGAMLSS optimization algorithms. Fixing these values when fitting the RNASeqGAMLSS model in Step 2 facilitates estimation of the dispersion parameters. Having very good initial guesses for these parameters virtually guarantees convergence of the 3rd Step (which is the only step in the NAR paper).

A fully worked out example is shown below (note that the data used in the NAR paper, source code, manual that includes instructions to compile the source code of the RNASeqGAMLSS and Win64 DLL libraries are all available in the BitBucket repository for this project)

First, we load the data, the C++ libraries and extract the data to the two groups we would like to compare :

library(TMB) ## the TMB framework for optimizati
library(lme4)
## load the DE C++ libraries
dyn.load(dynlib("LQNO_DE"))
## Note about the form of data storage for use with this software
##================================================================================
## the long format should be employed when storing microRNA data for
## GAMLSS type of analyses: in this format, the data are arranged in columns:
## - column miRNA : yields the name of the miRNA
## - column reads : reads of the miRNA from a *single sample*
## - column SampleID: the sample the reads were recorded at
## this is called long format, because data from experiments are stored in
## different rows (rather than appearing as consecutive columns)
##================================================================================
## lads the data from the 286 series
load("286.RData")
## Obtain data for GAMLSS - we will compare the two ratiometric series
datRat<-subset(dat286.long,(Series=="RatioB" | Series =="RatioA") & Amount=="100 fmoles")
datRat$SampleID<-factor(datRat$SampleID)
datRat$Series<-factor(datRat$Series)

## STEP 0: PREPARE THE DATA FOR THE RNASeqGAMLSS FIT
u_X<-as.numeric(factor(datRat$miRNA)) ## maps readings to the identity of the miRNA
u_G<-as.numeric(factor(datRat$Series)) ## maps counts to group
y=datRat$reads ## extract the actual counts
X<-model.matrix(~Series,data=datRat) ## design matrix (ANOVA) for group comparisons

Secondly, we fit the Poisson model (Step 1), using the facilities of the lme4 R package:

## STEP 1: USE A POISSON MODEL TO OBTAIN ESTIMATES FOR THE MU SUBMODEL
##==========================================================================================
## fit the parameters for the mu submodel using the poisson GLM
gl<-glmer(reads~Series+(0+Series|miRNA),data=datRat,family="poisson")

Then we extract the values of these parameters and used them to fix the values of the mean submodel (Step 2):

## STEP 2: USE THE MU MODEL ESTIMATES TO FIT THE PHI SUBMODEL
##==========================================================================================
## initializes standard deviation of RE for the mu submodel
sigmu=sqrt(diag((summary(gl)[["varcor"]])[[1]]))
sigsig=rep(1,max(u_G)) ## initializes standard deviation of RE for the phi submodel
b=fixef(gl) ## initial values for the overall group means (mean submodel)
## initial values for the variation of miRNAs around their group mean (mean submodel)
u_m=as.matrix(ranef(gl)$miRNA)
## Very rough initial values for the phi submodel parameters
s_b=rep(0,ncol(X)) ## initial values for the overall group means (phi submodel)
## initial values for the variation of miRNAs around their group mean (phi submodel)
u_s= matrix(0,max(u_X),max(u_G))
## MAP list that allow us to fix some parameters to their values
MAP<-NULL
MAP[["b"]]<-factor(rep(NA,length(b)))
MAP[["u_m"]]<-factor(rep(NA,length(c(u_m))))
MAP[["sigmu"]]<-factor(rep(NA,length(sigmu)))
## construct the AD object - note that we fix the mu at their values while estimating the
## phi submodel
obj.TMB<-MakeADFun(data=list(y=y,X=X,u_X=u_X,u_G=u_G),
parameters=list(b=b,s_b=s_b,u_m=u_m,u_s=u_s,
sigmu=sigmu,sigsig=sigsig),
DLL="LQNO_DE",random=c("u_s"),hessian=FALSE,silent=TRUE,
method="BFGS",random.start=expression(last.par.best[random]),
ADReport=TRUE,map=MAP)
## parameter estimation - note errors may be generated during some iterations
f.TMB<-nlminb(obj.TMB$par,obj.TMB$fn,obj.TMB$gr,
control=list(eval.max=10000,iter.max=10000),lower=-30,upper=30)
## obtain the report on the parameters to extract the fitted values of the gamlss model
rep<-sdreport(obj.TMB)
u_s = matrix(summary(rep,"random",p.value=FALSE)[,1],ncol=max(u_G))
dummy<-summary(rep,"fixed",p.value=FALSE)[,1] ## parameter estimates
s_b=dummy[1:max(u_G)]
sigsig=dummy[-(1:max(u_G))]

Finally, we refit the model letting all parameters vary:


## STEP 3: REFIT THE MODEL WITHOUT FIXING ANY PARAMETERS
##==========================================================================================
obj.TMB<-MakeADFun(data=list(y=y,X=X,u_X=u_X,u_G=u_G),
parameters=list(b=b,s_b=s_b,u_m=u_m,u_s=u_s,
sigmu=sigmu,sigsig=sigsig),
DLL="LQNO_DE",random=c("u_m","u_s"),hessian=TRUE,silent=TRUE,
method="BFGS",random.start=expression(last.par.best[random]),
ADReport=TRUE)
## scale objective by the magnitude of the deviance of the fitted Poisson model
f.TMB<-nlminb(obj.TMB$par,obj.TMB$fn,obj.TMB$gr,
control=list(eval.max=10000,iter.max=10000,scale=deviance(gl)),lower=-30,upper=30)
## obtain the report on the parameters
rep<-sdreport(obj.TMB)
## differential expression ratios, standard errors z and p-values
gamlssAD<-summary(rep,"report",p.value=TRUE)[1:nlevels(datRat$miRNA),]
rownames(gamlssAD)<-levels(datRat$miRNA)
## rownames are the miRNAs; columns are estimates, standard error, z value and pvalue

## the final estimates with their standard errors
head(gamlssAD)

These steps (and the RNASeqGAMLSS code) is going to be incorporated in an upcoming Bioconductor package for the analysis of short RNA sequencing data by Dr Lorena Pantano PhD. Until this package becomes available, the aforementioned code snippets may adapted very easily to one’s application by suitable adaptations of the code (i.e. the names of the columns corresponding to the RNA species, sample identifiers and experimental groups).

1. This is particularly likely when the underlying software implementations are not compiled against the Intel®Math Kernel Libraries.

Why should I be Bayesian when my model is wrong?

May 9, 2017

http://wp.me/plc6Z-8Gn

Excellent discussion and some nice references about Bayesian modelling in the context of model misspecification 

My 2015 blogging report

December 29, 2015

https://statmd.wordpress.com/2015/annual-report/

Estimating the mean and standard deviation from the median and the range

December 3, 2015

While preparing the data for a meta-analysis, I run into the problem that a few of my sources did not report the outcome of interest as means and standard deviations, but rather as medians and range of values. After looking around, I found this interesting paper which derived (and validated through simple simulations), simple formulas that can be used to convert the median/range into a mean and a variance in a distribution free fashion.  With

  • a = min of the data
  • b = max of the data
  • m = median
  • n = size of the sample

the formulas are as follows:

Mean  \bar{m} = \frac{a+2 m+b}{4} +\frac{a-2 m+b}{4 n}

Variance  \frac{1}{n-1} \Big(a^2+m^2+b^2+\frac{n-3}{2} \frac{(a+m)^2+(b+m)^2}{4}-n \bar{m} \Big)

 

The following R function will carry out these calculations

f<-function(a,m,b,n)
{
mn<-(a+2*m+b)/4+(a-2*m+b)/(4*n)
s=sqrt((a*a+m*m+b*b+(n-3)*((a+m)^2+(m+b)^2)/8-n*mn*mn)/(n-1))
c(mn,s)
}

Edit

Surfing around arxiv, I found another paper that handles additional scenarios and proposes alternative formulas

Summing up my criticism against Propublica’s Scorecard

August 26, 2015

The bottom-line: The major justification for my substantiated (or at least so I think) rant against the use of shrinkage in this case is the elephant in the room that no-one wants to talk about: the substantially limited information (avg number of complications per individual < 50) for otherwise infrequent events (avg complication rate <5%). In such a situation, shrinkage will make everyone look like everyone else, limiting the ability to draw meaningful complications by looking at the values of the random effects. In fact, PP had to rely on post hock classifications of the good, the bad and the ugly to overcome the unavoidable shrinkage towards the mean and overcome the lack of information that could distinguish one surgeon from another. Similar points (with more examples from the actual score card) were made by Ewen Harrison.

(You may stop reading now – or you may read past the rant in the next paragraph)

Rant Mode On: Another criticism that I received, is that I fail to understand random effects modeling, which I personally thought it was funny because one of my recent papers actually says one should scrap the Cox survival model for generalized additive models (which are just generalized linear mixed models in disguise).  In any case, assuming that my understanding of mixed models is poor, maybe my conclusion that these models are to be preferred for such applications is also problematic?

Since one may consider my vote-of-confidence-to-the-almighty-mix-model v.s. my criticism of their application in the Scorecard project, subtle evidence for paranoid schizophrenia let me sum up how one can simultaneously hold these beliefs:

Even though mixed models are to be preferred especially as more data accumulate(a point I make clear in the 3 blog posts I wrote), no modeling could overcome the severe lack of events in the scorecard database.

There are other technical issues that one could “rant” about e.g. how were the hospital and surgeon effects were combined, but I will not digress any further.

Rant mode off

Back to criticism: In my opinion a major reason that mixed models were used is the potentially large number of surgeons with zero complications in their limited observation records*. The use of shrinkage models allow one to report performance for such surgeons, and generate nice colorful graphs (with one and possibly two decimal points) about their performance instead of reporting a NaN (Not A Number). Incidentally, the shrinkage of the individual surgeon effects all the way towards the population mean is the mixed model’s way of telling you the same thing: since it could not estimate how these people actually performed, it might as well give back the only answer consistent with the model applied, i.e. everybody is (almost) like everybody else and thus everyone is average.

*Curiously this information, i.e. the number of surgeons without any complication in the source database is not provided, although I’d have thought it would be important to report this number in the shake of transparency.

Word of caution: For surgeons without any complications in the database, the actual information comes not from their complications but from the number of uncomplicated surgeries they performed. This raises the question of inclusion effects (a point Andrew Gelman thoroughly address in his book) and the associated selection biases inherent in comparing surgeons who accept v.s. those who do not accept Medicare (or accept too little of it) and the corresponding social determinants of health outcomes (beautifully explained here).

A non-statistical criticism: As a nephrologist, I find it amusingly insulting that Acute Kidney Injury (coded as Acute Renal Failure) was not considered a surgical complication, even though it certainly complicates surgeries. Surgery is in fact the leading cause of “hospital-acquired” AKI (40% of cases are preceded by a surgical procedure). But maybe, as Benjamin Davies pointed out, the real end-points are not really measured by the Scorecard. Or I could be just delusional when I point out to surgical colleagues that the selection of analgesia, fluid management and pre as well as post perioperative care DO play a role for some complications. Or it could be just the anesthesiologists’s fault 🙂

 

The little mixed model that could, but shouldn’t be used to score surgical performance

July 30, 2015

The Surgeon Scorecard

Two weeks ago, the world of medical journalism was rocked by the public release of ProPublica’s Surgeon Scorecard. In this project ProPublica “calculated death and complication rates for surgeons performing one of eight elective procedures in Medicare, carefully adjusting for differences in patient health, age and hospital quality.”  By making the dataset available through a user friendly portal, the intended use of this public resource was to “use this database to know more about a surgeon before your operation“.

The Scorecard was met with great enthusiasm and coverage by non-medical media. TODAY.com headline “nutshelled” the Scorecard as a resource that “aims to help you find doctors with lowest complication rates“. A (?tongue in cheek) NBC headline tells us the scorecard “It’s complicated“. On the other hand the project was not well received by my medical colleagues. John Mandrola gave it a failing grade in Medscape. Writing at KevinMD.com, Jeffrey Parks called it a journalistic low point for ProPublica. Jha Shaurabh pointed out a potential paradox  in a statistically informed, easy to read and highly entertaining piece. In this paradox, the surgeon with the higher complication case who takes high risk patients from a disadvantaged socio-economic environment, may actually be the surgeon one wants to perform one’s surgery! Ed Schloss summarized the criticism (in the blogosphere and twitter) in an open letter and asked for peer review of the Scorecard methodology.

The criticism to date has largely focused on the potential for selection effects (as the Scorecard is based on Medicare data, and does not include data from private insurers), the incomplete adjustment for confounders, the paucity of data for individual surgeons, the counting of complications and re-admission rates, decisions about risk category classification boundaries and even data errors (ProPublica’s response arguing that the Scorecard matters may be found here). With a few exceptions (e.g. see Schloss’s blogpost in which the complexity of the statistical approach is mentioned) the criticism of the statistical approach (including my own comments in twitter) has largely focused on these issues.

On the other hand, the underlying statistical methodology (here and there) that powers the Scorecard has not received much attention. Therefore I undertook a series of simulation experiments to explore the impact of the statistical procedures on the inferences afforded by the Scorecard.

The mixed model that could – a short non-technical summary of ProPublica’s approah

ProPublica’s approach to the scorecard is based on logistic regression model, in which individual surgeon (and hospital) performance (probability of suffering a complication) is modelled using Gaussian random effects, while patient level characteristics that may act as confounders are adjusted for, using fixed effects. In a nutshell this approach implies fitting a model of the average complication rate that is function of the fixed effects (e.g. patient age) for the entire universe of surgeries  performed in the USA. Individual surgeon and hospital factors modify this complication rate, so that a given surgeon and hospital will have an individual  rate that varies around the population average. These individual surgeon and hospital factors are constrained to follow Gaussian, bell-shaped distribution when analyzing complication data. After model fitting, these predicted random effects are used to quantify and compare surgical performance. A feature of mixed modeling approaches is the unavoidable shrinkage of the raw complication rate towards the population mean. Shrinkage implies that the dynamic range of the  actually observed complication rates is compressed. This is readily appreciated in the figures generated by the ProPublica analytical team:

Capture

In their methodological white paper the ProPublica team notes:

While raw rates ranged from 0.0 to 29%, the Adjusted Complication Rate goes from 1.1 to 5.7%. …. shrinkage is largely a result of modeling in the first place, not due to adjusting for case mix. This shrinkage is another piece of the measured approach we are taking: we are taking care not to unfairly characterize surgeons and hospitals.”

These features should alert us that something is going on. For if a model can distort the data to such a large extent, then the model should be closely scrutinized before being accepted. In fact, given these observations,  it is possible that one mistakes the noise from the model for the information hidden in the empirical data. Or, even more likely, that one is not using the model in the most productive manner.

Note that these comments  should not be interpreted as a criticism against the use of mixed models in general, or even for the particular aspects of the Scorecard project. They are rather a call for re-examining the modeling assumptions and for gaining a better understanding of the model “mechanics of prediction” before releasing the Kraken to the world.

The little mixed model that shouldn’t

There are many technical aspects one could potentially misfire in a Generalized Linear Mixed Model for complication rates. Getting the wrong shape of the random effects distribution is of may or may not be of concern (e.g. assuming it is bell shaped when it is not). Getting the underlying model wrong, e.g. assuming the binomial model for complication rates while a model with many more zeros (a zero inflated model) may be more appropriate, is yet another potential problem area. However, even if  these factors are not operational, then one may still be misled when using the results of the model. In particular, the major area of concern for such models is the cluster size: the number of observations per individual random effect (e.g. surgeon) in the dataset. It is this factor, rather than the actual size of the dataset that determines the precision in the individual random affects. Using a toy example, we show that the number of observations per surgeon typical of the Scorecard dataset, leads to predicted  random effects that may be far from their true value. This seems to stem from the non-linear nature of the logistic regression model. As we conclude in our first technical post:

  • Random Effect modeling of binomial outcomes require a substantial number of observations per individual (in the order of thousands) for the procedure to yield estimates of individual effects that are numerically indistinguishable  from the true values.

 

Contrast this conclusion to the cluster size in the actual scorecard:

Procedure Code N (procedures) N(surgeons) Procedures per surgeon
51.23 201,351 21,479 9.37
60.5 78,763 5,093 15.46
60.29 73,752 7,898 9.34
81.02 52,972 5,624 9.42
81.07 106,689 6,214 17.17
81.08 102,716 6,136 16.74
81.51 494,576 13,414 36.87
81.54 1,190,631 18,029 66.04
Total 2,301,450 83,887 27.44

In a follow up simulation study we demonstrate that this feature results in predicted individual effects that are non-uniformly shrank towards their average value. This compromises the ability of mixed model predictions to separate the good from the bad “apples”.

In the second technical post, we undertook a simulation study to understand the implications of over-shrinkage for the Scorecard project. These are best understood by a numerical example from one of simulated datasets. To understand this example one should note that the individual random effects have the interpretation of (log-) odds ratios. Hence, the difference in these random effects when exponentiated yield the odds ratio of suffering a complication in the hands of a good relative to a bad surgeon. By comparing these random effects for good and bad surgeons who are equally bad (or good) relative to the mean (symmetric quantiles around the median), one can get an idea of the impact of using the predicted random effects to carry out individual comparisons.

Good Bad Quantile (Good) Quantile (Bad) True OR Pred OR Shrinkage Factor
-0.050 0.050 48.0 52.0 0.905 0.959 1.06
-0.100 0.100 46.0 54.0 0.819 0.920 1.12
-0.150 0.150 44.0 56.0 0.741 0.883 1.19
-0.200 0.200 42.1 57.9 0.670 0.847 1.26
-0.250 0.250 40.1 59.9 0.607 0.813 1.34
-0.300 0.300 38.2 61.8 0.549 0.780 1.42
-0.350 0.350 36.3 63.7 0.497 0.749 1.51
-0.400 0.400 34.5 65.5 0.449 0.719 1.60
-0.450 0.450 32.6 67.4 0.407 0.690 1.70
-0.500 0.500 30.9 69.1 0.368 0.662 1.80
-0.550 0.550 29.1 70.9 0.333 0.635 1.91
-0.600 0.600 27.4 72.6 0.301 0.609 2.02
-0.650 0.650 25.8 74.2 0.273 0.583 2.14
-0.700 0.700 24.2 75.8 0.247 0.558 2.26
-0.750 0.750 22.7 77.3 0.223 0.534 2.39
-0.800 0.800 21.2 78.8 0.202 0.511 2.53
-0.850 0.850 19.8 80.2 0.183 0.489 2.68
-0.900 0.900 18.4 81.6 0.165 0.467 2.83
-0.950 0.950 17.1 82.9 0.150 0.447 2.99
-1.000 1.000 15.9 84.1 0.135 0.427 3.15
-1.050 1.050 14.7 85.3 0.122 0.408 3.33
-1.100 1.100 13.6 86.4 0.111 0.390 3.52
-1.150 1.150 12.5 87.5 0.100 0.372 3.71
-1.200 1.200 11.5 88.5 0.091 0.356 3.92
-1.250 1.250 10.6 89.4 0.082 0.340 4.14
-1.300 1.300 9.7 90.3 0.074 0.325 4.37
-1.350 1.350 8.9 91.1 0.067 0.310 4.62
-1.400 1.400 8.1 91.9 0.061 0.297 4.88
-1.450 1.450 7.4 92.6 0.055 0.283 5.15
-1.500 1.500 6.7 93.3 0.050 0.271 5.44
-1.550 1.550 6.1 93.9 0.045 0.259 5.74
-1.600 1.600 5.5 94.5 0.041 0.247 6.07
-1.650 1.650 4.9 95.1 0.037 0.236 6.41
-1.700 1.700 4.5 95.5 0.033 0.226 6.77
-1.750 1.750 4.0 96.0 0.030 0.216 7.14
-1.800 1.800 3.6 96.4 0.027 0.206 7.55
-1.850 1.850 3.2 96.8 0.025 0.197 7.97
-1.900 1.900 2.9 97.1 0.022 0.188 8.42
-1.950 1.950 2.6 97.4 0.020 0.180 8.89
-2.000 2.000 2.3 97.7 0.018 0.172 9.39
-2.050 2.050 2.0 98.0 0.017 0.164 9.91

From this table it can be seen that predicted odds ratios are always larger than the true ones. The ratio of these odds ratios (the shrinkage factor) is larger, the more extreme comparisons are contemplated.

In summary, the use of the random effects models for the small cluster sizes (number of observations per surgeon) is likely to lead to estimates (or rather predictions) of individual effects that are smaller than their true values. Even though one should expect the differences to decrease with larger cluster sizes, this is unlikely to happen in real world datasets (how often does one come across a surgeon who has performed 1000s of operation of the same type before they retire?). Hence, the comparison of  surgeon performance based on these random effect predictions is likely to be misleading due to over-shrinkage.

Where to go from here?

ProPublica should be congratulated for taking up such an ambitious, and ultimately useful project. However, the limitations of the adopted approach should make one very skeptical about accepting the inferences from their modeling tool. In particular, the small number of observations per surgeon limits the utility of the predicted random effects to directly compare surgeons due to over-shrinkage. Further studies are required before one could use the results of mixed effects modeling for this application. Based on some limited simulation experiments (that we do not present here), it seems that relative rankings of surgeons may be robust measures of surgical performance, at least compared to the absolute rates used by the Scorecard. Adding my voice to that of Dr Schloss, I think it is time for an open and transparent dialogue (and possibly a “crowdsourced” research project) to better define the best measure of surgical performance given the limitations of the available data. Such a project could also explore other directions, e.g. the explicit handling of zero inflation and even go beyond the ubiquitous bell shaped curve. By making the R code available, I hope that someone (possibly ProPublica) who can access more powerful computational resources can perform more extensive simulations. These may better define other aspects of the modeling approach and suggest improvements in the scorecard methodology. In the meantime, it is probably a good idea not to exclusively rely on the numerical measures of the scorecard when picking up the surgeon who will perform your next surgery.

 

 

Shrinkage of predicted random effects in logistic regression models

July 30, 2015

As a follow-up of our initial investigation of the bias of random effect predictions in generalized linear mixed regression models, I undertook a limited simulation experiment. In this experiment, we varied the population average complication rate and the associated standard deviation of the individual random effect and generated a small number of replications (N=200) for each combination of population mean and standard deviation. Limited by the computational resources of (the otherwise wonderful!) tablet, I had to forego a more thorough and more precise (larger number of replicates) examination of bias, variance and coverage. The focus is on general patterns rather than precise estimates; by making the code public, these can be obtained by anyone who has access to the R computing environment.

The basic code that simulates  from these logistic regression models is shown below. Although we only examine small cluster sizes (number of observations per cluster 20-100) and a moderate number of individual random effects (200) it is straightforward to modify these aspects of the simulation study. For these simulations we examined a limited number of values for the standard deviation of the random effects (0.1, 0.2 and 0.5) and overall complication rates (0.05, 0.10, 0.15, 0.20) to reflect the potential range of values compatible with the raw data in the Surgeon Scorecard.

library(lme4)
library(mgcv)
## helper functions
logit<-function(x) log(x/(1-x))
invlogit<-function(x) exp(x)/(1+exp(x))

## code allows a sensitivity analysis to the order of the nAGQ
## explanation of symbols/functions appearing inside the
## simulation function
## fit: glmer fit using a X point nAGQ integration rule
## ran: random effects using a X point nAGQ integration rule
##   cR: correlation between fitted and simulated random effect
##       the fitted were obtained with a X point nAGQ integrator
## int: intercept obtained with a X point nAGQ integration rule
## biG: bias of the fitted v.s. the simulated random effects
##      this is obtained as the intercept of a GAM regression of
##      the fitted versus the simulated values
## bi0: bias of the intercept from the glmer fit
## sdR: estimated standard deviation of the random eff distro
## bs0: bias of the standard deviation of the random effects (SDR)
## edf: non-linearity of the slope of a GAM regression of the
##      fitted v.s. the simulated residuals
## slQ: derivative of the non-linear slope at the Qth quantile
simFit<-function(X,complications,ind,pall,SDR,Q=0:10/10) {
fit<-glmer(ev~1+(1|id),data=complications,family=binomial,nAGQ=X)
ran<-ranef(fit)[["id"]][,1] ## predicted (mode) of random effects
cR<-cor(ran,ind)
int<-fixef(fit)
## regress on the residuals to assess performance
fitg<-gam(ran~s(ind,bs="ts")) ## gam fit
edf<-sum(fitg$edf[-1])## edf of the non-linear "slope"
biG<-fitg$coef[1]
bi0<-int-logit(pall)
sdR<-sqrt(as.data.frame(VarCorr(fit))$vcov)
bs0<-sdR-SDR
sQ<-quantile(ind,prob=Q)
slQ<-(predict(fitg,newdata=data.frame(ind=sQ+0.0001),type="terms")[,1]-
predict(fitg,newdata=data.frame(ind=sQ),type="terms")[,1])/0.0001
names(slQ)<-paste("slQ",Q,sep="")
ret<-cbind(int=int,edf=edf,biG=biG,bi0=bi0,bs0=bs0,sdR=sdR,cR=cR,t(slQ),pall=pall,SDR=SDR)
}

## function to simulate cases
simcase<-function(N,p) rbinom(N,1,p)
## simulation scenarios: fixed for each simulation
Nsurgeon<-200; # number of surgeons
Nmin<-20; # min number of surgeries per surgeon
Nmax<-100; # max number of surgeries per surgeon
Nsim<-sample(Nmin:Nmax,Nsurgeon,replace=TRUE); # number of cases per surgeon

## simulate individual surgical performance
## the reality consists of different combos of pall and the sd of the
## random effect

Nscenariosim<-200 ## number of simulations per scenario
scenarios<-expand.grid(iter=1:Nscenariosim,pall=seq(0.05,.20,.05),
sd=c(0.1,0.2,.5))

## simulate indivindual scenarios
simScenario<-function(seed,pall,sd,X,Q)
{
set.seed(seed)
ind<-rnorm(Nsurgeon,0,sd) ; # surgical random effects
logitind<-logit(pall)+ind ; # convert to logits
pind<-invlogit(logitind); # convert to probabilities

complications<-data.frame(ev=do.call(c,mapply(simcase,Nsim,pind,SIMPLIFY=TRUE)),
id=factor(do.call(c,mapply(function(i,N) rep(i,N),1:Nsurgeon,Nsim))))
simFit(X,complications,ind,pall,sd,Q)
}
X<-0
Q=0:10/10

system.time(perf<-mapply(simScenario,scenarios$iter,scenarios$pall,scenarios$sd,
MoreArgs=list(X=X,Q=Q),SIMPLIFY=FALSE))

perf<-do.call(rbind,perf)

The datasets are generated by the function simScenario, which when applied over all combinations of population mean(pall, treated as a fixed effect) and random effect standard deviation (sd) generates the synthetic dataset (‘complications’ variable). Once the data have been generated the function simFit receives the synthetic dataset, fits the mixed logistic regression model and generates random effect predictions.

To assess and quantify shrinkage in these regression models, we compare the predicted random effects against the simulated values. Due to the potentially large number of these effects (10s of thousands in the Surgeon Score Card), we developed a methodology that looks for patterns in these bivariate relationships, taking into account the magnitude of each simulated and predicted random effect pair. This methodology rests on flexible regression between the predicted against the simulated random effect in each of the synthetic datasets. This flexible regression, using smoothing splines, generalizes linear regression and thus avoids the assumption that the shrinkage (as assessed by the slope of the linear regression line) is the same irrespective of the value of the simulated random effect. By numerically differentiating the spline at various quantiles of the simulated random effect, we thus have an estimate of shrinkage. In particular, this estimate gives the change in value of the predicted random effect for a unit change in the simulated one. The relevant code that fits the smoothing spline and differentiates it in (user defined) grid of random effects quantiles is shown below:

</pre>
<pre>fitg<-gam(ran~s(ind,bs="ts")) ## gam fit
edf<-sum(fitg$edf[-1])## edf of the non-linear "slope"
biG<-fitg$coef[1]
bi0<-int-logit(pall)
sdR<-sqrt(as.data.frame(VarCorr(fit))$vcov)
bs0<-sdR-SDR
sQ<-quantile(ind,prob=Q)
slQ<-(predict(fitg,newdata=data.frame(ind=sQ+0.0001),type="terms")[,1]-
predict(fitg,newdata=data.frame(ind=sQ),type="terms")[,1])/0.0001
names(slQ)<-paste("slQ",Q,sep="")

A representative smoothing spline analysis of a mixed logistic regression model is shown in the figure below, along with the corresponding linear fit, and the line signifying zero shrinkage: a straight line passing from the origin with a unit slope. If the latter relationship is observed, then the predicted random effects are unbiased with respect to the simulated ones.

 

Predicted (y-axis) v.s. simulated (x-axis) random effects in a logistic regression model. 2000 random effects were simulated with small cluster sizes (20-100). Shown are the flexible spline fits (green), linear regression fit (red) and the line of no shrinkage (blue)

This figure shows the potential for random effect predictions to be excessively shrunken relative to the simulated ones (note that the blue line is more horizontal relative to the blue one). One should also observe that the amount of shrinkage is not the same throughout the range of the random effects: positive values (corresponding to individuals with higher complication rates) are shrank relatively less (the green line is closer to the blue line), relative to negative random effects (complication rates less than the population average). The net effect of this non-linear relationship is that the separation between “good” (negative RE) and “bad” (positive values of the RE) is decreased in the predicted relative to the simulated random effects. This can be a rather drastic compression in dynamic range for small cluster sizes as we will see below. 

From each mixed logistic regression model we obtain a large number of information : i) the bias in the fixed effect estimate, ii) the bias in the population standard deviation, iii) the bias in the overall relationship between predicted and simulated residuals (indirectly assessing whether the random effects are centered around the fixed effect) iv) the non-linearity of the relationship between the predicted and the simulated random effects (this is given by the estimated degrees of freedom of the spline, with an edf = 1 signifying a linear fit and higher degrees of freedom non linear relationships; edfs that are less than one signify a more complex pattern of shrinkage of a variable proportion of the random effects to a single point, i.e. the fixed effect) v) the linear slope

There was no obvious bias in the estimation of the intercept (the only fixed effect in our simulation study) for various combinations of overall event rate and random effect standard deviation (but note higher dispersion for large standard deviation values, indicating the need for a higher number of simulation replicates in order to improve precision):

Bias in the estimation of the intercept in the mixed logistic regression model (shown in log scale)

Bias in the estimation of the intercept in the mixed logistic regression model (shown in log scale)

 

Similarly there was no bias in the estimation of the population standard deviation of the random effects:

Bias in the standard deviation of the random effects

Bias in the standard deviation of the random effects

The estimated relationship between predicted and simulated residuals was in linear for approximately 50% of simulated samples for small to moderate standard deviations. However it was non-linear (implying differential shrinkage according to random effect magnitude) for larger standard deviations, typical of the standard deviation of the surgeon random effect in the Surgeon Score Card.

Estimated degrees of freedom (a measure of non-linearity) in the relationship between predicted and simulated random effects over simulation replicates.

Estimated degrees of freedom (a measure of non-linearity) in the relationship between predicted and simulated random effects over simulation replicates. The red horizontal line corresponds to a linear relationship between predicted and simulated random effects (edf=1).Edf<1 signify the shrinkage of a variable proportion of random effects to zero

 

The magnitude of the slopes of the smoothing splines at different quartiles of the random effects distribution and for combinations of overall rate and random effect standard deviation are shown below:

Slope of the smoothing spline at different quantiles (20%, 50%, 80%) of the random effect distribution. Boxplots are over 200 replicates of each combination of overall rate and random effect standard deviation

Slope of the smoothing spline at different quantiles (20%, 50%, 80%) of the random effect distribution. Boxplots are over 200 replicates of each combination of overall rate and random effect standard deviation

There are several things to note in this figure:

  1. the amount of shrinkage decreases as the population standard deviation increases (going from left to right in each row of the figure the slopes increase from zero towards one)
  2. the amount of shrinkage decreases as the overall average increases for the same value of the standard deviation (going from top to bottom in the same column)
  3. the degree of shrinkage appears to be the same across the range of random effect quantiles for small to moderate values of the population standard deviation
  4. the discrepancy between the degree of shrinkage is maximized for larger values of the standard deviation of the random effects and small overall rates (top right subpanel). This is the situation that is more relevant for the Surgeon Score Card based on the analyses reported by this project.

What are the practical implications of these observations for the individual surgeon comparison reported in the Score Card project? These are best understood by a numerical example from one of these 200 hundred datasets shown in the top right subpanel. To understand this example one should note that the individual random effects have the interpretation of (log-) odds ratios, irrespective of whether they are predicted or (the simulated) true effects. Hence, the difference in these random effects when exponentiated yield the odds ratio of suffering a complication in the hands of a good relative to a bad surgeon. By comparing these random effects for good and bad surgeons who are equally bad (or good) relative to the mean (symmetric quantiles around the median), one can get an idea of the impact of using the predicted random effects to carry out individual comparisons.

Good Bad Quantile (Good) Quantile (Bad) True OR Pred OR Shrinkage Factor
-0.050 0.050 48.0 52.0 0.905 0.959 1.06
-0.100 0.100 46.0 54.0 0.819 0.920 1.12
-0.150 0.150 44.0 56.0 0.741 0.883 1.19
-0.200 0.200 42.1 57.9 0.670 0.847 1.26
-0.250 0.250 40.1 59.9 0.607 0.813 1.34
-0.300 0.300 38.2 61.8 0.549 0.780 1.42
-0.350 0.350 36.3 63.7 0.497 0.749 1.51
-0.400 0.400 34.5 65.5 0.449 0.719 1.60
-0.450 0.450 32.6 67.4 0.407 0.690 1.70
-0.500 0.500 30.9 69.1 0.368 0.662 1.80
-0.550 0.550 29.1 70.9 0.333 0.635 1.91
-0.600 0.600 27.4 72.6 0.301 0.609 2.02
-0.650 0.650 25.8 74.2 0.273 0.583 2.14
-0.700 0.700 24.2 75.8 0.247 0.558 2.26
-0.750 0.750 22.7 77.3 0.223 0.534 2.39
-0.800 0.800 21.2 78.8 0.202 0.511 2.53
-0.850 0.850 19.8 80.2 0.183 0.489 2.68
-0.900 0.900 18.4 81.6 0.165 0.467 2.83
-0.950 0.950 17.1 82.9 0.150 0.447 2.99
-1.000 1.000 15.9 84.1 0.135 0.427 3.15
-1.050 1.050 14.7 85.3 0.122 0.408 3.33
-1.100 1.100 13.6 86.4 0.111 0.390 3.52
-1.150 1.150 12.5 87.5 0.100 0.372 3.71
-1.200 1.200 11.5 88.5 0.091 0.356 3.92
-1.250 1.250 10.6 89.4 0.082 0.340 4.14
-1.300 1.300 9.7 90.3 0.074 0.325 4.37
-1.350 1.350 8.9 91.1 0.067 0.310 4.62
-1.400 1.400 8.1 91.9 0.061 0.297 4.88
-1.450 1.450 7.4 92.6 0.055 0.283 5.15
-1.500 1.500 6.7 93.3 0.050 0.271 5.44
-1.550 1.550 6.1 93.9 0.045 0.259 5.74
-1.600 1.600 5.5 94.5 0.041 0.247 6.07
-1.650 1.650 4.9 95.1 0.037 0.236 6.41
-1.700 1.700 4.5 95.5 0.033 0.226 6.77
-1.750 1.750 4.0 96.0 0.030 0.216 7.14
-1.800 1.800 3.6 96.4 0.027 0.206 7.55
-1.850 1.850 3.2 96.8 0.025 0.197 7.97
-1.900 1.900 2.9 97.1 0.022 0.188 8.42
-1.950 1.950 2.6 97.4 0.020 0.180 8.89
-2.000 2.000 2.3 97.7 0.018 0.172 9.39
-2.050 2.050 2.0 98.0 0.017 0.164 9.91

From these table it can be seen that predicted odds ratios are always larger than the true ones. The ratio of these odds ratios (the shrinkage factor) is larger, the more extreme comparisons are contemplated.

In summary, the use of the random effects models for the small cluster sizes (number of observations per surgeon) is likely to lead to estimates (or rather predictions) of individual effects that are smaller than their true values. Even though one should expect the differences to decrease with larger cluster sizes, this is unlikely to be observedin real world datasets (how often does one come across a surgeon who has performed 1000s of operation of the same type before they retire?). Furthermore, the comparison of  surgeon performance based on predicted random effects is likely to be misleading due to over-shrinkage.

 

 

 

 

Empirical bias analysis of random effects predictions in linear and logistic mixed model regression

July 30, 2015

In the first technical post in this series, I conducted a numerical investigation of the biasedness of random effect predictions in generalized linear mixed models (GLMM), such as the ones used in the Surgeon Scorecard, I decided to undertake two explorations: firstly, the behavior of these estimates as more and more data are gathered for each individual surgeon and secondly whether the limiting behavior of these estimators critically depends on the underlying GLMM family. Note that the first question directly assesses whether the random effect estimators reflect the underlying (but unobserved) “true” value of the individual practitioner effect in logistic regression models for surgical complications. On the other hand the second simulation examines a separate issue, namely whether the non-linearity of the logistic regression model affects the convergence rate of the random effect predictions towards their true value.

For these simulations we will examine three different ranges of dataset sizes for each surgeon:

  • small data (complication data from between 20-100 cases/ surgeon are available)
  • large data (complications from between 200-1000 cases/surgeon)
  • extra large data (complications from between 1000-2000 cases/surgeon)

We simulated 200 surgeons (“random effects”) from a normal distribution with a mean of zero and a standard deviation of 0.26, while the population average complication rate was set t0 5%. These numbers were chosen to reflect the range of values (average and population standard deviation) of the random effects in the Score Card dataset, while the use of 200 random effects was a realistic compromise with the computational capabilities of the Asus Transformer T100 2 in 1 laptop/tablet that I used for these analyses.

The following code was used to simulate the logistic case for small data (the large and extra large cases were simulated by changing the values of the Nmin and Nmax variables).

library(lme4)
library(mgcv)
## helper functions
logit<-function(x) log(x/(1-x))
invlogit<-function(x) exp(x)/(1+exp(x))

## simulate cases
simcase<-function(N,p) rbinom(N,1,p)
## simulation scenario
pall<-0.05; # global average
Nsurgeon<-200; # number of surgeons
Nmin<-20; # min number of surgeries per surgeon
Nmax<-100; # max number of surgeries per surgeon

## simulate individual surgical performance
## how many simulations of each scenario
set.seed(123465); # reproducibility
ind<-rnorm(Nsurgeon,0,.26) ; # surgical random effects
logitind<-logit(pall)+ind ; # convert to logits
pind<-invlogit(logitind); # convert to probabilities
Nsim<-sample(Nmin:Nmax,Nsurgeon,replace=TRUE); # number of cases per surgeon

complications<-data.frame(ev=do.call(c,mapply(simcase,Nsim,pind,SIMPLIFY=TRUE)),
id=do.call(c,mapply(function(i,N) rep(i,N),1:Nsurgeon,Nsim)))
complications$id<-factor(complications$id)

 

A random effect and fixed effect model were fit to these data (the fixed effect model is simply a series of independent fits to the data for each random effect):


## Random Effects

fit2<-glmer(ev~1+(1|id),data=complications,family=binomial,nAGQ=2)
ran2<-ranef(fit2)[["id"]][,1]
c2<-cor(ran2,ind)
int2<-fixef(fit2)
ranind2<-ran2+int2

## Fixed Effects

fixfit<-vector("numeric",Nsurgeon)
for(i in 1:Nsurgeon) {
fixfit[i]<-glm(ev~1,data=subset(complications,id==i),family="binomial")$coef[1]
}

The corresponding Gaussian GLMM cases were simulated by making minor changes to these codes. These are shown below:


simcase<-function(N,p) rnorm(N,p,1)

fit2<-glmer(ev~1+(1|id),data=complications,nAGQ=2)

fixfit[i]<-glm(ev~1,data=subset(complications,id==i),family="gaussian")$coef[1]

The predicted random effects were assessed against the simulated truth by smoothing regression splines. In these regressions, the intercept yields the bias of the average of the predicted random effects vis-a-vis the truth, while the slope of the regression quantifies the amount of shrinkage effected by the mixed model formulation. For unbiased estimation not only would we like the intercept to be zero, but also the slope to be equal to one. In this case, the predicted random effect would be equal to its true (simulated) value. Excessive shrinkage would result in a slope that is substantially different from one. Assuming that the bias (intercept) is not different from zero, the relaxation of the slope towards one quantifies the consistency and the bias (or rather its rate of convergence) of these estimators using simulation techniques (or so it seems to me).

The use of smoothing (flexible), rather than simple linear regression, to quantify these relationships does away with a restrictive assumption: that the amount of shrinkage is the same throughout the range of the random effects:

## smoothing spline (flexible) fit
fitg<-gam(ran2~s(ind)
## linear regression
fitl<-lm(ran2~ind)

The following figure shows the results of the flexible regression (black with 95% CI, dashed black) v.s. the linear regression (red) and the expected (blue) line (intercept of zero, slope of one).

Predicted v.s. simulated random effects for logistic and linear mixed regression as a function of the number of observations per random effect (cluster size)

Predicted v.s. simulated random effects for logistic and linear mixed regression as a function of the number of observations per random effect (cluster size)

Several observations are worth noting in this figure.
First
, the flexible regression was indistinguishable from a linear regression in all cases; hence the red and black lines overlap. Stated in other terms, the amount of shrinkage was the same across the range of the random effect values.
Second, the intercept in all flexible models was (within machine precision) equal to zero. Consequently, when estimating a group of random effects their overall mean is (unbiasedly) estimated.
Third, the amount of shrinkage of individual random effects appears to be excessive for small sample sizes (i.e. few cases per surgeon). Increasing the number of cases decreases the shrinkage, i.e. the black and red lines come closer to the blue line as N is increased from 20-100 to 1000-2000. Conversely, for small cluster sizes the amount of shrinkage is so excessive that one may lose the ability to distinguish between individuals with very different complication rates. This is reflected by a regression line between the predicted and the simulated random effect value that is nearly horizontal.
Fourth, the rate of convergence of the predicted random effects to their true value critically depends upon the linearity of the regression model. In particular, the shrinkage of logistic regression model with 1000-2000 observations per case is almost the same at that of a linear model with 20-100 for the parameter values considered in this simulation.

An interesting question is whether these observations (overshrinkage of random effects from small sample sizes in logistic mixed regression) reflects the use of random effects in modeling, or whether they are simply due to the interplay between sample size and the non-linearity of the statistical model. Hence, I turned to fixed effects modeling of the same datasets. The results of these analyses are summarized in the following figure:

Difference between fixed effect estimates of random effects(black histograms) v.s. random effects predictions (density estimators: red lines) relative to their simulated (true) values

Difference between fixed effect estimates of random effects(black histograms) v.s. random effects predictions (density estimators: red lines) relative to their simulated (true) values

One notes that the distribution of the differences between the random and fixed effects relative to the true (simulated) values is nearly identical for the linear case (second row). In other words, the use of the implicit constraint of the mixed model, offers no additional advantage when estimating individual performance in this model. On the other hand there is a value in applying mixed modeling techniques for the logistic regression case. In particular, outliers (such as those arising for small samples) are eliminated by the use of random effect modeling. The difference between the fixed and the random effect approach progressively decreases for large sample sizes, implying that the benefit of the latter approach is lost for “extra large” cluster sizes.

One way to put these differences into perspective is to realize that the random effects for the logistic model correspond to log-odd ratios, relative to the population mean. Hence the difference between the predicted random effect and its true value, when exponentiated, corresponds to an Odd Ratio (OR). A summary of the odds ratios over the population of the random effects as a function of cluster size is shown below.


Metric 20-100  200-1000 1000-2000
Min    0.5082   0.6665    0.7938
Q25    0.8901   0.9323    0.9536
Median 1.0330   1.0420    1.0190 
Mean   1.0530   1.0410    1.0300  
Q75    1.1740   1.1340    1.1000   
Max    1.8390   1.5910    1.3160 

 

Even though the average Odds Ratio is close to 1, a substantial number of predicted random effects are far from the true value and yield ORs that are greater than 11% in either direction for small cluster sizes. These observations have implications for the Score Card (or similar projects): if one were to use Random Effects modeling to focus on individuals, then unless the cluster sizes (observations per individual) are substantial, one would run a substantial risk of misclassifying individuals, even though one would be right on average!

One could wonder whether these differences between the simulated truth and the predicted random effects arise as a result of the numerical algorithms of the lme4 package. The latter was used by both the Surgeon Score Card project and our simulations so far and thus it would be important to verify that it performs up to specs. The major tuning variable for the algorithm is the order of the Adaptive Gaussian Quadrature (argument nAGQ). We did not find any substantial departures when the order of the quadrature was varied from 0 to 1 and 2. However, there is a possibility that the algorithm fails for all AGQ orders as it has to calculate probabilities that are numerically close to the boundary of the parameter space. We thus decided to fit the same model from a Bayesian perspective using Markov Chain Monte Carlo (MCMC) methods. The following code will fit the Bayesian model and graphs the true values of the effects used in the simulated dataset against the Bayesian estimates (the posterior mean) and also the lme4 predictions. The latter tend to be numerically close to the posterior mode of the random effects when a Bayesian perspective is adopted.


## Fit the mixed effects logistic model from R using openbugs

library("glmmBUGS")
library(nlme)
fitBUGS = glmmBUGS(ev ~ 1, data=complications, effects="id", family="bernoulli")
startingValues = fitBUGS$startingValues
source("getInits.R")
require(R2WinBUGS)
fitBUGSResult = bugs(fitBUGS$ragged, getInits, parameters.to.save = names(getInits()),
model.file="model.txt", n.chain=3, n.iter=6000, n.burnin=2000, n.thin=10,
program="openbugs", working.directory=getwd())

fitBUGSParams = restoreParams(fitBUGSResult , fitBUGS$ragged)
sumBUGS<-summaryChain(fitBUGSParams )
checkChain(fitBUGSParams )

## extract random effects

cnames<-as.character(sort(as.numeric(row.names(sumBUGS$FittedRid))))
fitBUGSRE<-sumBUGS$Rid[cnames,1]

## plot against the simulated (true) effects and the lme4 estimates

hist(ind,xlab="RE",ylim=c(0,3.8),freq=FALSE,main="")
lines(density(fitBUGSRE),main="Bayesian",xlab="RE",col="blue")
lines(density(ran2),col="red")
legend(legend=c("Truth","lme4","MCMC"),col=c("black","red","blue"),
bty="n",x=0.2,y=3,lwd=1)

The following figure shows the histogram of the true values of the random effects (black), the frequentist(lme4) estimates (red) and the Bayesian posterior means (blue).

 

MCMClme4Truth

It can be appreciated that both the Bayesian estimates and the lme4 predictions demonstrate considerable shrinkage relative to the true values for small cluster sizes (20-100). Hence, an lme4 numerical quirk seems an unlikely explanation for the shrinkage observed in the simulation.

Summing up:

  • Random Effect modeling of binomial outcomes require a substantial number of observations per individual (cluster size) for the procedure to yield estimates of individual effects that are numerically indistinguishable  from the true values
  • Fixed effect modeling is even worse an approach for this problem
  • Bayesian fitting procedures do not appear to yield numerically different effects from their frequentist counterparts

These features should raise the barrier for accepting a random effects logistic modeling approach when the focus is on individual rather than population average effects. Even though the procedure is certainly preferable to fixed effects regression, the direct use of the value of the predicted individual random effects as an effect measure will be problematic for small cluster sizes (e.g. a small number of procedures per surgeon). In particular, a substantial proportion of these estimated effects is likely to be far from the truth even if the model is unbiased on the average. These observations are of direct relevance to the Surgical Score Card in which the number of observations per surgeon were far lower than the average value in our simulations: 60 (small), 600 (large) and 1500 (extra large):

Procedure Code N (procedures) N(surgeons) Procedures per surgeon
51.23 201,351 21,479 9.37
60.5 78,763 5,093 15.46
60.29 73,752 7,898 9.34
81.02 52,972 5,624 9.42
81.07 106,689 6,214 17.17
81.08 102,716 6,136 16.74
81.51 494,576 13,414 36.87
81.54 1,190,631 18,029 66.04
Total 2,301,450 83,887 27.44