Excellent discussion and some nice references about Bayesian modelling in the context of model misspecification

]]>

]]>

- a = min of the data
- b = max of the data
- m = median
- n = size of the sample

the formulas are as follows:

**Mean **

**Variance **

The following R function will carry out these calculations

f<-function(a,m,b,n)

{

mn<-(a+2*m+b)/4+(a-2*m+b)/(4*n)

s=sqrt((a*a+m*m+b*b+(n-3)*((a+m)^2+(m+b)^2)/8-n*mn*mn)/(n-1))

c(mn,s)

}

Edit

Surfing around arxiv, I found another paper that handles additional scenarios and proposes alternative formulas

]]>

(You may stop reading now – or you may read past the rant in the next paragraph)

**Rant Mode On: **Another criticism that I received, is that I fail to understand random effects modeling, which I personally thought it was funny because one of my recent papers actually says one should scrap the Cox survival model for generalized additive models (which are just generalized linear mixed models in disguise). In any case, assuming that my understanding of mixed models is poor, maybe my conclusion that these models are to be preferred for such applications is also problematic?

Since one may consider my vote-of-confidence-to-the-almighty-mix-model v.s. my criticism of their application in the Scorecard project, subtle evidence for paranoid schizophrenia let me sum up how one can simultaneously hold these beliefs:

**Even though mixed models are to be preferred especially as more data accumulate(a point I make clear in the 3 blog posts I wrote), no modeling could overcome the severe lack of events in the scorecard database.**

There are other technical issues that one could “rant” about e.g. how were the hospital and surgeon effects were combined, but I will not digress any further.

**Rant mode off**

**Back to criticism: **In my opinion a major reason that mixed models were used is the potentially large number of surgeons with zero complications in their limited observation records*. The use of shrinkage models allow one to report performance for such surgeons, and generate nice colorful graphs (with one and possibly two decimal points) about their performance instead of reporting a NaN (Not A Number). **Incidentally, the shrinkage of the individual surgeon effects all the way towards the population mean is the mixed model’s way of telling you the same thing: since it could not estimate how these people actually performed, it might as well give back the only answer consistent with the model applied, i.e. everybody is (almost) like everybody else and thus everyone is average. **

*Curiously this information, i.e. the number of surgeons without any complication in the source database is not provided, although I’d have thought it would be important to report this number in the shake of transparency.

**Word of caution: **For surgeons without *any complications in the database*, the actual information comes not from their complications but from the number of uncomplicated surgeries they performed. This raises the question of **inclusion effects** (a point Andrew Gelman thoroughly address in his book) and the **associated selection biases** inherent in comparing surgeons who accept v.s. those who do not accept Medicare (or accept too little of it) and the corresponding **social determinants of health outcomes** (beautifully explained here).

**A non-statistical criticism: **As a nephrologist, I find it amusingly insulting that Acute Kidney Injury (coded as Acute Renal Failure) was not considered a surgical complication, even though it certainly complicates surgeries. Surgery is in fact the leading cause of “hospital-acquired” AKI (40% of cases are preceded by a surgical procedure). But maybe, as Benjamin Davies pointed out, the real end-points are not really measured by the Scorecard. Or I could be just delusional when I point out to surgical colleagues that the selection of analgesia, fluid management and pre as well as post perioperative care DO play a role for some complications. Or it could be just the anesthesiologists’s fault

]]>

Two weeks ago, the world of medical journalism was rocked by the public release of ProPublica’s Surgeon Scorecard. In this project ProPublica “*calculated death and complication rates for surgeons performing one of eight elective procedures in Medicare, carefully adjusting for differences in patient health, age and hospital quality*.” By making the dataset available through a user friendly portal, the intended use of this public resource was to “*use this database to know more about a surgeon before your operation*“.

The Scorecard was met with great enthusiasm and coverage by non-medical media. TODAY.com headline “nutshelled” the Scorecard as a resource that “aims to help you find doctors with lowest complication rates“. A (?tongue in cheek) NBC headline tells us the scorecard “It’s complicated“. On the other hand the project was not well received by my medical colleagues. John Mandrola gave it a failing grade in Medscape. Writing at KevinMD.com, Jeffrey Parks called it a journalistic low point for ProPublica. Jha Shaurabh pointed out a potential paradox in a statistically informed, easy to read and highly entertaining piece. In this paradox, the surgeon with the higher complication case who takes high risk patients from a disadvantaged socio-economic environment, may actually be the surgeon one wants to perform one’s surgery! Ed Schloss summarized the criticism (in the blogosphere and twitter) in an open letter and asked for peer review of the Scorecard methodology.

The criticism to date has largely focused on the potential for selection effects (as the Scorecard is based on Medicare data, and does not include data from private insurers), the incomplete adjustment for confounders, the paucity of data for individual surgeons, the counting of complications and re-admission rates, decisions about risk category classification boundaries and even data errors (ProPublica’s response arguing that the Scorecard matters may be found here). With a few exceptions (e.g. see Schloss’s blogpost in which the complexity of the statistical approach is mentioned) the criticism of the statistical approach (including my own comments in twitter) has largely focused on these issues.

On the other hand, the underlying statistical methodology (here and there) that powers the Scorecard has not received much attention. Therefore I undertook a series of simulation experiments to explore the impact of the statistical procedures on the inferences afforded by the Scorecard.

ProPublica’s approach to the scorecard is based on logistic regression model, in which individual surgeon (and hospital) performance (probability of suffering a complication) is modelled using Gaussian *random effects*, while patient level characteristics that may act as confounders are adjusted for, using *fixed effects*. In a nutshell this approach implies fitting a model of the average complication rate that is function of the fixed effects (e.g. patient age) for the entire universe of surgeries performed in the USA. Individual surgeon and hospital factors modify this complication rate, so that a given surgeon and hospital will have an individual rate that varies around the population average. These individual surgeon and hospital factors are constrained to follow Gaussian, bell-shaped distribution when analyzing complication data. After model fitting, these *predicted random effects* are used to quantify and compare surgical performance. A feature of mixed modeling approaches is the unavoidable *shrinkage* of the raw complication rate towards the population mean. Shrinkage implies that the dynamic range of the actually observed complication rates is compressed. This is readily appreciated in the figures generated by the ProPublica analytical team:

In their methodological white paper the ProPublica team notes:

“**While raw rates ranged from 0.0 to 29%, the Adjusted Complication Rate goes from 1.1 to ****5.7%**. …. *shrinkage is largely a result of modeling in the first place*, not due to adjusting for case mix. This shrinkage is another piece of the measured approach we are taking: we are taking care not to unfairly characterize surgeons and hospitals.”

**These features should alert us that something is going on. For if a model can distort the data to such a large extent, then the model should be closely scrutinized before being accepted. In fact, given these observations, it is possible that one mistakes the noise from the model for the information hidden in the empirical data. Or, even more likely, that one is not using the model in the most productive manner.**

Note that these comments should not be interpreted as a criticism against the use of mixed models in general, or even for the particular aspects of the Scorecard project. They are rather a call for re-examining the modeling assumptions and for gaining a better understanding of the model “mechanics of prediction” before releasing the Kraken to the world.

There are many technical aspects one could potentially misfire in a Generalized Linear Mixed Model for complication rates. Getting the wrong shape of the random effects distribution ~~is of ~~ may or may not be of concern (e.g. assuming it is bell shaped when it is not). Getting the underlying model wrong, e.g. assuming the binomial model for complication rates while a model with many more zeros (a zero inflated model) may be more appropriate, is yet another potential problem area. However, even if these factors are not operational, then one may still be misled when using the results of the model. In particular, the major area of concern for such models is the *cluster size*: the number of observations per individual random effect (e.g. surgeon) in the dataset. It is this factor, rather than the actual size of the dataset that determines the precision in the individual random affects. Using a toy example, we show that the number of observations per surgeon typical of the Scorecard dataset, leads to predicted random effects that may be far from their true value. This seems to stem from the non-linear nature of the logistic regression model. As we conclude in our first technical post:

- Random Effect modeling of binomial outcomes require
**a substantial number of observations per individual (in the order of thousands)**for the procedure to yield estimates of individual effects that are numerically indistinguishable from the true values.

Contrast this conclusion to the cluster size in the actual scorecard:

Procedure Code |
N (procedures) |
N(surgeons) |
Procedures per surgeon |

51.23 | 201,351 | 21,479 | 9.37 |

60.5 | 78,763 | 5,093 | 15.46 |

60.29 | 73,752 | 7,898 | 9.34 |

81.02 | 52,972 | 5,624 | 9.42 |

81.07 | 106,689 | 6,214 | 17.17 |

81.08 | 102,716 | 6,136 | 16.74 |

81.51 | 494,576 | 13,414 | 36.87 |

81.54 | 1,190,631 | 18,029 | 66.04 |

Total |
2,301,450 | 83,887 | 27.44 |

In a follow up simulation study we demonstrate that this feature results in predicted individual effects that are non-uniformly shrank towards their average value. This compromises the ability of mixed model predictions to separate the good from the bad “apples”.

In the second technical post, we undertook a simulation study to understand the implications of over-shrinkage for the Scorecard project. These are best understood by a numerical example from one of simulated datasets. To understand this example one should note that the individual random effects have the interpretation of (log-) odds ratios. Hence, the difference in these random effects when exponentiated yield the odds ratio of suffering a complication in the hands of a good relative to a bad surgeon. By comparing these random effects for good and bad surgeons who are equally bad (or good) relative to the mean (symmetric quantiles around the median), one can get an idea of the impact of using the predicted random effects to carry out individual comparisons.

Good |
Bad |
Quantile (Good) |
Quantile (Bad) |
True OR |
Pred OR |
Shrinkage Factor |

-0.050 | 0.050 | 48.0 | 52.0 | 0.905 | 0.959 | 1.06 |

-0.100 | 0.100 | 46.0 | 54.0 | 0.819 | 0.920 | 1.12 |

-0.150 | 0.150 | 44.0 | 56.0 | 0.741 | 0.883 | 1.19 |

-0.200 | 0.200 | 42.1 | 57.9 | 0.670 | 0.847 | 1.26 |

-0.250 | 0.250 | 40.1 | 59.9 | 0.607 | 0.813 | 1.34 |

-0.300 | 0.300 | 38.2 | 61.8 | 0.549 | 0.780 | 1.42 |

-0.350 | 0.350 | 36.3 | 63.7 | 0.497 | 0.749 | 1.51 |

-0.400 | 0.400 | 34.5 | 65.5 | 0.449 | 0.719 | 1.60 |

-0.450 | 0.450 | 32.6 | 67.4 | 0.407 | 0.690 | 1.70 |

-0.500 | 0.500 | 30.9 | 69.1 | 0.368 | 0.662 | 1.80 |

-0.550 | 0.550 | 29.1 | 70.9 | 0.333 | 0.635 | 1.91 |

-0.600 | 0.600 | 27.4 | 72.6 | 0.301 | 0.609 | 2.02 |

-0.650 | 0.650 | 25.8 | 74.2 | 0.273 | 0.583 | 2.14 |

-0.700 | 0.700 | 24.2 | 75.8 | 0.247 | 0.558 | 2.26 |

-0.750 | 0.750 | 22.7 | 77.3 | 0.223 | 0.534 | 2.39 |

-0.800 | 0.800 | 21.2 | 78.8 | 0.202 | 0.511 | 2.53 |

-0.850 | 0.850 | 19.8 | 80.2 | 0.183 | 0.489 | 2.68 |

-0.900 | 0.900 | 18.4 | 81.6 | 0.165 | 0.467 | 2.83 |

-0.950 | 0.950 | 17.1 | 82.9 | 0.150 | 0.447 | 2.99 |

-1.000 | 1.000 | 15.9 | 84.1 | 0.135 | 0.427 | 3.15 |

-1.050 | 1.050 | 14.7 | 85.3 | 0.122 | 0.408 | 3.33 |

-1.100 | 1.100 | 13.6 | 86.4 | 0.111 | 0.390 | 3.52 |

-1.150 | 1.150 | 12.5 | 87.5 | 0.100 | 0.372 | 3.71 |

-1.200 | 1.200 | 11.5 | 88.5 | 0.091 | 0.356 | 3.92 |

-1.250 | 1.250 | 10.6 | 89.4 | 0.082 | 0.340 | 4.14 |

-1.300 | 1.300 | 9.7 | 90.3 | 0.074 | 0.325 | 4.37 |

-1.350 | 1.350 | 8.9 | 91.1 | 0.067 | 0.310 | 4.62 |

-1.400 | 1.400 | 8.1 | 91.9 | 0.061 | 0.297 | 4.88 |

-1.450 | 1.450 | 7.4 | 92.6 | 0.055 | 0.283 | 5.15 |

-1.500 | 1.500 | 6.7 | 93.3 | 0.050 | 0.271 | 5.44 |

-1.550 | 1.550 | 6.1 | 93.9 | 0.045 | 0.259 | 5.74 |

-1.600 | 1.600 | 5.5 | 94.5 | 0.041 | 0.247 | 6.07 |

-1.650 | 1.650 | 4.9 | 95.1 | 0.037 | 0.236 | 6.41 |

-1.700 | 1.700 | 4.5 | 95.5 | 0.033 | 0.226 | 6.77 |

-1.750 | 1.750 | 4.0 | 96.0 | 0.030 | 0.216 | 7.14 |

-1.800 | 1.800 | 3.6 | 96.4 | 0.027 | 0.206 | 7.55 |

-1.850 | 1.850 | 3.2 | 96.8 | 0.025 | 0.197 | 7.97 |

-1.900 | 1.900 | 2.9 | 97.1 | 0.022 | 0.188 | 8.42 |

-1.950 | 1.950 | 2.6 | 97.4 | 0.020 | 0.180 | 8.89 |

-2.000 | 2.000 | 2.3 | 97.7 | 0.018 | 0.172 | 9.39 |

-2.050 | 2.050 | 2.0 | 98.0 | 0.017 | 0.164 | 9.91 |

From this table it can be seen that predicted odds ratios are always larger than the true ones. The ratio of these odds ratios (the shrinkage factor) is larger, the more extreme comparisons are contemplated.

In summary, the use of the random effects models for the small cluster sizes (number of observations per surgeon) is likely to lead to estimates (or rather predictions) of individual effects that are smaller than their true values. Even though one should expect the differences to decrease with larger cluster sizes, this is unlikely to happen in real world datasets (how often does one come across a surgeon who has performed 1000s of operation of the same type before they retire?). Hence, **the comparison of surgeon performance based on these random effect predictions is likely to be misleading due to over-shrinkage**.

ProPublica should be congratulated for taking up such an ambitious, and ultimately useful project. However, the limitations of the adopted approach should make one very skeptical about accepting the inferences from their modeling tool. In particular, the small number of observations per surgeon limits the utility of the predicted random effects to directly compare surgeons due to over-shrinkage. Further studies are required before one could use the results of mixed effects modeling for this application. Based on some limited simulation experiments (that we do not present here), it seems that relative rankings of surgeons may be robust measures of surgical performance, at least compared to the absolute rates used by the Scorecard. Adding my voice to that of Dr Schloss, I think it is time for an open and transparent dialogue (and possibly a “crowdsourced” research project) to better define the best measure of surgical performance given the limitations of the available data. Such a project could also explore other directions, e.g. the explicit handling of zero inflation and even go beyond the ubiquitous bell shaped curve. By making the R code available, I hope that someone (possibly ProPublica) who can access more powerful computational resources can perform more extensive simulations. These may better define other aspects of the modeling approach and suggest improvements in the scorecard methodology. In the meantime, it is probably a good idea not to exclusively rely on the numerical measures of the scorecard when picking up the surgeon who will perform your next surgery.

]]>

The basic code that simulates from these logistic regression models is shown below. Although we only examine small cluster sizes (number of observations per cluster 20-100) and a moderate number of individual random effects (200) it is straightforward to modify these aspects of the simulation study. For these simulations we examined a limited number of values for the standard deviation of the random effects (0.1, 0.2 and 0.5) and overall complication rates (0.05, 0.10, 0.15, 0.20) to reflect the potential range of values compatible with the raw data in the Surgeon Scorecard.

library(lme4) library(mgcv) ## helper functions logit<-function(x) log(x/(1-x)) invlogit<-function(x) exp(x)/(1+exp(x)) ## code allows a sensitivity analysis to the order of the nAGQ ## explanation of symbols/functions appearing inside the ## simulation function ## fit: glmer fit using a X point nAGQ integration rule ## ran: random effects using a X point nAGQ integration rule ## cR: correlation between fitted and simulated random effect ## the fitted were obtained with a X point nAGQ integrator ## int: intercept obtained with a X point nAGQ integration rule ## biG: bias of the fitted v.s. the simulated random effects ## this is obtained as the intercept of a GAM regression of ## the fitted versus the simulated values ## bi0: bias of the intercept from the glmer fit ## sdR: estimated standard deviation of the random eff distro ## bs0: bias of the standard deviation of the random effects (SDR) ## edf: non-linearity of the slope of a GAM regression of the ## fitted v.s. the simulated residuals ## slQ: derivative of the non-linear slope at the Qth quantile simFit<-function(X,complications,ind,pall,SDR,Q=0:10/10) { fit<-glmer(ev~1+(1|id),data=complications,family=binomial,nAGQ=X) ran<-ranef(fit)[["id"]][,1] ## predicted (mode) of random effects cR<-cor(ran,ind) int<-fixef(fit) ## regress on the residuals to assess performance fitg<-gam(ran~s(ind,bs="ts")) ## gam fit edf<-sum(fitg$edf[-1])## edf of the non-linear "slope" biG<-fitg$coef[1] bi0<-int-logit(pall) sdR<-sqrt(as.data.frame(VarCorr(fit))$vcov) bs0<-sdR-SDR sQ<-quantile(ind,prob=Q) slQ<-(predict(fitg,newdata=data.frame(ind=sQ+0.0001),type="terms")[,1]- predict(fitg,newdata=data.frame(ind=sQ),type="terms")[,1])/0.0001 names(slQ)<-paste("slQ",Q,sep="") ret<-cbind(int=int,edf=edf,biG=biG,bi0=bi0,bs0=bs0,sdR=sdR,cR=cR,t(slQ),pall=pall,SDR=SDR) } ## function to simulate cases simcase<-function(N,p) rbinom(N,1,p) ## simulation scenarios: fixed for each simulation Nsurgeon<-200; # number of surgeons Nmin<-20; # min number of surgeries per surgeon Nmax<-100; # max number of surgeries per surgeon Nsim<-sample(Nmin:Nmax,Nsurgeon,replace=TRUE); # number of cases per surgeon ## simulate individual surgical performance ## the reality consists of different combos of pall and the sd of the ## random effect Nscenariosim<-200 ## number of simulations per scenario scenarios<-expand.grid(iter=1:Nscenariosim,pall=seq(0.05,.20,.05), sd=c(0.1,0.2,.5)) ## simulate indivindual scenarios simScenario<-function(seed,pall,sd,X,Q) { set.seed(seed) ind<-rnorm(Nsurgeon,0,sd) ; # surgical random effects logitind<-logit(pall)+ind ; # convert to logits pind<-invlogit(logitind); # convert to probabilities complications<-data.frame(ev=do.call(c,mapply(simcase,Nsim,pind,SIMPLIFY=TRUE)), id=factor(do.call(c,mapply(function(i,N) rep(i,N),1:Nsurgeon,Nsim)))) simFit(X,complications,ind,pall,sd,Q) } X<-0 Q=0:10/10 system.time(perf<-mapply(simScenario,scenarios$iter,scenarios$pall,scenarios$sd, MoreArgs=list(X=X,Q=Q),SIMPLIFY=FALSE)) perf<-do.call(rbind,perf)

The datasets are generated by the function simScenario, which when applied over all combinations of population mean(pall, treated as a fixed effect) and random effect standard deviation (sd) generates the synthetic dataset (‘complications’ variable). Once the data have been generated the function simFit receives the synthetic dataset, fits the mixed logistic regression model and generates random effect predictions.

To assess and quantify shrinkage in these regression models, we compare the predicted random effects against the simulated values. Due to the potentially large number of these effects (10s of thousands in the Surgeon Score Card), we developed a methodology that looks for patterns in these bivariate relationships, taking into account the magnitude of each simulated and predicted random effect pair. This methodology rests on flexible regression between the predicted against the simulated random effect in each of the synthetic datasets. This flexible regression, using smoothing splines, generalizes linear regression and thus avoids the assumption that the shrinkage (as assessed by the slope of the linear regression line) is the same irrespective of the value of the simulated random effect. By numerically differentiating the spline at various quantiles of the simulated random effect, we thus have an estimate of shrinkage. In particular, this estimate gives the change in value of the predicted random effect for a unit change in the simulated one. The relevant code that fits the smoothing spline and differentiates it in (user defined) grid of random effects quantiles is shown below:

</pre> <pre>fitg<-gam(ran~s(ind,bs="ts")) ## gam fit edf<-sum(fitg$edf[-1])## edf of the non-linear "slope" biG<-fitg$coef[1] bi0<-int-logit(pall) sdR<-sqrt(as.data.frame(VarCorr(fit))$vcov) bs0<-sdR-SDR sQ<-quantile(ind,prob=Q) slQ<-(predict(fitg,newdata=data.frame(ind=sQ+0.0001),type="terms")[,1]- predict(fitg,newdata=data.frame(ind=sQ),type="terms")[,1])/0.0001 names(slQ)<-paste("slQ",Q,sep="")

A representative smoothing spline analysis of a mixed logistic regression model is shown in the figure below, along with the corresponding linear fit, and the line signifying zero shrinkage: a straight line passing from the origin with a unit slope. If the latter relationship is observed, then the predicted random effects are unbiased with respect to the simulated ones.

This figure shows the potential for random effect predictions to be excessively shrunken relative to the simulated ones (note that the blue line is more horizontal relative to the blue one). One should also observe that the amount of shrinkage is not the same throughout the range of the random effects: positive values (corresponding to individuals with higher complication rates) are shrank relatively less (the green line is closer to the blue line), relative to negative random effects (complication rates less than the population average). **The net effect of this non-linear relationship is that the separation between “good” (negative RE) and “bad” (positive values of the RE) is decreased in the predicted relative to the simulated random effects. This can be a rather drastic compression in dynamic range for small cluster sizes as we will see below. **

From each mixed logistic regression model we obtain a large number of information : i) the bias in the fixed effect estimate, ii) the bias in the population standard deviation, iii) the bias in the overall relationship between predicted and simulated residuals (indirectly assessing whether the random effects are centered around the fixed effect) iv) the non-linearity of the relationship between the predicted and the simulated random effects (this is given by the estimated degrees of freedom of the spline, with an edf = 1 signifying a linear fit and higher degrees of freedom non linear relationships; edfs that are less than one signify a more complex pattern of shrinkage of a variable proportion of the random effects to a single point, i.e. the fixed effect) v) the linear slope

There was no obvious bias in the estimation of the intercept (the only fixed effect in our simulation study) for various combinations of overall event rate and random effect standard deviation (but note higher dispersion for large standard deviation values, indicating the need for a higher number of simulation replicates in order to improve precision):

Similarly there was no bias in the estimation of the population standard deviation of the random effects:

The estimated relationship between predicted and simulated residuals was in linear for approximately 50% of simulated samples for small to moderate standard deviations. However it was non-linear (implying differential shrinkage according to random effect magnitude) for larger standard deviations, typical of the standard deviation of the surgeon random effect in the Surgeon Score Card.

The magnitude of the slopes of the smoothing splines at different quartiles of the random effects distribution and for combinations of overall rate and random effect standard deviation are shown below:

There are several things to note in this figure:

- the amount of shrinkage decreases as the population standard deviation increases (going from left to right in each row of the figure the slopes increase from zero towards one)
- the amount of shrinkage decreases as the overall average increases for the same value of the standard deviation (going from top to bottom in the same column)
- the degree of shrinkage appears to be the same across the range of random effect quantiles for small to moderate values of the population standard deviation
- the discrepancy between the degree of shrinkage is maximized for larger values of the standard deviation of the random effects and small overall rates (top right subpanel). This is the situation that is more relevant for the Surgeon Score Card based on the analyses reported by this project.

What are the practical implications of these observations for the individual surgeon comparison reported in the Score Card project? These are best understood by a numerical example from one of these 200 hundred datasets shown in the top right subpanel. To understand this example one should note that the individual random effects have the interpretation of (log-) odds ratios, irrespective of whether they are predicted or (the simulated) true effects. Hence, the difference in these random effects when exponentiated yield the odds ratio of suffering a complication in the hands of a good relative to a bad surgeon. By comparing these random effects for good and bad surgeons who are equally bad (or good) relative to the mean (symmetric quantiles around the median), one can get an idea of the impact of using the predicted random effects to carry out individual comparisons.

Good |
Bad |
Quantile (Good) |
Quantile (Bad) |
True OR |
Pred OR |
Shrinkage Factor |

-0.050 | 0.050 | 48.0 | 52.0 | 0.905 | 0.959 | 1.06 |

-0.100 | 0.100 | 46.0 | 54.0 | 0.819 | 0.920 | 1.12 |

-0.150 | 0.150 | 44.0 | 56.0 | 0.741 | 0.883 | 1.19 |

-0.200 | 0.200 | 42.1 | 57.9 | 0.670 | 0.847 | 1.26 |

-0.250 | 0.250 | 40.1 | 59.9 | 0.607 | 0.813 | 1.34 |

-0.300 | 0.300 | 38.2 | 61.8 | 0.549 | 0.780 | 1.42 |

-0.350 | 0.350 | 36.3 | 63.7 | 0.497 | 0.749 | 1.51 |

-0.400 | 0.400 | 34.5 | 65.5 | 0.449 | 0.719 | 1.60 |

-0.450 | 0.450 | 32.6 | 67.4 | 0.407 | 0.690 | 1.70 |

-0.500 | 0.500 | 30.9 | 69.1 | 0.368 | 0.662 | 1.80 |

-0.550 | 0.550 | 29.1 | 70.9 | 0.333 | 0.635 | 1.91 |

-0.600 | 0.600 | 27.4 | 72.6 | 0.301 | 0.609 | 2.02 |

-0.650 | 0.650 | 25.8 | 74.2 | 0.273 | 0.583 | 2.14 |

-0.700 | 0.700 | 24.2 | 75.8 | 0.247 | 0.558 | 2.26 |

-0.750 | 0.750 | 22.7 | 77.3 | 0.223 | 0.534 | 2.39 |

-0.800 | 0.800 | 21.2 | 78.8 | 0.202 | 0.511 | 2.53 |

-0.850 | 0.850 | 19.8 | 80.2 | 0.183 | 0.489 | 2.68 |

-0.900 | 0.900 | 18.4 | 81.6 | 0.165 | 0.467 | 2.83 |

-0.950 | 0.950 | 17.1 | 82.9 | 0.150 | 0.447 | 2.99 |

-1.000 | 1.000 | 15.9 | 84.1 | 0.135 | 0.427 | 3.15 |

-1.050 | 1.050 | 14.7 | 85.3 | 0.122 | 0.408 | 3.33 |

-1.100 | 1.100 | 13.6 | 86.4 | 0.111 | 0.390 | 3.52 |

-1.150 | 1.150 | 12.5 | 87.5 | 0.100 | 0.372 | 3.71 |

-1.200 | 1.200 | 11.5 | 88.5 | 0.091 | 0.356 | 3.92 |

-1.250 | 1.250 | 10.6 | 89.4 | 0.082 | 0.340 | 4.14 |

-1.300 | 1.300 | 9.7 | 90.3 | 0.074 | 0.325 | 4.37 |

-1.350 | 1.350 | 8.9 | 91.1 | 0.067 | 0.310 | 4.62 |

-1.400 | 1.400 | 8.1 | 91.9 | 0.061 | 0.297 | 4.88 |

-1.450 | 1.450 | 7.4 | 92.6 | 0.055 | 0.283 | 5.15 |

-1.500 | 1.500 | 6.7 | 93.3 | 0.050 | 0.271 | 5.44 |

-1.550 | 1.550 | 6.1 | 93.9 | 0.045 | 0.259 | 5.74 |

-1.600 | 1.600 | 5.5 | 94.5 | 0.041 | 0.247 | 6.07 |

-1.650 | 1.650 | 4.9 | 95.1 | 0.037 | 0.236 | 6.41 |

-1.700 | 1.700 | 4.5 | 95.5 | 0.033 | 0.226 | 6.77 |

-1.750 | 1.750 | 4.0 | 96.0 | 0.030 | 0.216 | 7.14 |

-1.800 | 1.800 | 3.6 | 96.4 | 0.027 | 0.206 | 7.55 |

-1.850 | 1.850 | 3.2 | 96.8 | 0.025 | 0.197 | 7.97 |

-1.900 | 1.900 | 2.9 | 97.1 | 0.022 | 0.188 | 8.42 |

-1.950 | 1.950 | 2.6 | 97.4 | 0.020 | 0.180 | 8.89 |

-2.000 | 2.000 | 2.3 | 97.7 | 0.018 | 0.172 | 9.39 |

-2.050 | 2.050 | 2.0 | 98.0 | 0.017 | 0.164 | 9.91 |

From these table it can be seen that predicted odds ratios are always larger than the true ones. The ratio of these odds ratios (the shrinkage factor) is larger, the more extreme comparisons are contemplated.

In summary, the use of the random effects models for the small cluster sizes (number of observations per surgeon) is likely to lead to estimates (or rather predictions) of individual effects that are smaller than their true values. Even though one should expect the differences to decrease with larger cluster sizes, this is unlikely to be observedin real world datasets (how often does one come across a surgeon who has performed 1000s of operation of the same type before they retire?). Furthermore, the comparison of surgeon performance based on predicted random effects is likely to be misleading due to over-shrinkage.

]]>

For these simulations we will examine three different ranges of dataset sizes for each surgeon:

- small data (complication data from between 20-100 cases/ surgeon are available)
- large data (complications from between 200-1000 cases/surgeon)
- extra large data (complications from between 1000-2000 cases/surgeon)

We simulated 200 surgeons (“random effects”) from a normal distribution with a mean of zero and a standard deviation of 0.26, while the population average complication rate was set t0 5%. These numbers were chosen to reflect the range of values (average and population standard deviation) of the random effects in the Score Card dataset, while the use of 200 random effects was a realistic compromise with the computational capabilities of the Asus Transformer T100 2 in 1 laptop/tablet that I used for these analyses.

The following code was used to simulate the logistic case for small data (the large and extra large cases were simulated by changing the values of the Nmin and Nmax variables).

library(lme4) library(mgcv) ## helper functions logit<-function(x) log(x/(1-x)) invlogit<-function(x) exp(x)/(1+exp(x)) ## simulate cases simcase<-function(N,p) rbinom(N,1,p) ## simulation scenario pall<-0.05; # global average Nsurgeon<-200; # number of surgeons Nmin<-20; # min number of surgeries per surgeon Nmax<-100; # max number of surgeries per surgeon ## simulate individual surgical performance ## how many simulations of each scenario set.seed(123465); # reproducibility ind<-rnorm(Nsurgeon,0,.26) ; # surgical random effects logitind<-logit(pall)+ind ; # convert to logits pind<-invlogit(logitind); # convert to probabilities Nsim<-sample(Nmin:Nmax,Nsurgeon,replace=TRUE); # number of cases per surgeon complications<-data.frame(ev=do.call(c,mapply(simcase,Nsim,pind,SIMPLIFY=TRUE)), id=do.call(c,mapply(function(i,N) rep(i,N),1:Nsurgeon,Nsim))) complications$id<-factor(complications$id)

A random effect and fixed effect model were fit to these data (the fixed effect model is simply a series of independent fits to the data for each random effect):

## Random Effects fit2<-glmer(ev~1+(1|id),data=complications,family=binomial,nAGQ=2) ran2<-ranef(fit2)[["id"]][,1] c2<-cor(ran2,ind) int2<-fixef(fit2) ranind2<-ran2+int2 ## Fixed Effects fixfit<-vector("numeric",Nsurgeon) for(i in 1:Nsurgeon) { fixfit[i]<-glm(ev~1,data=subset(complications,id==i),family="binomial")$coef[1] }

The corresponding Gaussian GLMM cases were simulated by making minor changes to these codes. These are shown below:

simcase<-function(N,p) rnorm(N,p,1) fit2<-glmer(ev~1+(1|id),data=complications,nAGQ=2) fixfit[i]<-glm(ev~1,data=subset(complications,id==i),family="gaussian")$coef[1]

The predicted random effects were assessed against the simulated truth by smoothing regression splines. In these regressions, the intercept yields the bias of the average of the predicted random effects vis-a-vis the truth, while the slope of the regression quantifies the amount of shrinkage effected by the mixed model formulation. For unbiased estimation not only would we like the intercept to be zero, but also the slope to be equal to one. In this case, the predicted random effect would be equal to its true (simulated) value. Excessive shrinkage would result in a slope that is substantially different from one. Assuming that the bias (intercept) is not different from zero, the relaxation of the slope towards one quantifies the consistency and the bias (or rather its rate of convergence) of these estimators using simulation techniques (or so it seems to me).

The use of smoothing (flexible), rather than simple linear regression, to quantify these relationships does away with a restrictive assumption: that the amount of shrinkage is the same throughout the range of the random effects:

## smoothing spline (flexible) fit fitg<-gam(ran2~s(ind) ## linear regression fitl<-lm(ran2~ind)

The following figure shows the results of the flexible regression (black with 95% CI, dashed black) v.s. the linear regression (red) and the expected (blue) line (intercept of zero, slope of one).

Several observations are worth noting in this figure.*
First*, the flexible regression was indistinguishable from a linear regression in all cases; hence the red and black lines overlap. Stated in other terms, the amount of shrinkage was the same across the range of the random effect values.

An interesting question is whether these observations (overshrinkage of random effects from small sample sizes in logistic mixed regression) reflects the use of random effects in modeling, or whether they are simply due to the interplay between sample size and the non-linearity of the statistical model. Hence, I turned to fixed effects modeling of the same datasets. The results of these analyses are summarized in the following figure:

One notes that the distribution of the differences between the random and fixed effects relative to the true (simulated) values is nearly identical for the linear case (second row). In other words, the use of the implicit constraint of the mixed model, offers no additional advantage when estimating individual performance in this model. On the other hand there is a value in applying mixed modeling techniques for the logistic regression case. In particular, outliers (such as those arising for small samples) are eliminated by the use of random effect modeling. The difference between the fixed and the random effect approach progressively decreases for large sample sizes, implying that the benefit of the latter approach is lost for “extra large” cluster sizes.

One way to put these differences into perspective is to realize that the random effects for the logistic model correspond to log-odd ratios, relative to the population mean. Hence the difference between the predicted random effect and its true value, when exponentiated, corresponds to an Odd Ratio (OR). A summary of the odds ratios over the population of the random effects as a function of cluster size is shown below.

Metric 20-100 200-1000 1000-2000 Min 0.5082 0.6665 0.7938 Q25 0.8901 0.9323 0.9536 Median 1.0330 1.0420 1.0190 Mean 1.0530 1.0410 1.0300 Q75 1.1740 1.1340 1.1000 Max 1.8390 1.5910 1.3160

Even though the average Odds Ratio is close to 1, a substantial number of predicted random effects are far from the true value and yield ORs that are greater than 11% in either direction for small cluster sizes. **These observations have implications for the Score Card (or similar projects): if one were to use Random Effects modeling to focus on individuals, then unless the cluster sizes (observations per individual) are substantial, one would run a substantial risk of misclassifying individuals, even though one would be right on average!**

One could wonder whether these differences between the simulated truth and the predicted random effects arise as a result of the numerical algorithms of the *lme4* package. The latter was used by both the Surgeon Score Card project and our simulations so far and thus it would be important to verify that it performs up to specs. The major tuning variable for the algorithm is the order of the Adaptive Gaussian Quadrature (argument nAGQ). We did not find any substantial departures when the order of the quadrature was varied from 0 to 1 and 2. However, there is a possibility that the algorithm fails for all AGQ orders as it has to calculate probabilities that are numerically close to the boundary of the parameter space. We thus decided to fit the same model from a Bayesian perspective using Markov Chain Monte Carlo (MCMC) methods. The following code will fit the Bayesian model and graphs the true values of the effects used in the simulated dataset against the Bayesian estimates (the posterior mean) and also the *lme4* predictions. The latter tend to be numerically close to the posterior mode of the random effects when a Bayesian perspective is adopted.

## Fit the mixed effects logistic model from R using openbugs library("glmmBUGS") library(nlme) fitBUGS = glmmBUGS(ev ~ 1, data=complications, effects="id", family="bernoulli") startingValues = fitBUGS$startingValues source("getInits.R") require(R2WinBUGS) fitBUGSResult = bugs(fitBUGS$ragged, getInits, parameters.to.save = names(getInits()), model.file="model.txt", n.chain=3, n.iter=6000, n.burnin=2000, n.thin=10, program="openbugs", working.directory=getwd()) fitBUGSParams = restoreParams(fitBUGSResult , fitBUGS$ragged) sumBUGS<-summaryChain(fitBUGSParams ) checkChain(fitBUGSParams ) ## extract random effects cnames<-as.character(sort(as.numeric(row.names(sumBUGS$FittedRid)))) fitBUGSRE<-sumBUGS$Rid[cnames,1] ## plot against the simulated (true) effects and the lme4 estimates hist(ind,xlab="RE",ylim=c(0,3.8),freq=FALSE,main="") lines(density(fitBUGSRE),main="Bayesian",xlab="RE",col="blue") lines(density(ran2),col="red") legend(legend=c("Truth","lme4","MCMC"),col=c("black","red","blue"), bty="n",x=0.2,y=3,lwd=1)

The following figure shows the histogram of the true values of the random effects (black), the frequentist(lme4) estimates (red) and the Bayesian posterior means (blue).

It can be appreciated that both the Bayesian estimates and the lme4 predictions demonstrate considerable shrinkage relative to the true values for small cluster sizes (20-100). Hence, an lme4 numerical quirk seems an unlikely explanation for the shrinkage observed in the simulation.

**Summing up**:

- Random Effect modeling of binomial outcomes require a substantial number of observations per individual (cluster size) for the procedure to yield estimates of individual effects that are numerically indistinguishable from the true values
- Fixed effect modeling is even worse an approach for this problem
- Bayesian fitting procedures do not appear to yield numerically different effects from their frequentist counterparts

*These features should raise the barrier for accepting a random effects logistic modeling approach when the focus is on individual rather than population average effects*. Even though the procedure is certainly preferable to fixed effects regression, the direct use of the value of the predicted individual random effects as an effect measure will be problematic for small cluster sizes (e.g. a small number of procedures per surgeon). In particular, a substantial proportion of these estimated effects is likely to be far from the truth even if the model is unbiased on the average. These observations are of direct relevance to the Surgical Score Card in which the number of observations per surgeon were far lower than the average value in our simulations: 60 (small), 600 (large) and 1500 (extra large):

Procedure Code |
N (procedures) |
N(surgeons) |
Procedures per surgeon |

51.23 | 201,351 | 21,479 | 9.37 |

60.5 | 78,763 | 5,093 | 15.46 |

60.29 | 73,752 | 7,898 | 9.34 |

81.02 | 52,972 | 5,624 | 9.42 |

81.07 | 106,689 | 6,214 | 17.17 |

81.08 | 102,716 | 6,136 | 16.74 |

81.51 | 494,576 | 13,414 | 36.87 |

81.54 | 1,190,631 | 18,029 | 66.04 |

Total |
2,301,450 | 83,887 | 27.44 |

]]>

http://sites.stat.psu.edu/~dhunter/525/weekly/weibull.pdf

http://psfaculty.ucdavis.edu/bsjjones/slide3_parm.pdf

So be aware

]]>

Check out @freakonometrics’s Tweet: https://twitter.com/freakonometrics/status/602304149049495552?s=09

]]>

Stratification is one of the ways that one may address the violation of the proportionality assumption for a *categorical* covariate in the Cox model. The stratified Cox model resolves the overall hazard in the study as:

In the logarithmic scale, the multiplicative model for the stratified baseline hazard becomes an additive one. In particular, the specification of a different baseline hazard for the different levels of a factor amounts to specifying an interaction between the factor and the smooth baseline hazard in the PGAM.

We turn to the PBC dataset to provide an example of a stratified analysis with either the Cox model or the PGAM. In that dataset the covariate edema is a categorical variable assuming the values of 0 (no edema), 0.5 (untreated or successfully treated) and 1(edema despite treatment). An analysis of the Schoenfeld residual test shows that this covariate violates the proportionality assumption

> f<-coxph(Surv(time,status)~trt+age+sex+factor(edema),data=pbc) > Schoen<-cox.zph(f) > Schoen rho chisq p trt -0.089207 1.12e+00 0.2892 age -0.000198 4.72e-06 0.9983 sexf -0.075377 7.24e-01 0.3950 factor(edema)0.5 -0.202522 5.39e+00 0.0203 factor(edema)1 -0.132244 1.93e+00 0.1651 GLOBAL NA 8.31e+00 0.1400 >

To fit a stratified GAM model, we should transform the dataset to include additional variables, one for each level of the edema covariate. To make the PGAM directly comparable to the stratified Cox model, we have to fit the former without an intercept term. This requires that we include additional dummy variables for any categorical covariates that we would to adjust our model for. In this particular case, the only other additional covariate is the female gender:

pbcGAM<-transform(pbcGAM,edema0=as.numeric(edema==0), edema05=as.numeric(edema==0.5),edema1=as.numeric(edema==1), sexf=as.numeric(sex=="f"))

Then the stratifed Cox and PGAM models are fit as:

fGAM<-gam(gam.ev~s(stop,bs="cr",by=edema0)+s(stop,bs="cr",by=edema05)+ s(stop,bs="cr",by=edema1)+trt+age+sexf+offset(log(gam.dur))-1, data=pbcGAM,family="poisson",scale=1,method="REML") fs<-coxph(Surv(time,status)~trt+age+sex+strata(edema),data=pbc)

In general the values of covariates of the stratified Cox and the PGAM models are similar with the exception of the *trt* variable. However the standard error of this variable estimated by either model is so large, that the estimates are statistically no different from zero, despite their difference in magnitude

> fs Call: coxph(formula = Surv(time, status) ~ trt + age + sex + strata(edema), data = pbc) coef exp(coef) se(coef) z p trt 0.0336 1.034 0.18724 0.18 0.86000 age 0.0322 1.033 0.00923 3.49 0.00048 sexf -0.3067 0.736 0.24314 -1.26 0.21000 Likelihood ratio test=15.8 on 3 df, p=0.00126 n= 312, number of events= 125 (106 observations deleted due to missingness) > summary(fGAM) Family: poisson Link function: log Formula: gam.ev ~ s(stop, bs = "cr", by = edema0) + s(stop, bs = "cr", by = edema05) + s(stop, bs = "cr", by = edema1) + trt + age + sexf + offset(log(gam.dur)) - 1 Parametric coefficients: Estimate Std. Error z value Pr(>|z|) trt 0.002396 0.187104 0.013 0.989782 age 0.033280 0.009170 3.629 0.000284 *** sexf -0.297481 0.240578 -1.237 0.216262 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df Chi.sq p-value s(stop):edema0 2.001 2.003 242.0 <2e-16 *** s(stop):edema05 2.001 2.001 166.3 <2e-16 *** s(stop):edema1 2.000 2.001 124.4 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = -0.146 Deviance explained = -78.4% REML score = 843.96 Scale est. = 1 n = 3120</pre>

]]>