Archive for the ‘Evidence Based Medicine’ Category

Estimating the mean and standard deviation from the median and the range

December 3, 2015

While preparing the data for a meta-analysis, I run into the problem that a few of my sources did not report the outcome of interest as means and standard deviations, but rather as medians and range of values. After looking around, I found this interesting paper which derived (and validated through simple simulations), simple formulas that can be used to convert the median/range into a mean and a variance in a distribution free fashion.  With

  • a = min of the data
  • b = max of the data
  • m = median
  • n = size of the sample

the formulas are as follows:

Mean  \bar{m} = \frac{a+2 m+b}{4} +\frac{a-2 m+b}{4 n}

Variance  \frac{1}{n-1} \Big(a^2+m^2+b^2+\frac{n-3}{2} \frac{(a+m)^2+(b+m)^2}{4}-n \bar{m} \Big)

 

The following R function will carry out these calculations

f<-function(a,m,b,n)
{
mn<-(a+2*m+b)/4+(a-2*m+b)/(4*n)
s=sqrt((a*a+m*m+b*b+(n-3)*((a+m)^2+(m+b)^2)/8-n*mn*mn)/(n-1))
c(mn,s)
}

Edit

Surfing around arxiv, I found another paper that handles additional scenarios and proposes alternative formulas

Advertisements

Summing up my criticism against Propublica’s Scorecard

August 26, 2015

The bottom-line: The major justification for my substantiated (or at least so I think) rant against the use of shrinkage in this case is the elephant in the room that no-one wants to talk about: the substantially limited information (avg number of complications per individual < 50) for otherwise infrequent events (avg complication rate <5%). In such a situation, shrinkage will make everyone look like everyone else, limiting the ability to draw meaningful complications by looking at the values of the random effects. In fact, PP had to rely on post hock classifications of the good, the bad and the ugly to overcome the unavoidable shrinkage towards the mean and overcome the lack of information that could distinguish one surgeon from another. Similar points (with more examples from the actual score card) were made by Ewen Harrison.

(You may stop reading now – or you may read past the rant in the next paragraph)

Rant Mode On: Another criticism that I received, is that I fail to understand random effects modeling, which I personally thought it was funny because one of my recent papers actually says one should scrap the Cox survival model for generalized additive models (which are just generalized linear mixed models in disguise).  In any case, assuming that my understanding of mixed models is poor, maybe my conclusion that these models are to be preferred for such applications is also problematic?

Since one may consider my vote-of-confidence-to-the-almighty-mix-model v.s. my criticism of their application in the Scorecard project, subtle evidence for paranoid schizophrenia let me sum up how one can simultaneously hold these beliefs:

Even though mixed models are to be preferred especially as more data accumulate(a point I make clear in the 3 blog posts I wrote), no modeling could overcome the severe lack of events in the scorecard database.

There are other technical issues that one could “rant” about e.g. how were the hospital and surgeon effects were combined, but I will not digress any further.

Rant mode off

Back to criticism: In my opinion a major reason that mixed models were used is the potentially large number of surgeons with zero complications in their limited observation records*. The use of shrinkage models allow one to report performance for such surgeons, and generate nice colorful graphs (with one and possibly two decimal points) about their performance instead of reporting a NaN (Not A Number). Incidentally, the shrinkage of the individual surgeon effects all the way towards the population mean is the mixed model’s way of telling you the same thing: since it could not estimate how these people actually performed, it might as well give back the only answer consistent with the model applied, i.e. everybody is (almost) like everybody else and thus everyone is average.

*Curiously this information, i.e. the number of surgeons without any complication in the source database is not provided, although I’d have thought it would be important to report this number in the shake of transparency.

Word of caution: For surgeons without any complications in the database, the actual information comes not from their complications but from the number of uncomplicated surgeries they performed. This raises the question of inclusion effects (a point Andrew Gelman thoroughly address in his book) and the associated selection biases inherent in comparing surgeons who accept v.s. those who do not accept Medicare (or accept too little of it) and the corresponding social determinants of health outcomes (beautifully explained here).

A non-statistical criticism: As a nephrologist, I find it amusingly insulting that Acute Kidney Injury (coded as Acute Renal Failure) was not considered a surgical complication, even though it certainly complicates surgeries. Surgery is in fact the leading cause of “hospital-acquired” AKI (40% of cases are preceded by a surgical procedure). But maybe, as Benjamin Davies pointed out, the real end-points are not really measured by the Scorecard. Or I could be just delusional when I point out to surgical colleagues that the selection of analgesia, fluid management and pre as well as post perioperative care DO play a role for some complications. Or it could be just the anesthesiologists’s fault 🙂

 

The little mixed model that could, but shouldn’t be used to score surgical performance

July 30, 2015

The Surgeon Scorecard

Two weeks ago, the world of medical journalism was rocked by the public release of ProPublica’s Surgeon Scorecard. In this project ProPublica “calculated death and complication rates for surgeons performing one of eight elective procedures in Medicare, carefully adjusting for differences in patient health, age and hospital quality.”  By making the dataset available through a user friendly portal, the intended use of this public resource was to “use this database to know more about a surgeon before your operation“.

The Scorecard was met with great enthusiasm and coverage by non-medical media. TODAY.com headline “nutshelled” the Scorecard as a resource that “aims to help you find doctors with lowest complication rates“. A (?tongue in cheek) NBC headline tells us the scorecard “It’s complicated“. On the other hand the project was not well received by my medical colleagues. John Mandrola gave it a failing grade in Medscape. Writing at KevinMD.com, Jeffrey Parks called it a journalistic low point for ProPublica. Jha Shaurabh pointed out a potential paradox  in a statistically informed, easy to read and highly entertaining piece. In this paradox, the surgeon with the higher complication case who takes high risk patients from a disadvantaged socio-economic environment, may actually be the surgeon one wants to perform one’s surgery! Ed Schloss summarized the criticism (in the blogosphere and twitter) in an open letter and asked for peer review of the Scorecard methodology.

The criticism to date has largely focused on the potential for selection effects (as the Scorecard is based on Medicare data, and does not include data from private insurers), the incomplete adjustment for confounders, the paucity of data for individual surgeons, the counting of complications and re-admission rates, decisions about risk category classification boundaries and even data errors (ProPublica’s response arguing that the Scorecard matters may be found here). With a few exceptions (e.g. see Schloss’s blogpost in which the complexity of the statistical approach is mentioned) the criticism of the statistical approach (including my own comments in twitter) has largely focused on these issues.

On the other hand, the underlying statistical methodology (here and there) that powers the Scorecard has not received much attention. Therefore I undertook a series of simulation experiments to explore the impact of the statistical procedures on the inferences afforded by the Scorecard.

The mixed model that could – a short non-technical summary of ProPublica’s approah

ProPublica’s approach to the scorecard is based on logistic regression model, in which individual surgeon (and hospital) performance (probability of suffering a complication) is modelled using Gaussian random effects, while patient level characteristics that may act as confounders are adjusted for, using fixed effects. In a nutshell this approach implies fitting a model of the average complication rate that is function of the fixed effects (e.g. patient age) for the entire universe of surgeries  performed in the USA. Individual surgeon and hospital factors modify this complication rate, so that a given surgeon and hospital will have an individual  rate that varies around the population average. These individual surgeon and hospital factors are constrained to follow Gaussian, bell-shaped distribution when analyzing complication data. After model fitting, these predicted random effects are used to quantify and compare surgical performance. A feature of mixed modeling approaches is the unavoidable shrinkage of the raw complication rate towards the population mean. Shrinkage implies that the dynamic range of the  actually observed complication rates is compressed. This is readily appreciated in the figures generated by the ProPublica analytical team:

Capture

In their methodological white paper the ProPublica team notes:

While raw rates ranged from 0.0 to 29%, the Adjusted Complication Rate goes from 1.1 to 5.7%. …. shrinkage is largely a result of modeling in the first place, not due to adjusting for case mix. This shrinkage is another piece of the measured approach we are taking: we are taking care not to unfairly characterize surgeons and hospitals.”

These features should alert us that something is going on. For if a model can distort the data to such a large extent, then the model should be closely scrutinized before being accepted. In fact, given these observations,  it is possible that one mistakes the noise from the model for the information hidden in the empirical data. Or, even more likely, that one is not using the model in the most productive manner.

Note that these comments  should not be interpreted as a criticism against the use of mixed models in general, or even for the particular aspects of the Scorecard project. They are rather a call for re-examining the modeling assumptions and for gaining a better understanding of the model “mechanics of prediction” before releasing the Kraken to the world.

The little mixed model that shouldn’t

There are many technical aspects one could potentially misfire in a Generalized Linear Mixed Model for complication rates. Getting the wrong shape of the random effects distribution is of may or may not be of concern (e.g. assuming it is bell shaped when it is not). Getting the underlying model wrong, e.g. assuming the binomial model for complication rates while a model with many more zeros (a zero inflated model) may be more appropriate, is yet another potential problem area. However, even if  these factors are not operational, then one may still be misled when using the results of the model. In particular, the major area of concern for such models is the cluster size: the number of observations per individual random effect (e.g. surgeon) in the dataset. It is this factor, rather than the actual size of the dataset that determines the precision in the individual random affects. Using a toy example, we show that the number of observations per surgeon typical of the Scorecard dataset, leads to predicted  random effects that may be far from their true value. This seems to stem from the non-linear nature of the logistic regression model. As we conclude in our first technical post:

  • Random Effect modeling of binomial outcomes require a substantial number of observations per individual (in the order of thousands) for the procedure to yield estimates of individual effects that are numerically indistinguishable  from the true values.

 

Contrast this conclusion to the cluster size in the actual scorecard:

Procedure Code N (procedures) N(surgeons) Procedures per surgeon
51.23 201,351 21,479 9.37
60.5 78,763 5,093 15.46
60.29 73,752 7,898 9.34
81.02 52,972 5,624 9.42
81.07 106,689 6,214 17.17
81.08 102,716 6,136 16.74
81.51 494,576 13,414 36.87
81.54 1,190,631 18,029 66.04
Total 2,301,450 83,887 27.44

In a follow up simulation study we demonstrate that this feature results in predicted individual effects that are non-uniformly shrank towards their average value. This compromises the ability of mixed model predictions to separate the good from the bad “apples”.

In the second technical post, we undertook a simulation study to understand the implications of over-shrinkage for the Scorecard project. These are best understood by a numerical example from one of simulated datasets. To understand this example one should note that the individual random effects have the interpretation of (log-) odds ratios. Hence, the difference in these random effects when exponentiated yield the odds ratio of suffering a complication in the hands of a good relative to a bad surgeon. By comparing these random effects for good and bad surgeons who are equally bad (or good) relative to the mean (symmetric quantiles around the median), one can get an idea of the impact of using the predicted random effects to carry out individual comparisons.

Good Bad Quantile (Good) Quantile (Bad) True OR Pred OR Shrinkage Factor
-0.050 0.050 48.0 52.0 0.905 0.959 1.06
-0.100 0.100 46.0 54.0 0.819 0.920 1.12
-0.150 0.150 44.0 56.0 0.741 0.883 1.19
-0.200 0.200 42.1 57.9 0.670 0.847 1.26
-0.250 0.250 40.1 59.9 0.607 0.813 1.34
-0.300 0.300 38.2 61.8 0.549 0.780 1.42
-0.350 0.350 36.3 63.7 0.497 0.749 1.51
-0.400 0.400 34.5 65.5 0.449 0.719 1.60
-0.450 0.450 32.6 67.4 0.407 0.690 1.70
-0.500 0.500 30.9 69.1 0.368 0.662 1.80
-0.550 0.550 29.1 70.9 0.333 0.635 1.91
-0.600 0.600 27.4 72.6 0.301 0.609 2.02
-0.650 0.650 25.8 74.2 0.273 0.583 2.14
-0.700 0.700 24.2 75.8 0.247 0.558 2.26
-0.750 0.750 22.7 77.3 0.223 0.534 2.39
-0.800 0.800 21.2 78.8 0.202 0.511 2.53
-0.850 0.850 19.8 80.2 0.183 0.489 2.68
-0.900 0.900 18.4 81.6 0.165 0.467 2.83
-0.950 0.950 17.1 82.9 0.150 0.447 2.99
-1.000 1.000 15.9 84.1 0.135 0.427 3.15
-1.050 1.050 14.7 85.3 0.122 0.408 3.33
-1.100 1.100 13.6 86.4 0.111 0.390 3.52
-1.150 1.150 12.5 87.5 0.100 0.372 3.71
-1.200 1.200 11.5 88.5 0.091 0.356 3.92
-1.250 1.250 10.6 89.4 0.082 0.340 4.14
-1.300 1.300 9.7 90.3 0.074 0.325 4.37
-1.350 1.350 8.9 91.1 0.067 0.310 4.62
-1.400 1.400 8.1 91.9 0.061 0.297 4.88
-1.450 1.450 7.4 92.6 0.055 0.283 5.15
-1.500 1.500 6.7 93.3 0.050 0.271 5.44
-1.550 1.550 6.1 93.9 0.045 0.259 5.74
-1.600 1.600 5.5 94.5 0.041 0.247 6.07
-1.650 1.650 4.9 95.1 0.037 0.236 6.41
-1.700 1.700 4.5 95.5 0.033 0.226 6.77
-1.750 1.750 4.0 96.0 0.030 0.216 7.14
-1.800 1.800 3.6 96.4 0.027 0.206 7.55
-1.850 1.850 3.2 96.8 0.025 0.197 7.97
-1.900 1.900 2.9 97.1 0.022 0.188 8.42
-1.950 1.950 2.6 97.4 0.020 0.180 8.89
-2.000 2.000 2.3 97.7 0.018 0.172 9.39
-2.050 2.050 2.0 98.0 0.017 0.164 9.91

From this table it can be seen that predicted odds ratios are always larger than the true ones. The ratio of these odds ratios (the shrinkage factor) is larger, the more extreme comparisons are contemplated.

In summary, the use of the random effects models for the small cluster sizes (number of observations per surgeon) is likely to lead to estimates (or rather predictions) of individual effects that are smaller than their true values. Even though one should expect the differences to decrease with larger cluster sizes, this is unlikely to happen in real world datasets (how often does one come across a surgeon who has performed 1000s of operation of the same type before they retire?). Hence, the comparison of  surgeon performance based on these random effect predictions is likely to be misleading due to over-shrinkage.

Where to go from here?

ProPublica should be congratulated for taking up such an ambitious, and ultimately useful project. However, the limitations of the adopted approach should make one very skeptical about accepting the inferences from their modeling tool. In particular, the small number of observations per surgeon limits the utility of the predicted random effects to directly compare surgeons due to over-shrinkage. Further studies are required before one could use the results of mixed effects modeling for this application. Based on some limited simulation experiments (that we do not present here), it seems that relative rankings of surgeons may be robust measures of surgical performance, at least compared to the absolute rates used by the Scorecard. Adding my voice to that of Dr Schloss, I think it is time for an open and transparent dialogue (and possibly a “crowdsourced” research project) to better define the best measure of surgical performance given the limitations of the available data. Such a project could also explore other directions, e.g. the explicit handling of zero inflation and even go beyond the ubiquitous bell shaped curve. By making the R code available, I hope that someone (possibly ProPublica) who can access more powerful computational resources can perform more extensive simulations. These may better define other aspects of the modeling approach and suggest improvements in the scorecard methodology. In the meantime, it is probably a good idea not to exclusively rely on the numerical measures of the scorecard when picking up the surgeon who will perform your next surgery.

 

 

Shrinkage of predicted random effects in logistic regression models

July 30, 2015

As a follow-up of our initial investigation of the bias of random effect predictions in generalized linear mixed regression models, I undertook a limited simulation experiment. In this experiment, we varied the population average complication rate and the associated standard deviation of the individual random effect and generated a small number of replications (N=200) for each combination of population mean and standard deviation. Limited by the computational resources of (the otherwise wonderful!) tablet, I had to forego a more thorough and more precise (larger number of replicates) examination of bias, variance and coverage. The focus is on general patterns rather than precise estimates; by making the code public, these can be obtained by anyone who has access to the R computing environment.

The basic code that simulates  from these logistic regression models is shown below. Although we only examine small cluster sizes (number of observations per cluster 20-100) and a moderate number of individual random effects (200) it is straightforward to modify these aspects of the simulation study. For these simulations we examined a limited number of values for the standard deviation of the random effects (0.1, 0.2 and 0.5) and overall complication rates (0.05, 0.10, 0.15, 0.20) to reflect the potential range of values compatible with the raw data in the Surgeon Scorecard.

library(lme4)
library(mgcv)
## helper functions
logit<-function(x) log(x/(1-x))
invlogit<-function(x) exp(x)/(1+exp(x))

## code allows a sensitivity analysis to the order of the nAGQ
## explanation of symbols/functions appearing inside the
## simulation function
## fit: glmer fit using a X point nAGQ integration rule
## ran: random effects using a X point nAGQ integration rule
##   cR: correlation between fitted and simulated random effect
##       the fitted were obtained with a X point nAGQ integrator
## int: intercept obtained with a X point nAGQ integration rule
## biG: bias of the fitted v.s. the simulated random effects
##      this is obtained as the intercept of a GAM regression of
##      the fitted versus the simulated values
## bi0: bias of the intercept from the glmer fit
## sdR: estimated standard deviation of the random eff distro
## bs0: bias of the standard deviation of the random effects (SDR)
## edf: non-linearity of the slope of a GAM regression of the
##      fitted v.s. the simulated residuals
## slQ: derivative of the non-linear slope at the Qth quantile
simFit<-function(X,complications,ind,pall,SDR,Q=0:10/10) {
fit<-glmer(ev~1+(1|id),data=complications,family=binomial,nAGQ=X)
ran<-ranef(fit)[["id"]][,1] ## predicted (mode) of random effects
cR<-cor(ran,ind)
int<-fixef(fit)
## regress on the residuals to assess performance
fitg<-gam(ran~s(ind,bs="ts")) ## gam fit
edf<-sum(fitg$edf[-1])## edf of the non-linear "slope"
biG<-fitg$coef[1]
bi0<-int-logit(pall)
sdR<-sqrt(as.data.frame(VarCorr(fit))$vcov)
bs0<-sdR-SDR
sQ<-quantile(ind,prob=Q)
slQ<-(predict(fitg,newdata=data.frame(ind=sQ+0.0001),type="terms")[,1]-
predict(fitg,newdata=data.frame(ind=sQ),type="terms")[,1])/0.0001
names(slQ)<-paste("slQ",Q,sep="")
ret<-cbind(int=int,edf=edf,biG=biG,bi0=bi0,bs0=bs0,sdR=sdR,cR=cR,t(slQ),pall=pall,SDR=SDR)
}

## function to simulate cases
simcase<-function(N,p) rbinom(N,1,p)
## simulation scenarios: fixed for each simulation
Nsurgeon<-200; # number of surgeons
Nmin<-20; # min number of surgeries per surgeon
Nmax<-100; # max number of surgeries per surgeon
Nsim<-sample(Nmin:Nmax,Nsurgeon,replace=TRUE); # number of cases per surgeon

## simulate individual surgical performance
## the reality consists of different combos of pall and the sd of the
## random effect

Nscenariosim<-200 ## number of simulations per scenario
scenarios<-expand.grid(iter=1:Nscenariosim,pall=seq(0.05,.20,.05),
sd=c(0.1,0.2,.5))

## simulate indivindual scenarios
simScenario<-function(seed,pall,sd,X,Q)
{
set.seed(seed)
ind<-rnorm(Nsurgeon,0,sd) ; # surgical random effects
logitind<-logit(pall)+ind ; # convert to logits
pind<-invlogit(logitind); # convert to probabilities

complications<-data.frame(ev=do.call(c,mapply(simcase,Nsim,pind,SIMPLIFY=TRUE)),
id=factor(do.call(c,mapply(function(i,N) rep(i,N),1:Nsurgeon,Nsim))))
simFit(X,complications,ind,pall,sd,Q)
}
X<-0
Q=0:10/10

system.time(perf<-mapply(simScenario,scenarios$iter,scenarios$pall,scenarios$sd,
MoreArgs=list(X=X,Q=Q),SIMPLIFY=FALSE))

perf<-do.call(rbind,perf)

The datasets are generated by the function simScenario, which when applied over all combinations of population mean(pall, treated as a fixed effect) and random effect standard deviation (sd) generates the synthetic dataset (‘complications’ variable). Once the data have been generated the function simFit receives the synthetic dataset, fits the mixed logistic regression model and generates random effect predictions.

To assess and quantify shrinkage in these regression models, we compare the predicted random effects against the simulated values. Due to the potentially large number of these effects (10s of thousands in the Surgeon Score Card), we developed a methodology that looks for patterns in these bivariate relationships, taking into account the magnitude of each simulated and predicted random effect pair. This methodology rests on flexible regression between the predicted against the simulated random effect in each of the synthetic datasets. This flexible regression, using smoothing splines, generalizes linear regression and thus avoids the assumption that the shrinkage (as assessed by the slope of the linear regression line) is the same irrespective of the value of the simulated random effect. By numerically differentiating the spline at various quantiles of the simulated random effect, we thus have an estimate of shrinkage. In particular, this estimate gives the change in value of the predicted random effect for a unit change in the simulated one. The relevant code that fits the smoothing spline and differentiates it in (user defined) grid of random effects quantiles is shown below:

</pre>
<pre>fitg<-gam(ran~s(ind,bs="ts")) ## gam fit
edf<-sum(fitg$edf[-1])## edf of the non-linear "slope"
biG<-fitg$coef[1]
bi0<-int-logit(pall)
sdR<-sqrt(as.data.frame(VarCorr(fit))$vcov)
bs0<-sdR-SDR
sQ<-quantile(ind,prob=Q)
slQ<-(predict(fitg,newdata=data.frame(ind=sQ+0.0001),type="terms")[,1]-
predict(fitg,newdata=data.frame(ind=sQ),type="terms")[,1])/0.0001
names(slQ)<-paste("slQ",Q,sep="")

A representative smoothing spline analysis of a mixed logistic regression model is shown in the figure below, along with the corresponding linear fit, and the line signifying zero shrinkage: a straight line passing from the origin with a unit slope. If the latter relationship is observed, then the predicted random effects are unbiased with respect to the simulated ones.

 

Predicted (y-axis) v.s. simulated (x-axis) random effects in a logistic regression model. 2000 random effects were simulated with small cluster sizes (20-100). Shown are the flexible spline fits (green), linear regression fit (red) and the line of no shrinkage (blue)

This figure shows the potential for random effect predictions to be excessively shrunken relative to the simulated ones (note that the blue line is more horizontal relative to the blue one). One should also observe that the amount of shrinkage is not the same throughout the range of the random effects: positive values (corresponding to individuals with higher complication rates) are shrank relatively less (the green line is closer to the blue line), relative to negative random effects (complication rates less than the population average). The net effect of this non-linear relationship is that the separation between “good” (negative RE) and “bad” (positive values of the RE) is decreased in the predicted relative to the simulated random effects. This can be a rather drastic compression in dynamic range for small cluster sizes as we will see below. 

From each mixed logistic regression model we obtain a large number of information : i) the bias in the fixed effect estimate, ii) the bias in the population standard deviation, iii) the bias in the overall relationship between predicted and simulated residuals (indirectly assessing whether the random effects are centered around the fixed effect) iv) the non-linearity of the relationship between the predicted and the simulated random effects (this is given by the estimated degrees of freedom of the spline, with an edf = 1 signifying a linear fit and higher degrees of freedom non linear relationships; edfs that are less than one signify a more complex pattern of shrinkage of a variable proportion of the random effects to a single point, i.e. the fixed effect) v) the linear slope

There was no obvious bias in the estimation of the intercept (the only fixed effect in our simulation study) for various combinations of overall event rate and random effect standard deviation (but note higher dispersion for large standard deviation values, indicating the need for a higher number of simulation replicates in order to improve precision):

Bias in the estimation of the intercept in the mixed logistic regression model (shown in log scale)

Bias in the estimation of the intercept in the mixed logistic regression model (shown in log scale)

 

Similarly there was no bias in the estimation of the population standard deviation of the random effects:

Bias in the standard deviation of the random effects

Bias in the standard deviation of the random effects

The estimated relationship between predicted and simulated residuals was in linear for approximately 50% of simulated samples for small to moderate standard deviations. However it was non-linear (implying differential shrinkage according to random effect magnitude) for larger standard deviations, typical of the standard deviation of the surgeon random effect in the Surgeon Score Card.

Estimated degrees of freedom (a measure of non-linearity) in the relationship between predicted and simulated random effects over simulation replicates.

Estimated degrees of freedom (a measure of non-linearity) in the relationship between predicted and simulated random effects over simulation replicates. The red horizontal line corresponds to a linear relationship between predicted and simulated random effects (edf=1).Edf<1 signify the shrinkage of a variable proportion of random effects to zero

 

The magnitude of the slopes of the smoothing splines at different quartiles of the random effects distribution and for combinations of overall rate and random effect standard deviation are shown below:

Slope of the smoothing spline at different quantiles (20%, 50%, 80%) of the random effect distribution. Boxplots are over 200 replicates of each combination of overall rate and random effect standard deviation

Slope of the smoothing spline at different quantiles (20%, 50%, 80%) of the random effect distribution. Boxplots are over 200 replicates of each combination of overall rate and random effect standard deviation

There are several things to note in this figure:

  1. the amount of shrinkage decreases as the population standard deviation increases (going from left to right in each row of the figure the slopes increase from zero towards one)
  2. the amount of shrinkage decreases as the overall average increases for the same value of the standard deviation (going from top to bottom in the same column)
  3. the degree of shrinkage appears to be the same across the range of random effect quantiles for small to moderate values of the population standard deviation
  4. the discrepancy between the degree of shrinkage is maximized for larger values of the standard deviation of the random effects and small overall rates (top right subpanel). This is the situation that is more relevant for the Surgeon Score Card based on the analyses reported by this project.

What are the practical implications of these observations for the individual surgeon comparison reported in the Score Card project? These are best understood by a numerical example from one of these 200 hundred datasets shown in the top right subpanel. To understand this example one should note that the individual random effects have the interpretation of (log-) odds ratios, irrespective of whether they are predicted or (the simulated) true effects. Hence, the difference in these random effects when exponentiated yield the odds ratio of suffering a complication in the hands of a good relative to a bad surgeon. By comparing these random effects for good and bad surgeons who are equally bad (or good) relative to the mean (symmetric quantiles around the median), one can get an idea of the impact of using the predicted random effects to carry out individual comparisons.

Good Bad Quantile (Good) Quantile (Bad) True OR Pred OR Shrinkage Factor
-0.050 0.050 48.0 52.0 0.905 0.959 1.06
-0.100 0.100 46.0 54.0 0.819 0.920 1.12
-0.150 0.150 44.0 56.0 0.741 0.883 1.19
-0.200 0.200 42.1 57.9 0.670 0.847 1.26
-0.250 0.250 40.1 59.9 0.607 0.813 1.34
-0.300 0.300 38.2 61.8 0.549 0.780 1.42
-0.350 0.350 36.3 63.7 0.497 0.749 1.51
-0.400 0.400 34.5 65.5 0.449 0.719 1.60
-0.450 0.450 32.6 67.4 0.407 0.690 1.70
-0.500 0.500 30.9 69.1 0.368 0.662 1.80
-0.550 0.550 29.1 70.9 0.333 0.635 1.91
-0.600 0.600 27.4 72.6 0.301 0.609 2.02
-0.650 0.650 25.8 74.2 0.273 0.583 2.14
-0.700 0.700 24.2 75.8 0.247 0.558 2.26
-0.750 0.750 22.7 77.3 0.223 0.534 2.39
-0.800 0.800 21.2 78.8 0.202 0.511 2.53
-0.850 0.850 19.8 80.2 0.183 0.489 2.68
-0.900 0.900 18.4 81.6 0.165 0.467 2.83
-0.950 0.950 17.1 82.9 0.150 0.447 2.99
-1.000 1.000 15.9 84.1 0.135 0.427 3.15
-1.050 1.050 14.7 85.3 0.122 0.408 3.33
-1.100 1.100 13.6 86.4 0.111 0.390 3.52
-1.150 1.150 12.5 87.5 0.100 0.372 3.71
-1.200 1.200 11.5 88.5 0.091 0.356 3.92
-1.250 1.250 10.6 89.4 0.082 0.340 4.14
-1.300 1.300 9.7 90.3 0.074 0.325 4.37
-1.350 1.350 8.9 91.1 0.067 0.310 4.62
-1.400 1.400 8.1 91.9 0.061 0.297 4.88
-1.450 1.450 7.4 92.6 0.055 0.283 5.15
-1.500 1.500 6.7 93.3 0.050 0.271 5.44
-1.550 1.550 6.1 93.9 0.045 0.259 5.74
-1.600 1.600 5.5 94.5 0.041 0.247 6.07
-1.650 1.650 4.9 95.1 0.037 0.236 6.41
-1.700 1.700 4.5 95.5 0.033 0.226 6.77
-1.750 1.750 4.0 96.0 0.030 0.216 7.14
-1.800 1.800 3.6 96.4 0.027 0.206 7.55
-1.850 1.850 3.2 96.8 0.025 0.197 7.97
-1.900 1.900 2.9 97.1 0.022 0.188 8.42
-1.950 1.950 2.6 97.4 0.020 0.180 8.89
-2.000 2.000 2.3 97.7 0.018 0.172 9.39
-2.050 2.050 2.0 98.0 0.017 0.164 9.91

From these table it can be seen that predicted odds ratios are always larger than the true ones. The ratio of these odds ratios (the shrinkage factor) is larger, the more extreme comparisons are contemplated.

In summary, the use of the random effects models for the small cluster sizes (number of observations per surgeon) is likely to lead to estimates (or rather predictions) of individual effects that are smaller than their true values. Even though one should expect the differences to decrease with larger cluster sizes, this is unlikely to be observedin real world datasets (how often does one come across a surgeon who has performed 1000s of operation of the same type before they retire?). Furthermore, the comparison of  surgeon performance based on predicted random effects is likely to be misleading due to over-shrinkage.

 

 

 

 

Survival Analysis With Generalized Additive Models: Part V (stratified baseline hazards)

May 9, 2015

In the fifth part of this series we will examine the capabilities of Poisson GAMs to stratify the baseline hazard for survival analysis. In a stratified Cox model, the baseline hazard is not the same for all individuals in the study. Rather, it is assumed that the baseline hazard may differ between members of groups, even though it will be the same for members of the same group.

Stratification is one of the ways that one may address the violation of the proportionality assumption for a categorical covariate in the Cox model. The stratified Cox model resolves the overall hazard in the study as:

h_{g}(t,X) = h_{0_{g}}(t)exp(\boldsymbol{x\beta}) ,\quad g=1,2,\dotsc ,g_{K}

In the logarithmic scale, the multiplicative model for the stratified baseline hazard becomes an additive one. In particular, the specification of a different baseline hazard for the different levels of a factor amounts to specifying an interaction between the factor and the smooth baseline hazard in the PGAM.

We turn to the PBC dataset to provide an example of a stratified analysis with either the Cox model or the PGAM. In that dataset the covariate edema is a categorical variable assuming the values of 0 (no edema), 0.5 (untreated or successfully treated) and 1(edema despite treatment). An analysis of the Schoenfeld residual test shows that this covariate violates the proportionality assumption

> f<-coxph(Surv(time,status)~trt+age+sex+factor(edema),data=pbc)
> Schoen<-cox.zph(f)
> Schoen
rho chisq p
trt -0.089207 1.12e+00 0.2892
age -0.000198 4.72e-06 0.9983
sexf -0.075377 7.24e-01 0.3950
factor(edema)0.5 -0.202522 5.39e+00 0.0203
factor(edema)1 -0.132244 1.93e+00 0.1651
GLOBAL NA 8.31e+00 0.1400
> 

To fit a stratified GAM model, we should transform the dataset to include additional variables, one for each level of the edema covariate. To make the PGAM directly comparable to the stratified Cox model, we have to fit the former without an intercept term. This requires that we include additional dummy variables for any categorical covariates that we would to adjust our model for. In this particular case, the only other additional covariate is the female gender:

pbcGAM<-transform(pbcGAM,edema0=as.numeric(edema==0),
edema05=as.numeric(edema==0.5),edema1=as.numeric(edema==1),
sexf=as.numeric(sex=="f"))

 

Then the stratifed Cox and PGAM models are fit as:


fGAM<-gam(gam.ev~s(stop,bs="cr",by=edema0)+s(stop,bs="cr",by=edema05)+
s(stop,bs="cr",by=edema1)+trt+age+sexf+offset(log(gam.dur))-1,
data=pbcGAM,family="poisson",scale=1,method="REML")

fs<-coxph(Surv(time,status)~trt+age+sex+strata(edema),data=pbc)

In general the values of covariates of the stratified Cox and the PGAM models are similar with the exception of the trt variable. However the standard error of this variable estimated by either model is so large, that the estimates are statistically no different from zero, despite their difference in magnitude

> fs
Call:
coxph(formula = Surv(time, status) ~ trt + age + sex + strata(edema), 
 data = pbc)


 coef exp(coef) se(coef) z p
trt 0.0336 1.034 0.18724 0.18 0.86000
age 0.0322 1.033 0.00923 3.49 0.00048
sexf -0.3067 0.736 0.24314 -1.26 0.21000

Likelihood ratio test=15.8 on 3 df, p=0.00126 n= 312, number of events= 125 
 (106 observations deleted due to missingness)
> summary(fGAM)

Family: poisson 
Link function: log 

Formula:
gam.ev ~ s(stop, bs = "cr", by = edema0) + s(stop, bs = "cr", 
 by = edema05) + s(stop, bs = "cr", by = edema1) + trt + age + 
 sexf + offset(log(gam.dur)) - 1

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
trt 0.002396 0.187104 0.013 0.989782 
age 0.033280 0.009170 3.629 0.000284 ***
sexf -0.297481 0.240578 -1.237 0.216262 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop):edema0 2.001 2.003 242.0 <2e-16 ***
s(stop):edema05 2.001 2.001 166.3 <2e-16 ***
s(stop):edema1 2.000 2.001 124.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

R-sq.(adj) = -0.146 Deviance explained = -78.4%
REML score = 843.96 Scale est. = 1 n = 3120</pre>

Survival Analysis With Generalized Additive Models : Part IV (the survival function)

May 3, 2015

The ability of PGAMs to estimate the log-baseline hazard rate, endows them with the capability to be used as smooth alternatives to the Kaplan Meier curve. If we assume for the shake of simplicity that there are no proportional co-variates in the PGAM regression, then the quantity modeled  corresponds to the log-hazard of the  survival function. Note that the only assumptions made by the PGAM is that the a) log-hazard is a smooth function, with b) a given maximum complexity (number of degrees of freedom) and c) continuous second derivatives. A PGAM provides estimates of the log-hazard constant, \beta_{0}, and the time-varying deviation, \lambda(t_{i,j}). These can be used to predict the value of the survival function, S(t), by approximating the integral appearing in the definition of S(t) by numerical quadrature.

S(t_{i})=\exp\left(-\int_{0}^{t_{i}}h(t)\mathrm{\, d}t\right)\approx\exp\left(-\sum_{j=1}^{N_{i}}w_{i,j}\exp(\beta_{0}+\lambda(t_{i,j}))\right)

From the above definition it is obvious that the value of the survival distribution at any given time point is a non-linear function of the PGAM estimate. Consequently, the predicted survival value, S_{pred}(t), cannot be derived in closed form; as with all non-linear PGAM estimates, a simple Monte Carlo simulation algorithm may be used to derive both the expected value of \hat{S}_{pred}(t) and its uncertainty. For the case of the survival function, the simulation steps are provided in Appendix (Section A3) of our paper. The following R function can be used to predict the survival function and an associated confidence interval at a grid of points. It accepts as arguments a) the vector of time points, b) a PGAM object for the fitted log-hazard function, c) a list with the nodes and weights of a Gauss-Lobatto rule for the integration of the predicted survival, d) the number of Monte Carlo samples to obtain and optionally e) the seed of the random number generation. Of note, the order of the quadrature used to predict the survival function is not necessarily the same as the order used to fit the log-hazard function.

## Calculate survival and confidence interval over a grid of points
## using a GAM
SurvGAM<-Vectorize(function(t,gm,gl.rule,CI=0.95,Nsim=1000,seed=0)
##         t : time at which to calculate relative risk
##        gm : gam model for the fit
##   gl.rule : GL rule (list of weights and nodes)
##        CI : CI to apply
##      Nsim : Number of replicates to draw
##      seed : RNG seed
{
q<-(1-CI)/2.0
## create the nonlinear contrast
pdfnc<-data.frame(stop=t,gam.dur=1)
L<-length(gl.rule$x)
start<-0; ## only for right cens data
## map the weights from [-1,1] to [start,t]
gl.rule$w<-gl.rule$w*0.5*(t-start)
## expand the dataset
df<-Survdataset(gl.rule,pdfnc,fu=1)
## linear predictor at each node
Xp <- predict(gm,newdata=df,type="lpmatrix")
## draw samples
set.seed(seed)
br <- rmvn(Nsim,coef(gm),gm$Vp)
res1<-rep(0,Nsim)
for(i in 1:Nsim){
## hazard function at the nodes
hz<-exp(Xp%*%br[i,])
## cumumative hazard
chz1<-gl.rule$w %*% hz[1:L,]
##survival
res1[i]<-exp(-chz1)
}
ret<-data.frame(t=t,S=mean(res1),
LCI=quantile(res1,prob=q),
UCI=quantile(res1,prob=1-q))
ret

},vectorize.args=c("t"))

The function makes use of another function, Survdataset, that expands internally the vector of time points into a survival dataset. This dataset is used to obtain predictions of the log-hazard function by calling the predict function from the mgcv package.

## Function that expands a prediction dataset
## so that a GL rule may be applied
## Used in num integration when generating measures of effect
Survdataset<-function(GL,data,fu)
## GL  : Gauss Lobatto rule
## data: survival data
##   fu: column number containing fu info

{
## append artificial ID in the set
data$id<-1:nrow(data)
Gllx<-data.frame(stop=rep(GL$x,length(data$id)),
t=rep(data[,fu],each=length(GL$x)),
id=rep(data$id,each=length(GL$x)),
start=0)
## Change the final indicator to what
## was observed, map node positions,
## weights from [-1,1] back to the
## study time
Gllx<-transform(Gllx,
stop=0.5*(stop*(t-start)+(t+start)))
## now merge the remaining covariate info

Gllx<-merge(Gllx,data[,-c(fu)])
nm<-match(c("t","start","id"),colnames(Gllx))
Gllx[,-nm]
}

The ability to draw samples from the multivariate normal distribution corresponding to the model estimates and its covariance matrix is provided by another function, rmvn:

## function that draws multivariate normal random variates with
## a given mean vector and covariance matrix
##    n : number of samples to draw
##   mu : mean vector
##  sig : covariance matrix
rmvn <- function(n,mu,sig) { ## MVN random deviates
L <- mroot(sig);m <- ncol(L);
t(mu + L%*%matrix(rnorm(m*n),m,n))
}

To illustrate the use of these functions we revisit the PBC example from the 2nd part of this blog series. Firstly, let’s obtain a few Gauss-Lobatto lists of weights/nodes for the integration of the survival function:


## Obtain a few Gauss Lobatto rules to integrate the survival
## distribution
GL5<-GaussLobatto(5);
GL10<-GaussLobatto(10);
GL20<-GaussLobatto(20);

Subsequently, we fit the log-hazard rate to the coarsely (5 nodes) and more finely discretized (using a 10 point Gauss Lobatto rule) versions of the PBC dataset, created in Part 2. The third command obtains the Kaplan Meier estimate in the PBC dataset.


fSurv1<-gam(gam.ev~s(stop,bs="cr")+offset(log(gam.dur)),
data=pbcGAM,family="poisson",scale=1,method="REML")
fSurv2<-gam(gam.ev~s(stop,bs="cr")+offset(log(gam.dur)),
data=pbcGAM2,family="poisson",scale=1,method="REML")

KMSurv<-survfit(Surv(time,status)~1,data=pbc)

We obtained survival probability estimates for the 6 combinations of time discretization for fitting (either a 5 or 10th order Lobatto rule) and prediction (a 5th, 10th or 20th order rule):

t<-seq(0,4500,100)
s1<-SurvGAM(t,fSurv1,GL5)
s2<-SurvGAM(t,fSurv1,GL10)
s3<-SurvGAM(t,fSurv1,GL20)
s4<-SurvGAM(t,fSurv2,GL5)
s5<-SurvGAM(t,fSurv2,GL10)
s6<-SurvGAM(t,fSurv2,GL20)

In all cases 1000 Monte Carlo samples were obtained for the calculation of survival probability estimates and their pointwise 95% confidence intervals. We can plot these against the Kaplan Meier curve estimates:

par(mfrow=c(2,3))
plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL5)")
lines(s1[1,],s1[2,],col="blue",lwd=2)
lines(s1[1,],s1[3,],col="blue",lwd=2,lty=2)
lines(s1[1,],s1[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL10)")
lines(s2[1,],s2[2,],col="blue",lwd=2)
lines(s2[1,],s2[3,],col="blue",lwd=2,lty=2)
lines(s2[1,],s2[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL20)")
lines(s3[1,],s3[2,],col="blue",lwd=2)
lines(s3[1,],s3[3,],col="blue",lwd=2,lty=2)
lines(s3[1,],s3[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL5)")
lines(s4[1,],s4[2,],col="blue",lwd=2)
lines(s4[1,],s4[3,],col="blue",lwd=2,lty=2)
lines(s4[1,],s4[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL10)")
lines(s5[1,],s5[2,],col="blue",lwd=2)
lines(s5[1,],s5[3,],col="blue",lwd=2,lty=2)
lines(s5[1,],s5[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL20)")
lines(s6[1,],s6[2,],col="blue",lwd=2)
lines(s6[1,],s6[3,],col="blue",lwd=2,lty=2)
lines(s6[1,],s6[4,],col="blue",lwd=2,lty=2)

Survival probability estimates: Kaplan Meier curve (black) v.s. the PGAM estimates for different orders of Gauss Lobatto (GL) quadrature

Survival probability estimates: Kaplan Meier curve (black) v.s. the PGAM estimates for different orders of Gauss Lobatto (GL) quadrature

Overall, there is a close agreement between the Kaplan Meier estimate and the PGAM estimates despite the different function spaces that the corresponding estimators “live”: the space of all piecewise constant functions (KM) v.s. that of the smooth functions with bounded, continuous second derivatives (PGAM). Furthermore, the 95% confidence interval of each estimator (dashed lines) contain the expected value of the other estimator. This suggests that there is no systematic difference between the KM and the PGAM survival estimators. This was confirmed in simulated datasets (see Fig 2 in our PLoS One paper).

Survival Analysis With Generalized Additive Models : Part III (the baseline hazard)

May 2, 2015

In the third part of the series on survival analysis with GAMs we will review the use of the baseline hazard estimates provided by this regression model. In contrast to the Cox mode, the log-baseline hazard is estimated along with other quantities (e.g. the log hazard ratios) by the Poisson GAM (PGAM) as:

log(h(t_{i,j})) = \beta_{0}+\lambda(t_{t,j})+\boldsymbol{x\beta} = \lambda_{I}(t_{i,j})+\boldsymbol{x\beta}

In the aforementioned expression, the baseline hazard is equivalently modeled as a time-varying deviation (\lambda(t) ) from a constant (the intercept \beta_{0}) , or as a time-varying function (\lambda_{I}(t) ). In the latter case, the constant is absorbed into the smooth term. The choice between these equivalent forms is dictated by the application at hand; in particular, the intercept may be switched on or off by centering the smooth terms appearing in the call to the gam function. Hence, in the PGAM formulation the log-baseline hazard is yet another covariate that one estimates by a smooth function; other covariates may modify this hazard in a proportional fashion by additively shifting the log-baseline hazard (\boldsymbol{x\beta}).

In the “standard” way of fitting a PGAM by mgcv, the log-baseline hazard is estimated in the constant+deviation form. Exponentiation may be used to derive the baseline hazard and its standard errors. Continuing the analysis of the Primary Biliary Cirrhosis example from the second part of the series, we may write:

par(mfrow=c(2,2))
plot(fGAM,main="Gauss Lobatto (5)",ylab="log-basehaz")
plot(fGAM2,main="Gauss Lobatto (10)",ylab="log-basehaz")
plot(fGAM,main="Gauss Lobatto (5)",ylab="basehaz",trans=exp)
plot(fGAM2,main="Gauss Lobatto (10)",ylab="basehaz",trans=exp)
Log Baseline (top row) and Baseline (second row) hazard function in the PBC dataset for two different discretizations of the data

Log Baseline (top row) and Baseline (second row) hazard function in the PBC dataset for two different discretizations of the data. In all these cases, the baseline hazard (or its log) are given as time varying deviations from a constant (the value of the log-hazard where the confidence interval vanishes)

There is no substantial difference in the estimated obtained by the coarse (Gauss Lobatto (5)) and finer (Gauss Lobatto (10)) discretization. Note that as a result of fitting the log-hazard as constant+ time-varying deviation, the standard error of the curve vanishes at ~1050: the value of the log-hazard at that instant in events per unit time is provided by the intercept term.

Estimation of the log-baseline hazard allows the PGAM to function as a parametric, smooth alternative to the Kaplan Meier estimator. This will be examined in the fourth part of this series.

 

Survival Analysis With Generalized Models: Part II (time discretization, hazard rate integration and calculation of hazard ratios)

May 2, 2015

In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible.

Consider a set of sM individual observations at times \mathcal{F}=\left\{ F_{i}\right\} _{i=1}^{M} , with censoring indicators \mathcal{D}=\left\{ \delta_{i}\right\} _{i=1}^{M} assuming the value of 0 if the corresponding observation was censored and 1 otherwise. Under the assumption of non-informative censoring, the likelihood of the sample is given by:

L=\prod_{i=1}^{M}f(F_{i})^{\delta_{i}}S(F_{i})^{1-\delta_{i}}=    \prod_{i=1}^{M}h(F_{i})^{\delta_{i}}\exp\left(-\int_{0}^{F_{i}}h(t)\mathrm{\, d}t\right)

where h(t) is the hazard function. By using an interpolatory quadrature rule, one may substitute the integral with a weighted sum evaluated at a distinct number of nodes.

L=\prod_{i=1}^{M}\prod_{j=1}^{N_{i}}h(t_{i})^{d_{i,j}}\exp\left(-w_{j,j}h(t_{i,j})\right)

where t_{i,j}, w_{i,j}  are the nodes, weights of the integration rule and d_{i,j} is an indicator variable equal to 1 if the corresponding node corresponds to an event time and zero otherwise.  By including additional “pseudo-observations” at the nodes of the quadrature rule, we converted the survival likelihood to the kernel of a Poisson regression with variable exposures (weights).  Conditional on the adoption of an efficient quadrature rule, this is a highly accurate approximation:

A) relationship between (log) MST and the logarithm of the Maximum Hazard rate function for survival distributions with a cubic polynomial log baseline hazard function (B) Box plots of the GL error as a function of the number of nodes in the quadrature rule (C) GL error as a function of the length of the integration interval (taken equal to be equal to the MST for each distribution examined) for different orders of the quadrature rule (D) GL error as a function of the maximum value of the hazard rate for different orders of the quadrature rule.

Bounds of the Gauss Lobatto (GL) approximation error for the integration of survival data (MST=Mean Survival Time).

In order for the construct to work one has to ensure that the corresponding lifetimes are mapped to a node of the integration scheme.  In our paper, this was accomplished by the adoption of the Gauss-Lobatto rule. The nodes and weights of the Gauss-Lobatto rule (which is defined in the interval [-1,1] depend on the Legendre polynomials in a complex way. The following R function will calculate the weights and nodes for the N-th order Gauss Lobatto rule:

GaussLobatto<-function(N)
{
 N1<-N
 N<-N-1
 x=matrix(cos(pi*(0:N)/N),ncol=1)
 x=cos(pi*(0:N)/N)
 P<-matrix(0,N1,N1)
 xold<-2
 while (max(abs(x-xold))>2.22044604925031e-16) {
 xold<-x
 P[,1]<-1 
 P[,2]<-x
 
 for (k in 2:N) {
 P[,k+1]=( (2*k-1)*x*P[,k]-(k-1)*P[,k-1] )/k;
 }
 
 x<-xold-( x*P[,N1]-P[,N] )/( N1*P[,N1] )
 
 }

 w<-2./(N*N1*P[,N1]^2);
 ret<-list(x=rev(x),w=w)
 attr(ret,"order")<-N
 ret
}

which can be called to return a list of the nodes and their weights:

> GaussLobatto(5)
$x
[1] -1.0000000 -0.6546537 0.0000000 0.6546537 1.0000000

$w
[1] 0.1000000 0.5444444 0.7111111 0.5444444 0.1000000

attr(,"order")
[1] 4

To prepare a survival dataset for GAM fitting, one needs to call this function to obtain a Gauss Lobatto rule of the required order. Once this has been obtained, the following R function will expand the (right-censored) dataset to include the pseudo-observations at the nodes of the quadrature rule:


GAMSurvdataset<-function(GL,data,fu,d)
## GL : Gauss Lobatto rule
## data: survival data
## fu: column number containing fu info
## d: column number with event indicator
 {
 ## append artificial ID in the set
 data$id<-1:nrow(data)
 Gllx<-data.frame(stop=rep(GL$x,length(data$id)),
 gam.dur=rep(GL$w,length(data$id)),
 t=rep(data[,fu],each=length(GL$x)),
 ev=rep(data[,d],each=length(GL$x)),
 id=rep(data$id,each=length(GL$x)),
 gam.ev=0,start=0)
 ## Change the final indicator to what
 ## was observed, map node positions,
 ## weights from [-1,1] back to the
 ## study time
 Gllx<-transform(Gllx,
 gam.ev=as.numeric((gam.ev | ev)*I(stop==1)),
 gam.dur=0.5*gam.dur*(t-start),
 stop=0.5*(stop*(t-start)+(t+start)))
 ## now merge the remaining covariate info
 Gllx<-merge(Gllx,data[,-c(fu,d)])
 Gllx
 }

We illustrate the use of these functions on the Primary Biliary Cirrhosis dataset that comes with R:

data(pbc)
> ## Change transplant to alive
> pbc$status[pbc$status==1]<-0
> ## Change event code of death(=2) to 1
> pbc$status[pbc$status==2]<-1
> 
> head(pbc)
 id time status trt age sex ascites hepato spiders edema bili chol albumin copper
1 1 400 1 1 58.76523 f 1 1 1 1.0 14.5 261 2.60 156
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302 4.14 54
3 3 1012 1 1 70.07255 m 0 0 0 0.5 1.4 176 3.48 210
4 4 1925 1 1 54.74059 f 0 1 1 0.5 1.8 244 2.54 64
5 5 1504 0 2 38.10541 f 0 1 1 0.0 3.4 279 3.53 143
6 6 2503 1 2 66.25873 f 0 1 0 0.0 0.8 248 3.98 50
 alk.phos ast trig platelet protime stage
1 1718.0 137.95 172 190 12.2 4
2 7394.8 113.52 88 221 10.6 3
3 516.0 96.10 55 151 12.0 4
4 6121.8 60.63 92 183 10.3 4
5 671.0 113.15 72 136 10.9 3
6 944.0 93.00 63 NA 11.0 3
> 
> GL<-GaussLobatto(5)
> pbcGAM<-GAMSurvdataset(GL,pbc,2,3)
> head(pbcGAM)
 id stop gam.dur t ev gam.ev start trt age sex ascites hepato spiders
1 1 0.00000 20.0000 400 1 0 0 1 58.76523 f 1 1 1
2 1 69.06927 108.8889 400 1 0 0 1 58.76523 f 1 1 1
3 1 200.00000 142.2222 400 1 0 0 1 58.76523 f 1 1 1
4 1 330.93073 108.8889 400 1 0 0 1 58.76523 f 1 1 1
5 1 400.00000 20.0000 400 1 1 0 1 58.76523 f 1 1 1
6 2 0.00000 225.0000 4500 0 0 0 1 56.44627 f 0 1 1
 edema bili chol albumin copper alk.phos ast trig platelet protime stage
1 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
2 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
3 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
4 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
5 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
6 0 1.1 302 4.14 54 7394.8 113.52 88 221 10.6 3
> 
> dim(pbc)
[1] 418 20
> dim(pbcGAM)
[1] 2090 24

The original (pbc) dataset has been expanded to include the pseudo-observations at the nodes of the Lobatto rule. There are multiple records (5 per individual in this particular case) as can be seen by examining the data for the first patient (id=1). The corresponding times are found in the variable stop, their associated weights in the variable gam.dur and the event indicators are in the column gam.ev. Note that nodes and weights are expressed on the scale of the survival dataset, not in the scale of the Lobatto rule ([-1,1]). To fit the survival dataset one needs to load the mgcv package and fit a Poisson GAM, using a flexible (penalized spline) for the log-hazard rate function.

The following code will obtain an adjusted (for age and sex) hazard ratio using the PGAM or the Cox model:


library(survival) ## for coxph
> library(mgcv) ## for mgcv
> 
> ## Prop Hazards Modeling with PGAM
> fGAM<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM,family="poisson",scale=1,method="REML")
> 
> ## Your Cox Model here
> f<-coxph(Surv(time,status)~trt+age+sex,data=pbc)
> 
> 
> summary(fGAM)

Family: poisson 
Link function: log

Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345236 0.655176 -15.790 < 2e-16 ***
trt 0.069546 0.181779 0.383 0.702 
age 0.038488 0.008968 4.292 1.77e-05 ***
sexf -0.370260 0.237726 -1.558 0.119 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.008 1.015 4.186 0.0417 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.249 Deviance explained = 2.25%
-REML = 693.66 Scale est. = 1 n = 1560
> f
Call:
coxph(formula = Surv(time, status) ~ trt + age + sex, data = pbc)


 coef exp(coef) se(coef) z p
trt 0.0626 1.065 0.182 0.344 7.3e-01
age 0.0388 1.040 0.009 4.316 1.6e-05
sexf -0.3377 0.713 0.239 -1.414 1.6e-01

Likelihood ratio test=22.5 on 3 df, p=5.05e-05 n= 312, number of events= 125 
 (106 observations deleted due to missingness)

The estimates for log-hazard ratio of the three covariates (trt, age, and female gender) are numerically very close. Any numerical differences reflect the different assumptions made about the baseline hazard: flexible spline (PGAM) v.s. piecewise exponential (Cox).

Increasing the number of nodes of the Lobatto rule does not materially affect the estimates of the PGAM:


GL<-GaussLobatto(10)
> pbcGAM2<-GAMSurvdataset(GL,pbc,2,3)
> fGAM2<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM2,family="poisson",scale=1,method="REML")
> 
> summary(fGAM2)

Family: poisson 
Link function: log

Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345288 0.655177 -15.790 < 2e-16 ***
trt 0.069553 0.181780 0.383 0.702 
age 0.038487 0.008968 4.292 1.77e-05 ***
sexf -0.370340 0.237723 -1.558 0.119 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.003 1.005 4.163 0.0416 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.124 Deviance explained = 1.7%
-REML = 881.67 Scale est. = 1 n = 3120

Nevertheless, the estimates of the “baseline log-hazard” become more accurate (decreased standard errors and significance of the smooth term) as the number of nodes increases.

In simulations (see Fig 3) we show that the estimates of the hazard ratio generated by the GAM are comparable in bias, variance and coverage to those obtained by the Cox model. Even though this is an importance benchmark for the proposed method, it does not provide a compelling reason for replacing  the Cox model with the PGAM. In fact, the advantages of the PGAM will only become apparent once we consider contexts which depend on the baseline hazard function, or problems in which the proportionality of hazards assumption is violated. So stay tuned.

Survival Analysis With Generalized Additive Models : Part I (background and rationale)

May 1, 2015

After a really long break, I’d will resume my blogging activity. It is actually a full circle for me, since one of the first posts that kick started this blog, matured enough to be published in a peer-reviewed journal last week. In the next few posts I will use the R code included to demonstrate the survival fitting capabilities of Generalized Additive Models (GAMs) in real world datasets. The first post in this series will summarize the background, rationale and expected benefits to be realized by adopting GAMs from survival analysis.

In a nutshell, the basic ideas of the GAM approach to survival analysis are the following:

  1. One approximates the integral defining the survival function as a discrete sum using a quadrature rule
  2. One evaluates the likelihood at the nodes of the aforementioned quadrature rule
  3. A regression model is postulated for the log-hazard rate function
  4. As a result of 1-3 the survival regression problem is transformed into a Poisson regression one
  5. If penalized regression is used to fit the regression model, then GAM fitting software may be used for survival analysis

Ideas along the lines 1-4 have been re-surfacing in the literature ever since the Proportional Hazards Model was described. The mathematical derivations justifying Steps 1-4 are straightforward to follow and are detailed in the PLoS paper. The corresponding derivations for the Cox model are also described in a previous post.

Developments such as 1-4 were important in the first 10 years of the Cox model, since there were no off-the-shelf implementations of the partial (profile) likelihood approach. This limited the practical scope of proportional hazards modeling and set off a parallel computational line of research in how one could use other statistical software libraries to fit the Cox model.  In fact, the first known to the author application of a proportional model for the analysis of a National Institute of Health (NIH) randomized controlled trial used a software implementing a Poisson regression to calculate the hazard ratio. The trial was the NCDS trial that examined adequacy indices for the hemodialysis prescription (the description of the software was published 6 months prior to the clinical paper).  Many of these efforts were computationally demanding and died off as the Cox model was implemented in the various statistical packages after the late 80s and semi-parametric theory took off and provide a post-hoc justification for many of the nuances implicit in the Cox model.  Nevertheless, one can argue that in the era of the modern computer, no one really needs the Cox model. This technical report and the author’s work on a real world, complex dataset provides the personal background for my research on GAM approaches for survival data.

The GAM (or Poisson GAM, PGAM as called in the paper) is an extension of these old ideas (see the literature survey here and here). In particular, PGAM models the quantities that are modeled semi-parametrically (e.g. the baseline hazard) in the Cox model with parametric, flexible functions that are estimated by penalized regressio. One of the first applications of penalized regression for survival analysis is the Fine and Gray spline model, which is however not a PGAM model.  There are specific benefits to be realized from penalizing the Poisson regression and adopting GAMs  in the context of survival analysis:

  • Parsimony: degrees of freedom are not wasted as penalization will seek the most parsimonious representation (fewer degrees of freedom) among the many possible functions that may fit the data
  • Reduction of the analyst-degrees-of freedom: the shape of the functional relationships between survival and outcome are learned from the data. This limits the potential of someone putting a specific form for this relationship (e.g. linear v.s. quadratic) and running away with the most convenient p-value
  • Multiple time scale modelling: one can model more than one time scales in a dataset (i.e. not just the study time). This is useful when adjusting for secular trends in an observational dataset or even in a randomized trial. In particular cluster randomized trials at the community level may be immune to secular trends
  • Non-proportional hazards modeling: when the hazards are not proportional, the Cox model is not applicable. Many oncology datasets will demonstrate a deviation from proportionality (in fact we re-analyzed such a trial in the PLoS paper) . For a particular dataset, one would like to know whether the proportionality assumption is violated, and if so one would like to “adjust” for it. Such an adjustment will take the form of a time-varying hazard ratio function and these may be estimated with the PGAMs. In such a case, one can even extract an “average” hazard ratio while still estimating a time-varying deviation around it using the PGAM. However non-proportionality should shift the analyst to:
  • Alternative measures of treatment effect: These may include relative risks, absolute risk differences of even differences in the (Restricted) Mean Survival Time. Such measures are calculated from the time varying hazard ratios using statistical simulation techniques
  • Handling of correlated outcomes: Correlated outcomes may arise from center effects, multiple events in the same individual or even cluster randomization designs. The analysis of such outcomes is facilitated by the interpretation of the PGAM as a generalized linear mixed model and the introduction of the corresponding random effects and their correlation structure into the regression
  • Center effects: A variety of modeling options are available including stratified hazards, fixed or random effects
  • Subgroup analyses
  • Time varying external covariates
  • Unconditional/Conditional/Population averaged effects: The unconditional estimate is obtained by indexing the individuals with the group they are assigned to (e.g. placebo or drug in an interventional trial). The conditional estimate is obtained by introducing covariates (e.g. gender, age) into the regression to calculate effects for individuals that share these characteristics. The population effect averages over the conditional effects over all the individuals in the trial. In epidemiology it is known as the corrected group prognosis method. This was introduced in a JAMA paper almost 15 years ago as a way to generate adjusted survival probability curves
  • Handling of right censored/left truncated/uncensored data

These benefits follow directly from the  mixed model equivalence between semi-parametric, penalized regression and Generalized Mixed Linear Models. An excellent, survey may be found here, while Simon Wood’s book in the GAM implementation of the mgcv package in R contains a concise presentation of these ideas.

As it stands the method presented has no software  implementation similar to the survival package in R. Even though we provide R code to run the examples in the paper, the need for the various functions may not be entirely clear. Hence the next series of posts will go over the code and the steps required to fit the PGAM using the R programming language.

Your astrological sign can maim you, kill you or make you insane (or not?)

April 24, 2014

Can day of birth (as in here, here too and there) or  astrological sign affect later outcomes?

These are essentially the same question – astrological sign is determined by  the day of one’s birth but our reaction to the question is different. Asked in the first way, this is a legit question about an Epidemiological signal known since antiquity and is officially known as “season of birth effect” (SBE). (Historical trivia: the first use of this term may be found in Ellsworth’s Huntington book: Season of Birth: Its Relation to Human Abilities published in 1938). The SBE appears to be relevant for many diseases, outcomes and human populations and countries. Since we westerners are more likely to bite the bullet from cardiovascular disease or cancer, rest assured that the association has been noted for the former and the latter (though not everyone agrees)

Asked in the second way, the same question can immediately bring ridicule and shame on the scientifically oriented person who dares to ask it. In fact there is a small cottage industry of papers that use astrological sign in order to caution/warn/ridicule/discredit subgroup analyses in clinical datasets. The most famous of this is a subgroup analysis of the ISIS-2 but there are certainly others.

So what makes us think that the Season of Birth Effect is real and worthy of study across disciplines, while the astrological sign association is bogus and should be discarded any time it is detected? The answer may be found in the framing effect, a form of cognitive bias in which our reaction and even answer to the same question critically depends on the way it is framed. Though the framing effect is particularly relevant in sociology, it is also operational in medicine particularly affecting the choices patients make depending on the format of presentation of the same information.

So if framing effects are also operational in scientific discourse (at least in the medical field) how do we deal with them? The first way is to acknowledge their potential presence so as to minimize their impact by appropriate use of neutral terminology and avoid sensationalism when describing associations and effects. As the season-of-birth effect example shows one can present the same information to evoke sharply different responses, so one should always be on the look for emotions when reading a story in either the scientific or the lay press.

So to come back to the initial question: does one’s day of birth or astrological sign affect later outcomes? The associations seem to be there and are relatively consistent and reproducible so some phenomenon must be at play. Whether this phenomenon is real or a statistical analysis artifact (e.g. could age-period-cohort effects and the troubled history of the 20th century be at play here) is something I don’t know. But I would certainly be willing to find more about it, irrespective of whether the information is presented by month of birth or astrological sign 🙂