Archive for the ‘medicine’ Category

Shrinkage of predicted random effects in logistic regression models

July 30, 2015

As a follow-up of our initial investigation of the bias of random effect predictions in generalized linear mixed regression models, I undertook a limited simulation experiment. In this experiment, we varied the population average complication rate and the associated standard deviation of the individual random effect and generated a small number of replications (N=200) for each combination of population mean and standard deviation. Limited by the computational resources of (the otherwise wonderful!) tablet, I had to forego a more thorough and more precise (larger number of replicates) examination of bias, variance and coverage. The focus is on general patterns rather than precise estimates; by making the code public, these can be obtained by anyone who has access to the R computing environment.

The basic code that simulates  from these logistic regression models is shown below. Although we only examine small cluster sizes (number of observations per cluster 20-100) and a moderate number of individual random effects (200) it is straightforward to modify these aspects of the simulation study. For these simulations we examined a limited number of values for the standard deviation of the random effects (0.1, 0.2 and 0.5) and overall complication rates (0.05, 0.10, 0.15, 0.20) to reflect the potential range of values compatible with the raw data in the Surgeon Scorecard.

## helper functions
logit<-function(x) log(x/(1-x))
invlogit<-function(x) exp(x)/(1+exp(x))

## code allows a sensitivity analysis to the order of the nAGQ
## explanation of symbols/functions appearing inside the
## simulation function
## fit: glmer fit using a X point nAGQ integration rule
## ran: random effects using a X point nAGQ integration rule
##   cR: correlation between fitted and simulated random effect
##       the fitted were obtained with a X point nAGQ integrator
## int: intercept obtained with a X point nAGQ integration rule
## biG: bias of the fitted v.s. the simulated random effects
##      this is obtained as the intercept of a GAM regression of
##      the fitted versus the simulated values
## bi0: bias of the intercept from the glmer fit
## sdR: estimated standard deviation of the random eff distro
## bs0: bias of the standard deviation of the random effects (SDR)
## edf: non-linearity of the slope of a GAM regression of the
##      fitted v.s. the simulated residuals
## slQ: derivative of the non-linear slope at the Qth quantile
simFit<-function(X,complications,ind,pall,SDR,Q=0:10/10) {
ran<-ranef(fit)[["id"]][,1] ## predicted (mode) of random effects
## regress on the residuals to assess performance
fitg<-gam(ran~s(ind,bs="ts")) ## gam fit
edf<-sum(fitg$edf[-1])## edf of the non-linear "slope"

## function to simulate cases
simcase<-function(N,p) rbinom(N,1,p)
## simulation scenarios: fixed for each simulation
Nsurgeon<-200; # number of surgeons
Nmin<-20; # min number of surgeries per surgeon
Nmax<-100; # max number of surgeries per surgeon
Nsim<-sample(Nmin:Nmax,Nsurgeon,replace=TRUE); # number of cases per surgeon

## simulate individual surgical performance
## the reality consists of different combos of pall and the sd of the
## random effect

Nscenariosim<-200 ## number of simulations per scenario

## simulate indivindual scenarios
ind<-rnorm(Nsurgeon,0,sd) ; # surgical random effects
logitind<-logit(pall)+ind ; # convert to logits
pind<-invlogit(logitind); # convert to probabilities

id=factor(,mapply(function(i,N) rep(i,N),1:Nsurgeon,Nsim))))



The datasets are generated by the function simScenario, which when applied over all combinations of population mean(pall, treated as a fixed effect) and random effect standard deviation (sd) generates the synthetic dataset (‘complications’ variable). Once the data have been generated the function simFit receives the synthetic dataset, fits the mixed logistic regression model and generates random effect predictions.

To assess and quantify shrinkage in these regression models, we compare the predicted random effects against the simulated values. Due to the potentially large number of these effects (10s of thousands in the Surgeon Score Card), we developed a methodology that looks for patterns in these bivariate relationships, taking into account the magnitude of each simulated and predicted random effect pair. This methodology rests on flexible regression between the predicted against the simulated random effect in each of the synthetic datasets. This flexible regression, using smoothing splines, generalizes linear regression and thus avoids the assumption that the shrinkage (as assessed by the slope of the linear regression line) is the same irrespective of the value of the simulated random effect. By numerically differentiating the spline at various quantiles of the simulated random effect, we thus have an estimate of shrinkage. In particular, this estimate gives the change in value of the predicted random effect for a unit change in the simulated one. The relevant code that fits the smoothing spline and differentiates it in (user defined) grid of random effects quantiles is shown below:

<pre>fitg<-gam(ran~s(ind,bs="ts")) ## gam fit
edf<-sum(fitg$edf[-1])## edf of the non-linear "slope"

A representative smoothing spline analysis of a mixed logistic regression model is shown in the figure below, along with the corresponding linear fit, and the line signifying zero shrinkage: a straight line passing from the origin with a unit slope. If the latter relationship is observed, then the predicted random effects are unbiased with respect to the simulated ones.


Predicted (y-axis) v.s. simulated (x-axis) random effects in a logistic regression model. 2000 random effects were simulated with small cluster sizes (20-100). Shown are the flexible spline fits (green), linear regression fit (red) and the line of no shrinkage (blue)

This figure shows the potential for random effect predictions to be excessively shrunken relative to the simulated ones (note that the blue line is more horizontal relative to the blue one). One should also observe that the amount of shrinkage is not the same throughout the range of the random effects: positive values (corresponding to individuals with higher complication rates) are shrank relatively less (the green line is closer to the blue line), relative to negative random effects (complication rates less than the population average). The net effect of this non-linear relationship is that the separation between “good” (negative RE) and “bad” (positive values of the RE) is decreased in the predicted relative to the simulated random effects. This can be a rather drastic compression in dynamic range for small cluster sizes as we will see below. 

From each mixed logistic regression model we obtain a large number of information : i) the bias in the fixed effect estimate, ii) the bias in the population standard deviation, iii) the bias in the overall relationship between predicted and simulated residuals (indirectly assessing whether the random effects are centered around the fixed effect) iv) the non-linearity of the relationship between the predicted and the simulated random effects (this is given by the estimated degrees of freedom of the spline, with an edf = 1 signifying a linear fit and higher degrees of freedom non linear relationships; edfs that are less than one signify a more complex pattern of shrinkage of a variable proportion of the random effects to a single point, i.e. the fixed effect) v) the linear slope

There was no obvious bias in the estimation of the intercept (the only fixed effect in our simulation study) for various combinations of overall event rate and random effect standard deviation (but note higher dispersion for large standard deviation values, indicating the need for a higher number of simulation replicates in order to improve precision):

Bias in the estimation of the intercept in the mixed logistic regression model (shown in log scale)

Bias in the estimation of the intercept in the mixed logistic regression model (shown in log scale)


Similarly there was no bias in the estimation of the population standard deviation of the random effects:

Bias in the standard deviation of the random effects

Bias in the standard deviation of the random effects

The estimated relationship between predicted and simulated residuals was in linear for approximately 50% of simulated samples for small to moderate standard deviations. However it was non-linear (implying differential shrinkage according to random effect magnitude) for larger standard deviations, typical of the standard deviation of the surgeon random effect in the Surgeon Score Card.

Estimated degrees of freedom (a measure of non-linearity) in the relationship between predicted and simulated random effects over simulation replicates.

Estimated degrees of freedom (a measure of non-linearity) in the relationship between predicted and simulated random effects over simulation replicates. The red horizontal line corresponds to a linear relationship between predicted and simulated random effects (edf=1).Edf<1 signify the shrinkage of a variable proportion of random effects to zero


The magnitude of the slopes of the smoothing splines at different quartiles of the random effects distribution and for combinations of overall rate and random effect standard deviation are shown below:

Slope of the smoothing spline at different quantiles (20%, 50%, 80%) of the random effect distribution. Boxplots are over 200 replicates of each combination of overall rate and random effect standard deviation

Slope of the smoothing spline at different quantiles (20%, 50%, 80%) of the random effect distribution. Boxplots are over 200 replicates of each combination of overall rate and random effect standard deviation

There are several things to note in this figure:

  1. the amount of shrinkage decreases as the population standard deviation increases (going from left to right in each row of the figure the slopes increase from zero towards one)
  2. the amount of shrinkage decreases as the overall average increases for the same value of the standard deviation (going from top to bottom in the same column)
  3. the degree of shrinkage appears to be the same across the range of random effect quantiles for small to moderate values of the population standard deviation
  4. the discrepancy between the degree of shrinkage is maximized for larger values of the standard deviation of the random effects and small overall rates (top right subpanel). This is the situation that is more relevant for the Surgeon Score Card based on the analyses reported by this project.

What are the practical implications of these observations for the individual surgeon comparison reported in the Score Card project? These are best understood by a numerical example from one of these 200 hundred datasets shown in the top right subpanel. To understand this example one should note that the individual random effects have the interpretation of (log-) odds ratios, irrespective of whether they are predicted or (the simulated) true effects. Hence, the difference in these random effects when exponentiated yield the odds ratio of suffering a complication in the hands of a good relative to a bad surgeon. By comparing these random effects for good and bad surgeons who are equally bad (or good) relative to the mean (symmetric quantiles around the median), one can get an idea of the impact of using the predicted random effects to carry out individual comparisons.

Good Bad Quantile (Good) Quantile (Bad) True OR Pred OR Shrinkage Factor
-0.050 0.050 48.0 52.0 0.905 0.959 1.06
-0.100 0.100 46.0 54.0 0.819 0.920 1.12
-0.150 0.150 44.0 56.0 0.741 0.883 1.19
-0.200 0.200 42.1 57.9 0.670 0.847 1.26
-0.250 0.250 40.1 59.9 0.607 0.813 1.34
-0.300 0.300 38.2 61.8 0.549 0.780 1.42
-0.350 0.350 36.3 63.7 0.497 0.749 1.51
-0.400 0.400 34.5 65.5 0.449 0.719 1.60
-0.450 0.450 32.6 67.4 0.407 0.690 1.70
-0.500 0.500 30.9 69.1 0.368 0.662 1.80
-0.550 0.550 29.1 70.9 0.333 0.635 1.91
-0.600 0.600 27.4 72.6 0.301 0.609 2.02
-0.650 0.650 25.8 74.2 0.273 0.583 2.14
-0.700 0.700 24.2 75.8 0.247 0.558 2.26
-0.750 0.750 22.7 77.3 0.223 0.534 2.39
-0.800 0.800 21.2 78.8 0.202 0.511 2.53
-0.850 0.850 19.8 80.2 0.183 0.489 2.68
-0.900 0.900 18.4 81.6 0.165 0.467 2.83
-0.950 0.950 17.1 82.9 0.150 0.447 2.99
-1.000 1.000 15.9 84.1 0.135 0.427 3.15
-1.050 1.050 14.7 85.3 0.122 0.408 3.33
-1.100 1.100 13.6 86.4 0.111 0.390 3.52
-1.150 1.150 12.5 87.5 0.100 0.372 3.71
-1.200 1.200 11.5 88.5 0.091 0.356 3.92
-1.250 1.250 10.6 89.4 0.082 0.340 4.14
-1.300 1.300 9.7 90.3 0.074 0.325 4.37
-1.350 1.350 8.9 91.1 0.067 0.310 4.62
-1.400 1.400 8.1 91.9 0.061 0.297 4.88
-1.450 1.450 7.4 92.6 0.055 0.283 5.15
-1.500 1.500 6.7 93.3 0.050 0.271 5.44
-1.550 1.550 6.1 93.9 0.045 0.259 5.74
-1.600 1.600 5.5 94.5 0.041 0.247 6.07
-1.650 1.650 4.9 95.1 0.037 0.236 6.41
-1.700 1.700 4.5 95.5 0.033 0.226 6.77
-1.750 1.750 4.0 96.0 0.030 0.216 7.14
-1.800 1.800 3.6 96.4 0.027 0.206 7.55
-1.850 1.850 3.2 96.8 0.025 0.197 7.97
-1.900 1.900 2.9 97.1 0.022 0.188 8.42
-1.950 1.950 2.6 97.4 0.020 0.180 8.89
-2.000 2.000 2.3 97.7 0.018 0.172 9.39
-2.050 2.050 2.0 98.0 0.017 0.164 9.91

From these table it can be seen that predicted odds ratios are always larger than the true ones. The ratio of these odds ratios (the shrinkage factor) is larger, the more extreme comparisons are contemplated.

In summary, the use of the random effects models for the small cluster sizes (number of observations per surgeon) is likely to lead to estimates (or rather predictions) of individual effects that are smaller than their true values. Even though one should expect the differences to decrease with larger cluster sizes, this is unlikely to be observedin real world datasets (how often does one come across a surgeon who has performed 1000s of operation of the same type before they retire?). Furthermore, the comparison of  surgeon performance based on predicted random effects is likely to be misleading due to over-shrinkage.





Survival Analysis With Generalized Additive Models: Part V (stratified baseline hazards)

May 9, 2015

In the fifth part of this series we will examine the capabilities of Poisson GAMs to stratify the baseline hazard for survival analysis. In a stratified Cox model, the baseline hazard is not the same for all individuals in the study. Rather, it is assumed that the baseline hazard may differ between members of groups, even though it will be the same for members of the same group.

Stratification is one of the ways that one may address the violation of the proportionality assumption for a categorical covariate in the Cox model. The stratified Cox model resolves the overall hazard in the study as:

h_{g}(t,X) = h_{0_{g}}(t)exp(\boldsymbol{x\beta}) ,\quad g=1,2,\dotsc ,g_{K}

In the logarithmic scale, the multiplicative model for the stratified baseline hazard becomes an additive one. In particular, the specification of a different baseline hazard for the different levels of a factor amounts to specifying an interaction between the factor and the smooth baseline hazard in the PGAM.

We turn to the PBC dataset to provide an example of a stratified analysis with either the Cox model or the PGAM. In that dataset the covariate edema is a categorical variable assuming the values of 0 (no edema), 0.5 (untreated or successfully treated) and 1(edema despite treatment). An analysis of the Schoenfeld residual test shows that this covariate violates the proportionality assumption

> f<-coxph(Surv(time,status)~trt+age+sex+factor(edema),data=pbc)
> Schoen<-cox.zph(f)
> Schoen
rho chisq p
trt -0.089207 1.12e+00 0.2892
age -0.000198 4.72e-06 0.9983
sexf -0.075377 7.24e-01 0.3950
factor(edema)0.5 -0.202522 5.39e+00 0.0203
factor(edema)1 -0.132244 1.93e+00 0.1651
GLOBAL NA 8.31e+00 0.1400

To fit a stratified GAM model, we should transform the dataset to include additional variables, one for each level of the edema covariate. To make the PGAM directly comparable to the stratified Cox model, we have to fit the former without an intercept term. This requires that we include additional dummy variables for any categorical covariates that we would to adjust our model for. In this particular case, the only other additional covariate is the female gender:



Then the stratifed Cox and PGAM models are fit as:



In general the values of covariates of the stratified Cox and the PGAM models are similar with the exception of the trt variable. However the standard error of this variable estimated by either model is so large, that the estimates are statistically no different from zero, despite their difference in magnitude

> fs
coxph(formula = Surv(time, status) ~ trt + age + sex + strata(edema), 
 data = pbc)

 coef exp(coef) se(coef) z p
trt 0.0336 1.034 0.18724 0.18 0.86000
age 0.0322 1.033 0.00923 3.49 0.00048
sexf -0.3067 0.736 0.24314 -1.26 0.21000

Likelihood ratio test=15.8 on 3 df, p=0.00126 n= 312, number of events= 125 
 (106 observations deleted due to missingness)
> summary(fGAM)

Family: poisson 
Link function: log 

gam.ev ~ s(stop, bs = "cr", by = edema0) + s(stop, bs = "cr", 
 by = edema05) + s(stop, bs = "cr", by = edema1) + trt + age + 
 sexf + offset(log(gam.dur)) - 1

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
trt 0.002396 0.187104 0.013 0.989782 
age 0.033280 0.009170 3.629 0.000284 ***
sexf -0.297481 0.240578 -1.237 0.216262 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop):edema0 2.001 2.003 242.0 <2e-16 ***
s(stop):edema05 2.001 2.001 166.3 <2e-16 ***
s(stop):edema1 2.000 2.001 124.4 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

R-sq.(adj) = -0.146 Deviance explained = -78.4%
REML score = 843.96 Scale est. = 1 n = 3120</pre>

Survival Analysis With Generalized Additive Models : Part IV (the survival function)

May 3, 2015

The ability of PGAMs to estimate the log-baseline hazard rate, endows them with the capability to be used as smooth alternatives to the Kaplan Meier curve. If we assume for the shake of simplicity that there are no proportional co-variates in the PGAM regression, then the quantity modeled  corresponds to the log-hazard of the  survival function. Note that the only assumptions made by the PGAM is that the a) log-hazard is a smooth function, with b) a given maximum complexity (number of degrees of freedom) and c) continuous second derivatives. A PGAM provides estimates of the log-hazard constant, \beta_{0}, and the time-varying deviation, \lambda(t_{i,j}). These can be used to predict the value of the survival function, S(t), by approximating the integral appearing in the definition of S(t) by numerical quadrature.

S(t_{i})=\exp\left(-\int_{0}^{t_{i}}h(t)\mathrm{\, d}t\right)\approx\exp\left(-\sum_{j=1}^{N_{i}}w_{i,j}\exp(\beta_{0}+\lambda(t_{i,j}))\right)

From the above definition it is obvious that the value of the survival distribution at any given time point is a non-linear function of the PGAM estimate. Consequently, the predicted survival value, S_{pred}(t), cannot be derived in closed form; as with all non-linear PGAM estimates, a simple Monte Carlo simulation algorithm may be used to derive both the expected value of \hat{S}_{pred}(t) and its uncertainty. For the case of the survival function, the simulation steps are provided in Appendix (Section A3) of our paper. The following R function can be used to predict the survival function and an associated confidence interval at a grid of points. It accepts as arguments a) the vector of time points, b) a PGAM object for the fitted log-hazard function, c) a list with the nodes and weights of a Gauss-Lobatto rule for the integration of the predicted survival, d) the number of Monte Carlo samples to obtain and optionally e) the seed of the random number generation. Of note, the order of the quadrature used to predict the survival function is not necessarily the same as the order used to fit the log-hazard function.

## Calculate survival and confidence interval over a grid of points
## using a GAM
##         t : time at which to calculate relative risk
##        gm : gam model for the fit
##   gl.rule : GL rule (list of weights and nodes)
##        CI : CI to apply
##      Nsim : Number of replicates to draw
##      seed : RNG seed
## create the nonlinear contrast
start<-0; ## only for right cens data
## map the weights from [-1,1] to [start,t]
## expand the dataset
## linear predictor at each node
Xp <- predict(gm,newdata=df,type="lpmatrix")
## draw samples
br <- rmvn(Nsim,coef(gm),gm$Vp)
for(i in 1:Nsim){
## hazard function at the nodes
## cumumative hazard
chz1<-gl.rule$w %*% hz[1:L,]


The function makes use of another function, Survdataset, that expands internally the vector of time points into a survival dataset. This dataset is used to obtain predictions of the log-hazard function by calling the predict function from the mgcv package.

## Function that expands a prediction dataset
## so that a GL rule may be applied
## Used in num integration when generating measures of effect
## GL  : Gauss Lobatto rule
## data: survival data
##   fu: column number containing fu info

## append artificial ID in the set
## Change the final indicator to what
## was observed, map node positions,
## weights from [-1,1] back to the
## study time
## now merge the remaining covariate info


The ability to draw samples from the multivariate normal distribution corresponding to the model estimates and its covariance matrix is provided by another function, rmvn:

## function that draws multivariate normal random variates with
## a given mean vector and covariance matrix
##    n : number of samples to draw
##   mu : mean vector
##  sig : covariance matrix
rmvn <- function(n,mu,sig) { ## MVN random deviates
L <- mroot(sig);m <- ncol(L);
t(mu + L%*%matrix(rnorm(m*n),m,n))

To illustrate the use of these functions we revisit the PBC example from the 2nd part of this blog series. Firstly, let’s obtain a few Gauss-Lobatto lists of weights/nodes for the integration of the survival function:

## Obtain a few Gauss Lobatto rules to integrate the survival
## distribution

Subsequently, we fit the log-hazard rate to the coarsely (5 nodes) and more finely discretized (using a 10 point Gauss Lobatto rule) versions of the PBC dataset, created in Part 2. The third command obtains the Kaplan Meier estimate in the PBC dataset.



We obtained survival probability estimates for the 6 combinations of time discretization for fitting (either a 5 or 10th order Lobatto rule) and prediction (a 5th, 10th or 20th order rule):


In all cases 1000 Monte Carlo samples were obtained for the calculation of survival probability estimates and their pointwise 95% confidence intervals. We can plot these against the Kaplan Meier curve estimates:

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL5)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL10)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL20)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL5)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL10)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL20)")

Survival probability estimates: Kaplan Meier curve (black) v.s. the PGAM estimates for different orders of Gauss Lobatto (GL) quadrature

Survival probability estimates: Kaplan Meier curve (black) v.s. the PGAM estimates for different orders of Gauss Lobatto (GL) quadrature

Overall, there is a close agreement between the Kaplan Meier estimate and the PGAM estimates despite the different function spaces that the corresponding estimators “live”: the space of all piecewise constant functions (KM) v.s. that of the smooth functions with bounded, continuous second derivatives (PGAM). Furthermore, the 95% confidence interval of each estimator (dashed lines) contain the expected value of the other estimator. This suggests that there is no systematic difference between the KM and the PGAM survival estimators. This was confirmed in simulated datasets (see Fig 2 in our PLoS One paper).

Survival Analysis With Generalized Additive Models : Part III (the baseline hazard)

May 2, 2015

In the third part of the series on survival analysis with GAMs we will review the use of the baseline hazard estimates provided by this regression model. In contrast to the Cox mode, the log-baseline hazard is estimated along with other quantities (e.g. the log hazard ratios) by the Poisson GAM (PGAM) as:

log(h(t_{i,j})) = \beta_{0}+\lambda(t_{t,j})+\boldsymbol{x\beta} = \lambda_{I}(t_{i,j})+\boldsymbol{x\beta}

In the aforementioned expression, the baseline hazard is equivalently modeled as a time-varying deviation (\lambda(t) ) from a constant (the intercept \beta_{0}) , or as a time-varying function (\lambda_{I}(t) ). In the latter case, the constant is absorbed into the smooth term. The choice between these equivalent forms is dictated by the application at hand; in particular, the intercept may be switched on or off by centering the smooth terms appearing in the call to the gam function. Hence, in the PGAM formulation the log-baseline hazard is yet another covariate that one estimates by a smooth function; other covariates may modify this hazard in a proportional fashion by additively shifting the log-baseline hazard (\boldsymbol{x\beta}).

In the “standard” way of fitting a PGAM by mgcv, the log-baseline hazard is estimated in the constant+deviation form. Exponentiation may be used to derive the baseline hazard and its standard errors. Continuing the analysis of the Primary Biliary Cirrhosis example from the second part of the series, we may write:

plot(fGAM,main="Gauss Lobatto (5)",ylab="log-basehaz")
plot(fGAM2,main="Gauss Lobatto (10)",ylab="log-basehaz")
plot(fGAM,main="Gauss Lobatto (5)",ylab="basehaz",trans=exp)
plot(fGAM2,main="Gauss Lobatto (10)",ylab="basehaz",trans=exp)
Log Baseline (top row) and Baseline (second row) hazard function in the PBC dataset for two different discretizations of the data

Log Baseline (top row) and Baseline (second row) hazard function in the PBC dataset for two different discretizations of the data. In all these cases, the baseline hazard (or its log) are given as time varying deviations from a constant (the value of the log-hazard where the confidence interval vanishes)

There is no substantial difference in the estimated obtained by the coarse (Gauss Lobatto (5)) and finer (Gauss Lobatto (10)) discretization. Note that as a result of fitting the log-hazard as constant+ time-varying deviation, the standard error of the curve vanishes at ~1050: the value of the log-hazard at that instant in events per unit time is provided by the intercept term.

Estimation of the log-baseline hazard allows the PGAM to function as a parametric, smooth alternative to the Kaplan Meier estimator. This will be examined in the fourth part of this series.


Survival Analysis With Generalized Models: Part II (time discretization, hazard rate integration and calculation of hazard ratios)

May 2, 2015

In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible.

Consider a set of sM individual observations at times \mathcal{F}=\left\{ F_{i}\right\} _{i=1}^{M} , with censoring indicators \mathcal{D}=\left\{ \delta_{i}\right\} _{i=1}^{M} assuming the value of 0 if the corresponding observation was censored and 1 otherwise. Under the assumption of non-informative censoring, the likelihood of the sample is given by:

L=\prod_{i=1}^{M}f(F_{i})^{\delta_{i}}S(F_{i})^{1-\delta_{i}}=    \prod_{i=1}^{M}h(F_{i})^{\delta_{i}}\exp\left(-\int_{0}^{F_{i}}h(t)\mathrm{\, d}t\right)

where h(t) is the hazard function. By using an interpolatory quadrature rule, one may substitute the integral with a weighted sum evaluated at a distinct number of nodes.


where t_{i,j}, w_{i,j}  are the nodes, weights of the integration rule and d_{i,j} is an indicator variable equal to 1 if the corresponding node corresponds to an event time and zero otherwise.  By including additional “pseudo-observations” at the nodes of the quadrature rule, we converted the survival likelihood to the kernel of a Poisson regression with variable exposures (weights).  Conditional on the adoption of an efficient quadrature rule, this is a highly accurate approximation:

A) relationship between (log) MST and the logarithm of the Maximum Hazard rate function for survival distributions with a cubic polynomial log baseline hazard function (B) Box plots of the GL error as a function of the number of nodes in the quadrature rule (C) GL error as a function of the length of the integration interval (taken equal to be equal to the MST for each distribution examined) for different orders of the quadrature rule (D) GL error as a function of the maximum value of the hazard rate for different orders of the quadrature rule.

Bounds of the Gauss Lobatto (GL) approximation error for the integration of survival data (MST=Mean Survival Time).

In order for the construct to work one has to ensure that the corresponding lifetimes are mapped to a node of the integration scheme.  In our paper, this was accomplished by the adoption of the Gauss-Lobatto rule. The nodes and weights of the Gauss-Lobatto rule (which is defined in the interval [-1,1] depend on the Legendre polynomials in a complex way. The following R function will calculate the weights and nodes for the N-th order Gauss Lobatto rule:

 while (max(abs(x-xold))>2.22044604925031e-16) {
 for (k in 2:N) {
 P[,k+1]=( (2*k-1)*x*P[,k]-(k-1)*P[,k-1] )/k;
 x<-xold-( x*P[,N1]-P[,N] )/( N1*P[,N1] )


which can be called to return a list of the nodes and their weights:

> GaussLobatto(5)
[1] -1.0000000 -0.6546537 0.0000000 0.6546537 1.0000000

[1] 0.1000000 0.5444444 0.7111111 0.5444444 0.1000000

[1] 4

To prepare a survival dataset for GAM fitting, one needs to call this function to obtain a Gauss Lobatto rule of the required order. Once this has been obtained, the following R function will expand the (right-censored) dataset to include the pseudo-observations at the nodes of the quadrature rule:

## GL : Gauss Lobatto rule
## data: survival data
## fu: column number containing fu info
## d: column number with event indicator
 ## append artificial ID in the set
 ## Change the final indicator to what
 ## was observed, map node positions,
 ## weights from [-1,1] back to the
 ## study time
 gam.ev=as.numeric((gam.ev | ev)*I(stop==1)),
 ## now merge the remaining covariate info

We illustrate the use of these functions on the Primary Biliary Cirrhosis dataset that comes with R:

> ## Change transplant to alive
> pbc$status[pbc$status==1]<-0
> ## Change event code of death(=2) to 1
> pbc$status[pbc$status==2]<-1
> head(pbc)
 id time status trt age sex ascites hepato spiders edema bili chol albumin copper
1 1 400 1 1 58.76523 f 1 1 1 1.0 14.5 261 2.60 156
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302 4.14 54
3 3 1012 1 1 70.07255 m 0 0 0 0.5 1.4 176 3.48 210
4 4 1925 1 1 54.74059 f 0 1 1 0.5 1.8 244 2.54 64
5 5 1504 0 2 38.10541 f 0 1 1 0.0 3.4 279 3.53 143
6 6 2503 1 2 66.25873 f 0 1 0 0.0 0.8 248 3.98 50
 alk.phos ast trig platelet protime stage
1 1718.0 137.95 172 190 12.2 4
2 7394.8 113.52 88 221 10.6 3
3 516.0 96.10 55 151 12.0 4
4 6121.8 60.63 92 183 10.3 4
5 671.0 113.15 72 136 10.9 3
6 944.0 93.00 63 NA 11.0 3
> GL<-GaussLobatto(5)
> pbcGAM<-GAMSurvdataset(GL,pbc,2,3)
> head(pbcGAM)
 id stop gam.dur t ev gam.ev start trt age sex ascites hepato spiders
1 1 0.00000 20.0000 400 1 0 0 1 58.76523 f 1 1 1
2 1 69.06927 108.8889 400 1 0 0 1 58.76523 f 1 1 1
3 1 200.00000 142.2222 400 1 0 0 1 58.76523 f 1 1 1
4 1 330.93073 108.8889 400 1 0 0 1 58.76523 f 1 1 1
5 1 400.00000 20.0000 400 1 1 0 1 58.76523 f 1 1 1
6 2 0.00000 225.0000 4500 0 0 0 1 56.44627 f 0 1 1
 edema bili chol albumin copper alk.phos ast trig platelet protime stage
1 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
2 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
3 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
4 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
5 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
6 0 1.1 302 4.14 54 7394.8 113.52 88 221 10.6 3
> dim(pbc)
[1] 418 20
> dim(pbcGAM)
[1] 2090 24

The original (pbc) dataset has been expanded to include the pseudo-observations at the nodes of the Lobatto rule. There are multiple records (5 per individual in this particular case) as can be seen by examining the data for the first patient (id=1). The corresponding times are found in the variable stop, their associated weights in the variable gam.dur and the event indicators are in the column gam.ev. Note that nodes and weights are expressed on the scale of the survival dataset, not in the scale of the Lobatto rule ([-1,1]). To fit the survival dataset one needs to load the mgcv package and fit a Poisson GAM, using a flexible (penalized spline) for the log-hazard rate function.

The following code will obtain an adjusted (for age and sex) hazard ratio using the PGAM or the Cox model:

library(survival) ## for coxph
> library(mgcv) ## for mgcv
> ## Prop Hazards Modeling with PGAM
> fGAM<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM,family="poisson",scale=1,method="REML")
> ## Your Cox Model here
> f<-coxph(Surv(time,status)~trt+age+sex,data=pbc)
> summary(fGAM)

Family: poisson 
Link function: log

gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345236 0.655176 -15.790 < 2e-16 ***
trt 0.069546 0.181779 0.383 0.702 
age 0.038488 0.008968 4.292 1.77e-05 ***
sexf -0.370260 0.237726 -1.558 0.119 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.008 1.015 4.186 0.0417 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.249 Deviance explained = 2.25%
-REML = 693.66 Scale est. = 1 n = 1560
> f
coxph(formula = Surv(time, status) ~ trt + age + sex, data = pbc)

 coef exp(coef) se(coef) z p
trt 0.0626 1.065 0.182 0.344 7.3e-01
age 0.0388 1.040 0.009 4.316 1.6e-05
sexf -0.3377 0.713 0.239 -1.414 1.6e-01

Likelihood ratio test=22.5 on 3 df, p=5.05e-05 n= 312, number of events= 125 
 (106 observations deleted due to missingness)

The estimates for log-hazard ratio of the three covariates (trt, age, and female gender) are numerically very close. Any numerical differences reflect the different assumptions made about the baseline hazard: flexible spline (PGAM) v.s. piecewise exponential (Cox).

Increasing the number of nodes of the Lobatto rule does not materially affect the estimates of the PGAM:

> pbcGAM2<-GAMSurvdataset(GL,pbc,2,3)
> fGAM2<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM2,family="poisson",scale=1,method="REML")
> summary(fGAM2)

Family: poisson 
Link function: log

gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345288 0.655177 -15.790 < 2e-16 ***
trt 0.069553 0.181780 0.383 0.702 
age 0.038487 0.008968 4.292 1.77e-05 ***
sexf -0.370340 0.237723 -1.558 0.119 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.003 1.005 4.163 0.0416 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.124 Deviance explained = 1.7%
-REML = 881.67 Scale est. = 1 n = 3120

Nevertheless, the estimates of the “baseline log-hazard” become more accurate (decreased standard errors and significance of the smooth term) as the number of nodes increases.

In simulations (see Fig 3) we show that the estimates of the hazard ratio generated by the GAM are comparable in bias, variance and coverage to those obtained by the Cox model. Even though this is an importance benchmark for the proposed method, it does not provide a compelling reason for replacing  the Cox model with the PGAM. In fact, the advantages of the PGAM will only become apparent once we consider contexts which depend on the baseline hazard function, or problems in which the proportionality of hazards assumption is violated. So stay tuned.

Of means, standard deviations and blood pressure trials

December 26, 2013

Last week, the long awaited Joint National Commission (JNC) released their 8th revision of their hypertension management guidelines (more than 10 years after the 7th version!). JNC 8 will join the plethora of guidelines (ASH/ISH, ESH, KDIGO and the ACC/AC advisory that will be upgraded to full guideline set later on) that may be used (or not) to justify (or defend) treating (or not treating) patients with elevated blood pressure (BP) down to pre-approved, committee blessed targets. Though it is not my intention to go into the guidelines (since opinions are like GI tracts: everyone has one, whereas biases are like kidneys: everyone has at least one and usually two), it is interesting to seize this opportunity and revisit one of the largest blood pressure trials that is used to defend certain aspects of the guidelines recommendations.

The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT) is one of the largest hypertension trials and was all the rage 12 years ago when failing to meet its primary endpoint i.e. the superiority of various first line antihypertensive agents, it reported differences in various secondary outcomes (including blood pressure control) favoring the oldest (chlorothalidone) v.s. the newest (lisinopril/amlodipine) medication. Since 2002 all agents used in ALLHAT have gone generic (and thus have the same direct costs), diluting the economic rationale for choosing one therapy v.s. the other. Nevertheless the literature reverbated for years (2003 , 2007 v.s. 2009) with intense debates about what ALLHAT showed (or didn’t show). In particular the question of the blood pressure differences observed among the treatment arms was invoked as one of the explanations for the diverging secondary outcomes in ALLHAT. In its simplest form this argument says that the differences of 0.8 (amlodipine)/2 (lisinopril) mmHg of systolic BP over chlorothalidone may account for some of the differences in outcomes.

I was exposed to this argument when I was an internal medicine resident (more than a decade ago) and it did not really make a lot of sense to me: could I be raising the risk for cardiovascular disease, stroke, and heart failure by 10-19% of the patient I was about to see in my next clinic appointment by not going the extra mile to bring his blood pressure by 2 mmHg? Was my medical license, like the 00 in 007’s designation,  a license to kill (rather than heal)? Even after others explained that the figures reported in the paper referred to the difference in the average blood pressure between the treatment arms, not the difference of blood pressure within each patient I could not get to my self to understand the argument. So more than a decade after the publication of ALLHAT I revisited the paper and tried to infer what the actual blood pressures may have had been (no access to individual blood pressure data) during the five years of the study. The relevant information may be found in Figure 3 (with a hidden bonus in Figure 1) of the JAMA paper which are reproduced in a slightly different form below (for copyright reasons):


The figure shows the Systolic (SBP) and Diastolic (DBP) pressures of the participants who continued to be followed up during the ALLHAT study; unsurprisingly for a randomized study there were almost no differences before the study (T=0). Differences did emerge during follow-up and in fact the mean/average (“dot”) in the graph was lower for chlorothalidone v.s. both amlodipine and lisinopril during the study. However, the standard deviation (the dispersion of the individuals receiving the study drugs around the mean: shown as the height of vertical line) was also larger (by about 17%) for lisinopril suggesting that there both more people with higher and lower blood pressures compared to chlorothalidone.

This pattern is also evident if one bothers to take a look at the flowchart of the study (Figure 1) which lists patterns/reasons for discontinuation during the 5th year. Among the many reasons, “Blood Pressure Elevation” and “Blood Pressure Too Low” are listed:

Chlorothalidone Amlodipine Lisinopril
N (total)
8083 4821 5004
BP High 84 38 131
BP Too Low 91 51 76
Other reasons for discontinuation
1698 963 1192

A garden variety logistic regression for the odds of discontinuation shows there is no difference between amlodipine and chlorothalidone due to high (p=0.16) or low (p=0.74) BP. On the other hand, the comparison betweeen lisinopril and chlorothalidone is more interesting:

  • Odds of discontinuation due to low BP: 1.44 95%CI: 1.00-2.06 (p=0.044)
  • Odds of discontinuation due to high BP: 3.38 95%CI: 2.35-4.87 (p<0.001)

So despite the higher mean, the higher standard deviation implies that there more patients with low (and too low) BP among the recipients of lisinopril. In other words, blood pressure control was more variable with lisinopril compared to the other two drugs: this variability can be quantified by using the reported means/standard deviations to look at the cumulative percentage of patients with BP below a given cutoff (for simplicity we base the calculations on the year 3 data):


So it appears that lisinopril (at least as used in ALLHAT) was able to control more patients to < 120/70 (which are low levels based on the current guidelines), but fewer patients at the higher end of the BP spectrum. The clinical translation of this observation is that there will be patients who will be ideal candidates for lisinopril (maybe to the point where dose reduction is necessary) and others who will fail to respond, so that individualization of therapy, rather than one size fits all is warranted. Such individualization may be achieved either on the basis of short shared physician/patient decision making, n of 1 trials, biomarker levels (e.g. home blood pressure measurements) or demographic profiling (as is done in JNC8 for African American patients).

Notwithstanding these comments, one is left scratching one’s head with the following questions:

  • who were the patients with an exaggerated and dampened out response to lisinopril in ALLHAT
  • could the variability in BP control provide a much better explanation for the variability in secondary outcomes in ALLHAT? (the investigators did apply what is known as a time-updated analysis using the observed BPs during the trial, but this is not the statistically proper way to analyze this effect in the presence of loss-to-follow up and possibly informative censoring)
  • what are the clinical implications of lowering BP to a given level when this is done with different classes of agents? This question is related to the previous one and both are not unaswerable with time-updated models of endogenous (such as BP readings) variables

At a more abstract level, should be scrutinize paper tables for the means as well as the standard deviations of response variables looking for hidden patterns that may not evident at a first look? In the clinic one is impressed with the variability of the patient responses to interventions, yet this variability is passed over when analyzing, reporting and discussing trial results in which we only look at the means. To me this is seems a rather deep inconsistency between our behaviours and discourse with our Clinician v.s. our Evidence Basis Medicine hats on, which may even decrease the chances of finding efficacious and effective treatments. Last but certainly not least, how can be begin to acknowledge variability in trial design, execution, analysis and reporting so as to better reflect what actually happens in the physical world, rather than the ivory towers of our statistical simulators?


The agnostic approach to effectiveness

September 3, 2013
To the agnostic, real world experience reins supreme when it comes to evaluating effectiveness. The Randomized Controlled Trial (RCT) results are only relevant to the extent they demonstrate that both success and failure are indeed possible when using a new therapy. However, the numerical (efficacy) estimates are not relevant for the agnostic and should be discounted when real world effectiveness is to be assessed. The latter can only be ascertained by looking at the outcomes of actual patients so in order to acknowledge the RCT results, they too need to be cast in this format. The extreme agnostic attitude would do so in a manner that minimizes to the greatest possible extent the impact of the reported RCT efficacy on inferences, e.g. by appraising it as worthy of one success and one failure in real world patients. These “pseudocases” are added to the corresponding numbers obtained in the real world and the agnostic proceeds to apply Bayes theorem. Mathematically the agnostic assumes a prior in which the prior probability of success can be any number between 0  and 1 (the Laplace prior) to represent the two extreme viewpoints of the therapy, i.e. it is either rat poisson or holy water.
The agnostic attitude will not mislead the clinician, even for small number of real world cases:
but it will severly impair the clinician’s ability to make precise statements:
until the time that an extremely large body of evidence has been analyzed:
Hence, the agnostic does pay a premium by not fully acknowledging the trial results. This premium, which can be described as being too uncertain about one’s uncertainty may even have practical implications e.g. if a patient decides against the new therapy because of the clinician’s uncertainty regarding the therapy of evidence. It is important thus for the skeptic to keep in mind that lack of evidence is not evidence of lack!

The agnostic approach to effectiveness

September 3, 2013

The second attitude towards effectiveness is an agnostic one; such a clinician  considers previous real world experience more relevant than efficacy figures when it comes to assessing effectiveness. Randomized Controlled Trial (RCT) results are not dismissed, but are discounted nonetheless at a very high rate. To the agnostic reading about the efficacy of an intervention in an RCT only implies that both success and failure are possible outcomes but the reported figures are not relevant.  If the outcomes in real world patients rein supreme, then the trial results should be quantified in such a way as to correspond to such an experience. An extreme version of agnosticism will mathematically translate such an assessment to have the minimum possible influence, or equivalently the least number of “additional” patients that should be added to the real world record: one success and one failure “pseudocase”.

Four basic attitudes towards efficacy and its relation to effectiveness

September 1, 2013

I will continue this series of posts regarding the appraisal of efficacy (“how well a therapeutic intervention worked in a randomized experiment”) and its translation to statements about effectiveness (“how well the intervention worked in the real world”), by considering the attitudes that one adopt towards these issues. The aim is to develop a sophisticated approach, or rather a vantage point that one would almost always want to adopt when considering the implications of having data about the efficacy (results of a trial) and effectiveness (success rate in real world practice). However the vantage point will only become evident by considering a basic set of attitudes, which are described here: (more…)

Reflections on Effectiveness = Efficacy / 2

August 20, 2013

My recent post on the initial appraisal of a therapy’s effectiveness based on a randomized trial reporting specific data about its efficacy generated some interesting comments on twitter. In particular, David Juurlink (@DavidJuurlink) commented:

@ChristosArgyrop @medskep meant on E=E/2, which I’d say the post-RALES experience with spironolactone invalidates

The point that David made seems to be that that the formula Effectiveness = Efficacy/2 is way too conservative and that the post-RALES experience illustrates this point. This is a great objection, one that lies at the heart of inductive reasoning which is what we essentially do when we speak about either effectiveness or efficacy. To answer this objection (both in its specific post-RALES and its more general form) I will need a couple of posts but first I believe a little bit of background is called for.

RALES was a landmark trial, published almost 15 years ago, about a novel approach (a drug called spironolactone) to treating heart failure, a condition with a very high mortality and hospitalization rate. RALES showed an almost 30% reduction in the risk of death and was a paradigm shifting study: immediately after the publication of RALES predescriptions of spironolactone increased worldwide and in 2013 many of these patients are taking on spironolactone-like.
It is the personal opinion of many cardiologists (and mine) that spironolactone saved the life’s of their real world patients (ie the drug is effective), yet the published track record is not that clear, with partially mixed evaluations of outcomes at least in the elderly and safety concerns (in my opinion,also held by others, almost entirely due to wild extrapolation of study results  inappropriate use and inadequate monitoring by prescribing physicians). It is precisely such considerations that called for further evaluation of the efficacy, effectiveness and safety of the drug almost immediately (in the time scale of clinical research) after the publication of RALES.

So the Effectiveness = Efficacy/2 shorthand formula seems to be vindicated by the track record of spironolactone since RALES. However I would go even further and claim that the drug would have not missed on its potential to save lives in the real world and be more widely used today had this viewpoint been adopted from the outset.

To see why note that the rule has a companion concerning what I call the ‘sail-through rate’: the proportion of real-world patients who will not experience an adverse event while taking the therapy. The healthy skeptic who is using the trial data and nothing else should also expect the sail-through rate to be half the one reported in the trial publication.
Hence, the combination of these too evaluations (which are really flipsides of the same mathematical coin) might have led a more cautious adoption of spironolactone as physicians would have halved their initial expectations about the benefit and the lack of trouble. What happened is that the prescriptions increased by 700% and complications (mainly hyperkalemia and renal failure/insufficiency/injury) sky-rocketed as physicians held out the scripts in expectation of benefit for their patients. The skeptic approach might have led one to become more familiar with the drug’s pharmacological and safety profile. This could have taken for example to one spending some quality time with a clinical pharmacology textbook to refresh one’s memory. Even better, one could adopt the dosing/monitoring protocol used by the randomized trial trying to reproduce the results in his or her practice. That physician can probably be much less conservative in his or her assessment of effectiveness. Rather than saying that the effectiveness (proportion of responders) can be any number between 0 and trial efficacy (a heuristic with which to understand the mathematics behind the rule) this physician even expect that the trial data will reflect the real world experience.

This is a crucial point and one that is rarely made: one of the causes of the apparent failure of increased efficacy to translate to increased effectiveness has to do with contextual elements that are not adopted in the real world. For spironolactone this requires pre-treatment screening and frequent monitoring of renal function and potassium levels, as the post-RALES publication record reveals.

In summary I feel that the mathematics of skepticism were in fact validated by RALES. However one would like to do better and answer the objection of David in the general case: at what point and how do we dispense with the skepticism? When is initial skepticism justified? How do we combine real word experience with expectations based on trial data and a priori beliefs about a given therapy’s benefits to inform our knowledge and advice our patients?

These will be covered in follow up posts.