Archive for the ‘Bayesian’ Category

Survival Analysis With Generalized Additive Models : Part I (background and rationale)

May 1, 2015

After a really long break, I’d will resume my blogging activity. It is actually a full circle for me, since one of the first posts that kick started this blog, matured enough to be published in a peer-reviewed journal last week. In the next few posts I will use the R code included to demonstrate the survival fitting capabilities of Generalized Additive Models (GAMs) in real world datasets. The first post in this series will summarize the background, rationale and expected benefits to be realized by adopting GAMs from survival analysis.

In a nutshell, the basic ideas of the GAM approach to survival analysis are the following:

  1. One approximates the integral defining the survival function as a discrete sum using a quadrature rule
  2. One evaluates the likelihood at the nodes of the aforementioned quadrature rule
  3. A regression model is postulated for the log-hazard rate function
  4. As a result of 1-3 the survival regression problem is transformed into a Poisson regression one
  5. If penalized regression is used to fit the regression model, then GAM fitting software may be used for survival analysis

Ideas along the lines 1-4 have been re-surfacing in the literature ever since the Proportional Hazards Model was described. The mathematical derivations justifying Steps 1-4 are straightforward to follow and are detailed in the PLoS paper. The corresponding derivations for the Cox model are also described in a previous post.

Developments such as 1-4 were important in the first 10 years of the Cox model, since there were no off-the-shelf implementations of the partial (profile) likelihood approach. This limited the practical scope of proportional hazards modeling and set off a parallel computational line of research in how one could use other statistical software libraries to fit the Cox model.  In fact, the first known to the author application of a proportional model for the analysis of a National Institute of Health (NIH) randomized controlled trial used a software implementing a Poisson regression to calculate the hazard ratio. The trial was the NCDS trial that examined adequacy indices for the hemodialysis prescription (the description of the software was published 6 months prior to the clinical paper).  Many of these efforts were computationally demanding and died off as the Cox model was implemented in the various statistical packages after the late 80s and semi-parametric theory took off and provide a post-hoc justification for many of the nuances implicit in the Cox model.  Nevertheless, one can argue that in the era of the modern computer, no one really needs the Cox model. This technical report and the author’s work on a real world, complex dataset provides the personal background for my research on GAM approaches for survival data.

The GAM (or Poisson GAM, PGAM as called in the paper) is an extension of these old ideas (see the literature survey here and here). In particular, PGAM models the quantities that are modeled semi-parametrically (e.g. the baseline hazard) in the Cox model with parametric, flexible functions that are estimated by penalized regressio. One of the first applications of penalized regression for survival analysis is the Fine and Gray spline model, which is however not a PGAM model.  There are specific benefits to be realized from penalizing the Poisson regression and adopting GAMs  in the context of survival analysis:

  • Parsimony: degrees of freedom are not wasted as penalization will seek the most parsimonious representation (fewer degrees of freedom) among the many possible functions that may fit the data
  • Reduction of the analyst-degrees-of freedom: the shape of the functional relationships between survival and outcome are learned from the data. This limits the potential of someone putting a specific form for this relationship (e.g. linear v.s. quadratic) and running away with the most convenient p-value
  • Multiple time scale modelling: one can model more than one time scales in a dataset (i.e. not just the study time). This is useful when adjusting for secular trends in an observational dataset or even in a randomized trial. In particular cluster randomized trials at the community level may be immune to secular trends
  • Non-proportional hazards modeling: when the hazards are not proportional, the Cox model is not applicable. Many oncology datasets will demonstrate a deviation from proportionality (in fact we re-analyzed such a trial in the PLoS paper) . For a particular dataset, one would like to know whether the proportionality assumption is violated, and if so one would like to “adjust” for it. Such an adjustment will take the form of a time-varying hazard ratio function and these may be estimated with the PGAMs. In such a case, one can even extract an “average” hazard ratio while still estimating a time-varying deviation around it using the PGAM. However non-proportionality should shift the analyst to:
  • Alternative measures of treatment effect: These may include relative risks, absolute risk differences of even differences in the (Restricted) Mean Survival Time. Such measures are calculated from the time varying hazard ratios using statistical simulation techniques
  • Handling of correlated outcomes: Correlated outcomes may arise from center effects, multiple events in the same individual or even cluster randomization designs. The analysis of such outcomes is facilitated by the interpretation of the PGAM as a generalized linear mixed model and the introduction of the corresponding random effects and their correlation structure into the regression
  • Center effects: A variety of modeling options are available including stratified hazards, fixed or random effects
  • Subgroup analyses
  • Time varying external covariates
  • Unconditional/Conditional/Population averaged effects: The unconditional estimate is obtained by indexing the individuals with the group they are assigned to (e.g. placebo or drug in an interventional trial). The conditional estimate is obtained by introducing covariates (e.g. gender, age) into the regression to calculate effects for individuals that share these characteristics. The population effect averages over the conditional effects over all the individuals in the trial. In epidemiology it is known as the corrected group prognosis method. This was introduced in a JAMA paper almost 15 years ago as a way to generate adjusted survival probability curves
  • Handling of right censored/left truncated/uncensored data

These benefits follow directly from the  mixed model equivalence between semi-parametric, penalized regression and Generalized Mixed Linear Models. An excellent, survey may be found here, while Simon Wood’s book in the GAM implementation of the mgcv package in R contains a concise presentation of these ideas.

As it stands the method presented has no software  implementation similar to the survival package in R. Even though we provide R code to run the examples in the paper, the need for the various functions may not be entirely clear. Hence the next series of posts will go over the code and the steps required to fit the PGAM using the R programming language.

Where has evidence based medicine taken us in 20 years?

January 10, 2014

This one of the best appraisals of evidence based medicine(EBM):

It highlights one important pitfall of EBM, ie its failure to integrate scientific(biological) plausibility in the pyramid of evidence.

I think the best way to effect a synthesis between evidence from clinical data and biological plausibility is through a Bayesian argument:

– Decide on a standard EBM approach to synthesize the clinical data regarding a hypothesis from multiple sources eg a meta-analysis
– Encode the non-clinical, background knowledge into a prior. This prior encodes the scientific plausibility of the hypothesis under question

A Bayesian analysis of the clinical data under a scientific plausibility prior provides a transparent way to leverage the pyramid evidence of EBM while providing a basic science/disease understanding context for the interpretation of clinical data.

Even with non-informative priors, the Bayesian approach is preferable to frequentist linear regression

December 2, 2013

A unique aspect of the Bayesian approach is that it allows one to integrate previous knowledge (“prior beliefs“) with current data (“likelihood“). Yet even in those cases in which non-informative priors are used (as in here and here) , the Bayesian approach is preferable due to its conservative nature. (more…)

The little non-informative prior that could (be informative)

November 26, 2013

Christian Robert reviewed on line a paper that was critical of non-informative priors. Among the points that were discussed by him and other contributors (e.g. Keith O’Rourke), was the issue of induced priors, i.e. priors which arise from a transformation of original parameters, or of observables. I found this exchange interesting because I did something similar when revisiting an old research project that had been collecting digital dust in my hard disk. The specific problem had to do with analysis of a biomarker that was measured with a qualitative technique yielding a binary classification of measurements as present or absent, in two experimental conditions (call them A and B). Ignoring some technical aspects of the study design, the goal was to calculate the odds ratio of the biomarker being expressed in condition B v.s A (the reference state signifying absence of disease).

When writing the programs for the analysis, I defaulted to the N(0.0,1.0E-6) prior that epitomizes non-informativeness in BUGS. However, one of my co-authors asked the “What the @#$%& does this prior mean?” question. And then we stopped … and reflected on what we were about to do. You see, before the experiment started we had absolutely no prior information about the behaviour of the biomarker in either experimental state so that we did not want to commit one way or another. In other words, Laplace’s original uniform (or Beta(1,1)) prior would have been reasonable if the expression data for  A and B were to be analyzed separately. However, we wanted to analyze the data with a logistic regression model, so was the ubiquitous N(0.0,1.0E-6) the prior we were after?

The answer is a loud NO! According to Wikipedia, the mother of all knowledge, the logistic transformation of a uniform variate is the logistic distribution with location of zero and scale of 1. Hence, the prior on the intercept of the logistic regression (interpretable as the odds of the biomarker being expressed in state A) had to be a Logistic(0,1).


Surprisingly the Odds Ratio of B v.s. A was found (after trial and error and method of moments considerations) to be very well approximated by a 1:1 mixture of a logistic and a Gaussian which clearly departs from the N(0.0,1.0-6) prior we (almost) used:


Bottom line: Even informative (in the BUGS sense!) priors can be pretty non-informative in some intuitively appropriate parameterization. Conversely, one could start with a non-informative prior in a parameterization that is easier to reason about and look for an induced prior (using analytic considerations or even simulations) to convert it to a parameterization that is more appropriate to the analytic plan at hand.

(R code for the plots and simulations is given below)

## approximating uniforms
logit<-function(x) log(x/(1-x))
## logistic is logit of a uniform
      "Logistic(0,1)"),lty=1,col=c("blue","red") )

## approximating the difference of two uniforms
     xlim=c(-10,10),main="OR between two U(0,1)",
## logistic approximation
## normal
## mixture of a logistic and a normal approximation
## legends
ML<-expression(paste("0.5 Normal(0,",pi*sqrt(2/3),")+0.5 Logistic(0,",sqrt(2),")"))
       lty=1,col=c("blue","red","green") )

## does it extend to more general cases?

Edit (29/11/2013):
Updated the first image due to an accidental reversal of the distribution labels

Bayesian Linear Regression Analysis (with non-informative priors but without Monte Carlo) In R

November 24, 2013

Continuing the previous post concerning linear regression analysis with non-informative priors in R, I will show how to derive numerical summaries for the regression parameters without Monte Carlo integration. The theoretical background for this post is contained in Chapter 14 of Bayesian Data Analysis which should be consulted for more information.

The Residual Standard Deviation

The residual standard deviation \sigma is just the square root of the residual variance \sigma^2 which has a scaled inverse chi-square distribution given the data and the covariates: \sigma^2 \sim Scale-inv-\chi^2(\nu,s^2) with \nu=N-p and s^2 are the residual degrees of freedom and residual variance reported by the (frequentist) least square fit (R function lm). Possible point estimates for the residual standard deviation are the posterior mean, the mode and the median which can be obtained through (numerical) integration of the probability density function (PDF), maximization of the PDF and the quantile function respectively. The standard deviation may also be obtained via univariate numerical integration of the PDF. To obtain the latter, we apply a standard change of variables transformation to the scaled inverse chi-square PDF to obtain:

p(\sigma|y) = \frac{s^2 \nu/2}{\Gamma(\nu/2)}\times exp(-\frac{\nu s^2}{2 \sigma^2})\times \sigma^{-(n+2)}

where \Gamma(x) is the Gamma function. Comparing expressions, the Cumulative Density Function CDF of \sigma|y can be seen to be equal to the survival function of the gamma distributed random variable : 1-F_{\Gamma}(\sigma^{-2};\frac{\nu}{2},\frac{2}{\nu s^2}), where F_{\Gamma}(x;k,\theta) is the value, at x, of the CDF of the gamma variable with degrees of freedom and scale k, \theta respectively. The median (or any other quantile i.e. latex q$) can be obtained by solving the equation:

1-F_{\Gamma}(\sigma^{-2};\frac{\nu}{2},\frac{2}{\nu s^2}) = q

No closed form appears to exist for the quantiles (median, lower and upper limits of a symmetric credible interval, CrI) so that the equation above needs to be solved numerically. Since the marginal \sigma |y is a skewed distribution, the mean bounds (from above) both the median and the lower limit of thecredible (not to be confused with a confidence!) interval. Similarly, the upper limit of the CrI is bound by the corresponding limit of a Gaussian with mean and standard deviation equal to the values we previously obtained through numerical integration. These observations are used to restrict the range over which numerical optimization will be carried out, to avoid false convergence or outright failures.

The mode can be obtained by direct differentiation of the PDF and is given by the closed form expression: s\sqrt{\frac{n}{n+1}}

The Regression Coefficients

The marginal distribution of the entire coefficient vector \beta | y is a multivariate T distribution with location vector \hat{\beta} (obtained with lm) and scale matrix V_ {\beta}\sigma^2. Both these quantities may be obtained from the values returned by lm. To calculate numerical summaries for each of the components of this vector the other components need to be integrated out (marginalization). It is a basic property of the multivariate T distribution that its marginal  distributions are univariate t so that the following relation holds for each regression parameter:

\frac{\beta_i - \hat{\beta_i}}{s \sqrt{V_{\beta}[i,i]}} \sim t_{\nu}

Since this is a symmetric unimodal distribution, the mean, mode and median coincide these are all equal to the maximum likelihood estimates, while its standard deviation is available in closed form: \sqrt{\frac{n}{n-2}}. Since the quantiles univariate t are already available in R there is no need to write special code to obtain any of numerical summaries of the regression coefficients: one simply calculates the relevant quantities from the univariate t with degrees of freedom equal to \nu =n-p once, and translates/scales using the appropriate elements of the location vector and scale matrix.

The Gory Details

Two R functions are used to carry out the aforementioned tasks, i.e. one function that provides the integrand for the mean of the scaled inverse chi square distribution:

## integrand for the mean of the scaled inverse
##chisq2 with df=n, scale=t2

and a second (much longer!) function that receives the fitted lm object and calculates the numerical summaries for the regression coefficients and the residual standard deviation:

	## extract coefficients, dfs and variance
	## matrix from lm
    	R<-qr.R(QR) ## R component
    	Vb<-chol2inv(R) ## variance(unscaled)

	## standard errors of the univariate t
	## dataframe for returned values
	## CrI limits for t-distributed quantities

	## make extra space for sigma
	## mean of scale
	## expectation of the square of the scale
	## (which is equal to the expectation of the
	## initial inverse chi square distribution
	ret[3,1]<-M1	## mean
	ret[3,2]<-sqrt(S1-M1^2)	## sd
	ret[3,3]<-ret[3,1]/ret[3,2]	## t
	ret[3,4]<-sqrt(s2*df/(df+1)) ## mode
	## calculate quantiles - these take
	## place in the scale of the precision
	## median
	ret[3,5]<-uniroot(function(x) pgamma(x,
	## lower limit of 95% CrI
	ret[3,6]<-uniroot(function(x) pgamma(x,
	## upper limit of 95% CrI
	ret[3,7]<-uniroot(function(x) pgamma(x,
	## raise to -1/2 to change from
	## precision to standard deviations


To see some action I will revisit the linear regression example introduced in the first post and compare frequentist (lm) and Bayesian estimates obtained with Monte Carlo and non-Monte Carlo approaches:

 summary(lmfit) ## Frequentist

lm(formula = weight ~ group)

    Min      1Q  Median      3Q     Max
-1.0710 -0.4938  0.0685  0.2462  1.3690

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.2202  22.850 9.55e-15 ***
groupTrt     -0.3710     0.3114  -1.191    0.249
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared:  0.07308,   Adjusted R-squared:  0.02158
F-statistic: 1.419 on 1 and 18 DF,  p-value: 0.249

> t(apply(bf,2,Bayes.sum)) ## Bayesian (MC)
                  mean        se         t     median   CrI.2.5% CrI.97.5%
(Intercept)  5.0321144 0.2338693 21.516785  5.0326123  4.5709622 5.4947022
groupTrt    -0.3687563 0.3305785 -1.115488 -0.3706862 -1.0210569 0.2862806
sigma        0.7270983 0.1290384  5.634745  0.7095935  0.5270569 1.0299920
> bayesfitAnal(lmfit) ## Bayesian (non-MC)
                  coef        se         t       mode     median   CrI.2.5. CrI.97.5.
(Intercept)  5.0320000 0.2335761 21.543296  5.0320000  5.0320000  4.5412747 5.5227253
groupTrt    -0.3710000 0.3303265 -1.123131 -0.3710000 -0.3710000 -1.0649903 0.3229903
sigma        0.7271885 0.1295182  5.614566  0.6778158  0.7095623  0.5262021 1.0298379

The conservative, skeptical nature of the Bayesian inference (wider confidence intervals, due to proper acknowledgement of uncertainty) is evident no matter the numerical approach (Monte Carlo v.s. numerical integration) one uses for the inference. Although the numerical estimates of the regression coefficients agree (up to three decimal points) among the three numerical approaches, residual variance estimates don’t for this relatively small data set.

Monte Carlo estimates are almost as precise as the numerical integration for things like the mean and standard error of the estimated parameters, yet they lose precision for extreme quantilies. Monte Carlo is also slower, which may become an issue when fitting larger models.

That was all folks! In subsequent posts Iwill explore features of the Bayesian solution  that will (hopefully) convince you that it is a much better alternative (especially when it doesn’t bring tears to your eyes!) relative to the frequentist one. If you use any of the code in these posts, I’d appreciate if you can drop a line 🙂

Bayesian linear regression analysis without tears (R)

November 17, 2013

Bayesian methods are sure to get some publicity after Vale Johnson’s PNAS paper regarding the use of Bayesian approaches to recalibrate p-value cutoffs from 0.05 to 0.005. Though the paper itself is bound to get some heat (see the discussion in Andrew Gelman’s blog and Matt Briggs’s fun-to-read deconstruction), the controversy might stimulate people to explore Bayesianism and (hopefully!) to move away from frequentist analyses. (more…)

Page Rev Bayes – we found statistical irregularities in a randomized controlled trial

November 9, 2013

The Bayesian counterpart to the frequentist analysis of the Randomized Controlled Trial is in many aspects more straightforward than the Bayesian analysis. One starts with a prior probability about the probability of a patient being assigned to each of the three arms and combines it with the (multinomial) likelihood of observing a given assignment pattern in the 240 patients enrolled in the study. Bayes theorem gives the posterior probability quantifying our belief about the magnitudes of the unknown assignment probabilities. Note that testing the strict equality is bound to lead us straight to the arms of the Lindley paradox so that a different approach is likely to be more fruitful. Specifically, we specify a maximum tolerable threshold for the difference between the maximum and the minimum probability of being assigned the trial arms (let’s say 1-5%) and we directly calculate the probability for this difference (“probability of foul play”).

In the absence of prior evidence for (or against) foul play we use a non-informative prior in which all possible values of assignment probabilities are equally plausible. This (Dirichlet) prior corresponds to a prior state of knowledge in which three individuals were randomized and all three ended up in different treatment arms. Under this prior, the posterior distribution is itself a Dirichlet distribution with parameters equal to the number of individuals actually assigned to each arm+1. The following R code may then be used to calculate the probability of foul play, as previously defined i.e.


This probability comes down to 0.4% which is numerically close to the frequentist answer, yet with a more intuitive interpretation: based on the observed trial sizes and a numerical tolerance for the maximum tolerable difference in assignment probability the odds for “foul play” are 249:1.
Increasing the tolerance will obviously decrease these odds, but in such a case we would be willing to tolerate larger differences in assignment probabilities. Although these results are mathematically trivial (and non-controversial), the plot will become more convoluted when one proceeds to use them to make a declaration of “foul play”. For in that case, a decision needs to be made which has to consider not only the probability of the uncertain events: “foul play” v.s. “not foul play” but also the consequences for the journal, the study investigators and the scientific community at large. At this level one would need to decide whether the odds of 249:1 are high enough or not for subsequent action to be taken. But this consideration will take us to the realm of decision theory (and it already 11pms).

Is this evidence of scientific fraud?

October 30, 2013

(the names of countries, journals, interventions and sample sizes have been changed to protect the potentially innocent and to avoid cognitive biases, invocations of  stereotypes and accusations that could lead to World War Z)

Two years ago the prestigious medical journal The Spleen published a randomized controlled trial that evaluated three therapies: genetically modified leeches (GML), standard leeches (SL) and mechanical blood-letting (MBL, standard of care) for acute complicated absentmindedness (ACA). This single center study randomized, in 1:1:1 ratio , 240 patients who presented at the Emergency Department of the Grand Fenwick Memorial Hospital and concluded that GML was associated with a 90% improvement in outcomes relative to SL and MBL. The lead author, a prominent absentmindedneologist and President of the corresponding Society of the Duchy of Grand Fenwick, concluded that GML should become the new standard of care and that SL and MCL should not be offered except as second line treatment for patients failing GML.


Robert Heinlein and the distinction between a scientist and an academician

October 27, 2013

Robert Heinlein, the author of Starship Troopers(no relationship to the movie by the way) wrote an interesting paragraph in his 1939 short story Life-Line:

One can judge from experiment, or one can blindly accept authority. To the scientific mind, experimental proof is all important and theory is merely a convenience in description,
to be junked when it no longer fits. To the academic mind, authority is everything and facts are junked when they do not fit theory laid down by authority.

This short paragraph summarises the essence of the differences between Bayesian (scientific mind) and frequentist (academic mind) inference or at least their application in scientific discourse.

For objective Bayesians, models are only convenience instruments to summarise and describe possibly multi-dimensional data without having to carry the weight of paper, disks, USB sticks etc containing the raw points. Parameters do the heavy lifting of models and the parametric form of a given model may be derived in a rigorous manner using a number of mathematical procedures (e.g. maximum entropy). Given such a specification, one can use an empiric body of data D to calculate P(M|D) sequentially rejecting models that do not fit (a nice example is given in the second section of Jayne’s entropy concentration paper).

Now consider the situation of the frequentist mind: even though one can (and most certainly will!) use a hypothesis test (aka reject the null) to falsify the model, the authoritarian can (and most certainly will!) hide behind the frequentist modus operandi and claim that only an unlikely body of data was obtained, not that an inadequate model was utilized.

This is seen (and in fact enforced) in the discussion of every single scientific paper under the standardized second-to-last paragraph about ‘alternative explanations’. This section is all about bowing down to authority, offering (often convoluted and mystical) explanations that the data obtained are at fault and that the model falsified in the results section is in fact true. Depending on the weight of the authoritative figures who will write the commentary about the aforementioned paper, we can (and most certainly will!) end up in the highly undesirable situation of falsifying data rather than models.

Compare this with the hypothetical structure of a Bayesian paper in which the alternative hypotheses would be built as alternative models (or values of parameters) only to be systematically compared and the ill-fitting models (even those held to be true by academic figures of the highest authority) are triaged to oblivion.

As a concluding statement, note that our systematic failure to respond to the financial crisis or even to advance science in the last 3-4 decades can be traced to the dominating influence of academicians over scientists. Rather than systematically evaluating evidence for or against particular models in specific domains, we seem to only judge models/explanations by the authority/status of their proponents, a situation not unlike the one in the 30s when Heinlein wrote the aforementioned piece.

The agnostic approach to effectiveness

September 3, 2013

The second attitude towards effectiveness is an agnostic one; such a clinician  considers previous real world experience more relevant than efficacy figures when it comes to assessing effectiveness. Randomized Controlled Trial (RCT) results are not dismissed, but are discounted nonetheless at a very high rate. To the agnostic reading about the efficacy of an intervention in an RCT only implies that both success and failure are possible outcomes but the reported figures are not relevant.  If the outcomes in real world patients rein supreme, then the trial results should be quantified in such a way as to correspond to such an experience. An extreme version of agnosticism will mathematically translate such an assessment to have the minimum possible influence, or equivalently the least number of “additional” patients that should be added to the real world record: one success and one failure “pseudocase”.