Archive for the ‘mathematical statistics’ Category

Sequential Fitting Strategies For Models of short RNA Sequencing Data

June 18, 2017

After a (really long!) hiatus I am reactivating my statistical blog. The first article clarifies a point made in the manual of our recently published statistical model for short RNA sequencing data.
The background for this post, in case one wants to skip reading the manuscript (please do read it!), centers on the limitations of existing methods for the analysis of data for this very promising class of biomarkers. To overcome these limitations our group, comprising investigators from the Division of Nephrology, University of New Mexico and the Galas Lab at Pacific Northwest Research Institute, introduced a novel method for the analysis of short RNA sequencing (sRNAseq) data. This method (RNAseqGAMLSS), which was derived from first principles modeling of the short RNAseq process, was shown to have a number of desirable properties in an analysis of nearly 200 public and internal datasets:

  1. It can quantify the intrinsic, sequencing specific bias of sRNAseq from calibration, synthetic equimolar mixes of the target short RNAs (e.g. microRNAs)
  2. It can use such estimates to correct for the bias present in experimental samples of different composition and input than the calibration runs. This in turn opens the way for the use of short RNAseq measurements in personalized medicine applications (as explained here)
  3. Adapted to the problem of differential expression analysis, our method exhibited greater accuracy, higher sensitivity and specificity than six existing algorithms (DESeq2, edgeR, EBSeq, limma, DSS, voom) for the analysis of short RNA-seq data.
  4. In contrast to these popular methods which force the expression profiles to have a certain form of symmetry (equal number and magnitude of over-expressed and under-expressed sequences), our method can be used to discover global, directional changes in expression profiles which are missed by the aforementioned methods. Accounting for such a possibility may be appropriate in certain instances, in which the disease process leads to loss or gain in the number of cells of origin of the affected organ/tissue.

The proposed methodology, which is based on Generalized Additive Models for Location, Scale and Shape (GAMLSS), involves the fitting of simultaneous regression models for the location (mean) and the scale (dispersion) of sequence counts using either the Negative Binomial or a particular parameterization of the Normal distribution. However, there is a price to pay for the advantages of RNASeqGAMLSS over alternatives: a small (but not infinitesimal) probability[1] that the fitting algorithm will fail to execute successfully. In the manual of our method (Section 6.4) we explain that a numerically more stable way of fitting these complex models exists and should be adopted if one encounters numerical optimization errors with the default approach used in the Nucleic Acids Research (NAR) manuscript. The three steps of this sequential fitting strategy are as follows:

  1. One fits a Poisson count mixed model to the RNAseq data, to obtain estimates of the relative mean expression for the different short RNA species in the expression profile
  2. These estimates are used to fix the values of the mean parameter model of the RNASeqGAMLSS model while estimating the values of the dispersion parameters.
  3. Finally, one uses the values of the mean (Step 1) and dispersion (Step 2) parameters to fit the general RNASeqGAMLSS model

In essence, one ignores the overdispersion (additional variability) of the short RNAseq data (Step 1) to guide the algorithm towards estimates of the dispersion parameters (Step 2). Finally, one uses the separate estimates of the mean (Step 1) and dispersion (Step 2) parameters as the initial point for the simultaneous estimation of both (Step 3). This approach works because the dispersion parameters do not affect the mean parameters, so that the Poisson distribution of Step 1 has the same mean as the RNASeqGAMLSS model. Hence the estimates produced by Step 1 are identical (to the limit of numerical precision) to those that would have been produced by a successful execution of the RNASeqGAMLSS optimization algorithms. Fixing these values when fitting the RNASeqGAMLSS model in Step 2 facilitates estimation of the dispersion parameters. Having very good initial guesses for both sets of parameters virtually guarantees convergence of the third step (which is the only step in the NAR paper).
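The logic of the three steps can be tried on a toy problem. The sketch below is not RNASeqGAMLSS: it assumes a plain two-group Negative Binomial model without random effects, simulated counts, and the base R routines glm, optimize and optim in place of TMB. All object names are made up for the illustration.

```r
## Toy illustration of the sequential strategy (assumption: two groups,
## Negative Binomial counts, no random effects; NOT the RNASeqGAMLSS model)
set.seed(2017)
n_per_group <- 200
group <- factor(rep(c("A", "B"), each = n_per_group))
mu_true <- c(50, 120)[as.integer(group)]             # true group means
y <- rnbinom(length(group), mu = mu_true, size = 5)  # overdispersed counts

## Step 1: ignore overdispersion, fit a Poisson model for the mean
fit_pois <- glm(y ~ group, family = poisson)
mu_hat <- fitted(fit_pois)

## Step 2: hold the mean fixed, estimate only the (log) dispersion parameter
nll_disp <- function(log_size) {
  -sum(dnbinom(y, mu = mu_hat, size = exp(log_size), log = TRUE))
}
step2 <- optimize(nll_disp, interval = c(-5, 10))

## Step 3: estimate all parameters jointly, starting from Steps 1 and 2
X <- model.matrix(~ group)
nll_full <- function(par) {
  mu <- exp(drop(X %*% par[1:2]))
  -sum(dnbinom(y, mu = mu, size = exp(par[3]), log = TRUE))
}
start <- c(coef(fit_pois), log_size = step2$minimum)
step3 <- optim(start, nll_full, method = "BFGS")

## mean parameters from Step 1 and Step 3, side by side
cbind(poisson = coef(fit_pois), joint = step3$par[1:2])
```

In this simple setting the two columns should agree closely, because the joint optimizer starts essentially at the optimum for the mean parameters and only has to confirm the dispersion estimate.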

A fully worked out example is shown below (note that the data used in the NAR paper, source code, manual that includes instructions to compile the source code of the RNASeqGAMLSS and Win64 DLL libraries are all available in the BitBucket repository for this project)

First, we load the data, the C++ libraries and extract the data to the two groups we would like to compare :

library(TMB) ## the TMB framework for optimization
## load the DE C++ libraries
## Note about the form of data storage for use with this software
## the long format should be employed when storing microRNA data for
## GAMLSS type of analyses: in this format, the data are arranged in columns:
## - column miRNA : yields the name of the miRNA
## - column reads : reads of the miRNA from a *single sample*
## - column SampleID: the sample the reads were recorded at
## this is called long format, because data from experiments are stored in
## different rows (rather than appearing as consecutive columns)
## loads the data from the 286 series
## Obtain data for GAMLSS - we will compare the two ratiometric series
datRat<-subset(dat286.long,(Series=="RatioB" | Series =="RatioA") & Amount=="100 fmoles")

u_X<-as.numeric(factor(datRat$miRNA)) ## maps readings to the identity of the miRNA
u_G<-as.numeric(factor(datRat$Series)) ## maps counts to group
y=datRat$reads ## extract the actual counts
X<-model.matrix(~Series,data=datRat) ## design matrix (ANOVA) for group comparisons
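The long format described in the comments above can be produced from a conventional count table with a few lines of base R. This is only a sketch: the toy counts, the sample names S1 and S2, and the objects wide and long are hypothetical, while the column names miRNA, reads and SampleID follow the comments.

```r
## hypothetical wide table: one row per miRNA, one column per sample
wide <- data.frame(miRNA = c("miR-21", "miR-16", "let-7a"),
                   S1 = c(120, 35, 8),
                   S2 = c(98, 41, 11))
sample_cols <- setdiff(names(wide), "miRNA")

## stack the sample columns: one row per (miRNA, sample) pair
long <- data.frame(
  miRNA    = rep(wide$miRNA, times = length(sample_cols)),
  SampleID = rep(sample_cols, each = nrow(wide)),
  reads    = unlist(wide[sample_cols], use.names = FALSE)
)
long
```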

Secondly, we fit the Poisson model (Step 1), using the facilities of the lme4 R package:

## fit the parameters for the mu submodel using the poisson GLM

Then we extract the values of these parameters and use them to fix the values of the mean submodel (Step 2):

## initializes standard deviation of RE for the mu submodel
sigsig=rep(1,max(u_G)) ## initializes standard deviation of RE for the phi submodel
b=fixef(gl) ## initial values for the overall group means (mean submodel)
## initial values for the variation of miRNAs around their group mean (mean submodel)
## Very rough initial values for the phi submodel parameters
s_b=rep(0,ncol(X)) ## initial values for the overall group means (phi submodel)
## initial values for the variation of miRNAs around their group mean (phi submodel)
u_s= matrix(0,max(u_X),max(u_G))
## MAP list that allow us to fix some parameters to their values
## construct the AD object - note that we fix the mu at their values while estimating the
## phi submodel
## parameter estimation - note errors may be generated during some iterations
## obtain the report on the parameters to extract the fitted values of the gamlss model
u_s = matrix(summary(rep,"random",p.value=FALSE)[,1],ncol=max(u_G))
dummy<-summary(rep,"fixed",p.value=FALSE)[,1] ## parameter estimates

Finally, we refit the model letting all parameters vary:

## scale objective by the magnitude of the deviance of the fitted Poisson model
## obtain the report on the parameters
## differential expression ratios, standard errors z and p-values
## rownames are the miRNAs; columns are estimates, standard error, z value and pvalue

## the final estimates with their standard errors

These steps (and the RNASeqGAMLSS code) are going to be incorporated in an upcoming Bioconductor package for the analysis of short RNA sequencing data by Dr Lorena Pantano PhD. Until this package becomes available, the aforementioned code snippets may be adapted very easily to one’s application by changing the names of the columns corresponding to the RNA species, sample identifiers and experimental groups.

1. This is particularly likely when the underlying software implementations are not compiled against the Intel®Math Kernel Libraries.

Survival Analysis With Generalized Additive Models: Part V (stratified baseline hazards)

May 9, 2015

In the fifth part of this series we will examine the capabilities of Poisson GAMs to stratify the baseline hazard for survival analysis. In a stratified Cox model, the baseline hazard is not the same for all individuals in the study. Rather, it is assumed that the baseline hazard may differ between members of groups, even though it will be the same for members of the same group.

Stratification is one of the ways that one may address the violation of the proportionality assumption for a categorical covariate in the Cox model. The stratified Cox model resolves the overall hazard in the study as:

h_{g}(t,X) = h_{0_{g}}(t)exp(\boldsymbol{x\beta}) ,\quad g=1,2,\dotsc ,g_{K}

In the logarithmic scale, the multiplicative model for the stratified baseline hazard becomes an additive one. In particular, the specification of a different baseline hazard for the different levels of a factor amounts to specifying an interaction between the factor and the smooth baseline hazard in the PGAM.
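Taking logarithms of the stratified model makes this explicit. In the sketch below, \lambda_{k}(t) denotes the smooth log-baseline hazard of stratum k and I(\cdot) an indicator function; both symbols are introduced here for illustration only:

```latex
\log h_{g}(t,X) = \log h_{0_{g}}(t) + \boldsymbol{x\beta}
               = \sum_{k=1}^{g_{K}} I(g=k)\,\lambda_{k}(t) + \boldsymbol{x\beta}
```

Each indicator plays the role of a by variable in a smooth term of the gam call, so the sum is exactly an interaction between the stratification factor and the smooth of time.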

We turn to the PBC dataset to provide an example of a stratified analysis with either the Cox model or the PGAM. In that dataset the covariate edema is a categorical variable assuming the values of 0 (no edema), 0.5 (untreated or successfully treated) and 1 (edema despite treatment). The Schoenfeld residual test shows that this covariate violates the proportionality assumption:

> f<-coxph(Surv(time,status)~trt+age+sex+factor(edema),data=pbc)
> Schoen<-cox.zph(f)
> Schoen
rho chisq p
trt -0.089207 1.12e+00 0.2892
age -0.000198 4.72e-06 0.9983
sexf -0.075377 7.24e-01 0.3950
factor(edema)0.5 -0.202522 5.39e+00 0.0203
factor(edema)1 -0.132244 1.93e+00 0.1651
GLOBAL NA 8.31e+00 0.1400

To fit a stratified GAM model, we should transform the dataset to include additional variables, one for each level of the edema covariate. To make the PGAM directly comparable to the stratified Cox model, we have to fit the former without an intercept term. This requires that we include additional dummy variables for any categorical covariates that we would like to adjust our model for. In this particular case, the only other categorical covariate is sex, coded as an indicator for female gender:
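A minimal sketch of this transformation on a toy data frame is given below. The indicator names edema0, edema05, edema1 and sexf are the ones that appear in the PGAM formula printed further down; the toy values and the way the columns are built here are assumptions.

```r
## toy stand-in for two columns of the expanded dataset (hypothetical values)
toy <- data.frame(edema = c(0, 0.5, 1, 0, 0.5),
                  sex   = factor(c("f", "m", "f", "f", "m")))

## one 0/1 indicator per edema level, plus an indicator for female sex
toy$edema0  <- as.numeric(toy$edema == 0)
toy$edema05 <- as.numeric(toy$edema == 0.5)
toy$edema1  <- as.numeric(toy$edema == 1)
toy$sexf    <- as.numeric(toy$sex == "f")
toy
```

Every row has exactly one of the three edema indicators switched on, which is what allows each stratum to receive its own smooth of time.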



Then the stratified Cox and PGAM models are fit as:
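The calls can be read off the model formulas echoed in the output further down. The following is a sketch of the Cox part only, assuming the survival package and its bundled pbc data, with status recoded as in Part II (pbc2 is a hypothetical name). The PGAM would be fit by passing the formula printed by summary(fGAM) to gam on the expanded dataset.

```r
library(survival)  # provides coxph(), Surv(), strata() and the pbc data

## recode status as in Part II: transplant (1) is censored, death (2) is the event
pbc2 <- pbc
pbc2$status <- as.numeric(pbc2$status == 2)

## a separate, unspecified baseline hazard for each level of edema
fs <- coxph(Surv(time, status) ~ trt + age + sex + strata(edema), data = pbc2)
fs
```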



In general, the coefficient estimates of the stratified Cox and PGAM models are similar, with the exception of the trt variable. However, the standard error of this coefficient in either model is so large that neither estimate is statistically different from zero, despite their difference in magnitude:

> fs
coxph(formula = Surv(time, status) ~ trt + age + sex + strata(edema), 
 data = pbc)

 coef exp(coef) se(coef) z p
trt 0.0336 1.034 0.18724 0.18 0.86000
age 0.0322 1.033 0.00923 3.49 0.00048
sexf -0.3067 0.736 0.24314 -1.26 0.21000

Likelihood ratio test=15.8 on 3 df, p=0.00126 n= 312, number of events= 125 
 (106 observations deleted due to missingness)
> summary(fGAM)

Family: poisson 
Link function: log 

gam.ev ~ s(stop, bs = "cr", by = edema0) + s(stop, bs = "cr", 
 by = edema05) + s(stop, bs = "cr", by = edema1) + trt + age + 
 sexf + offset(log(gam.dur)) - 1

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
trt 0.002396 0.187104 0.013 0.989782 
age 0.033280 0.009170 3.629 0.000284 ***
sexf -0.297481 0.240578 -1.237 0.216262 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop):edema0 2.001 2.003 242.0 <2e-16 ***
s(stop):edema05 2.001 2.001 166.3 <2e-16 ***
s(stop):edema1 2.000 2.001 124.4 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

R-sq.(adj) = -0.146 Deviance explained = -78.4%
REML score = 843.96 Scale est. = 1 n = 3120

Survival Analysis With Generalized Additive Models : Part IV (the survival function)

May 3, 2015

The ability of PGAMs to estimate the log-baseline hazard rate endows them with the capability to be used as smooth alternatives to the Kaplan Meier curve. If we assume for the sake of simplicity that there are no proportional covariates in the PGAM regression, then the quantity modeled corresponds to the log-hazard of the survival function. Note that the only assumptions made by the PGAM are that the log-hazard is a) a smooth function, with b) a given maximum complexity (number of degrees of freedom) and c) continuous second derivatives. A PGAM provides estimates of the log-hazard constant, \beta_{0}, and the time-varying deviation, \lambda(t_{i,j}). These can be used to predict the value of the survival function, S(t), by approximating the integral appearing in the definition of S(t) by numerical quadrature.

S(t_{i})=\exp\left(-\int_{0}^{t_{i}}h(t)\mathrm{\, d}t\right)\approx\exp\left(-\sum_{j=1}^{N_{i}}w_{i,j}\exp(\beta_{0}+\lambda(t_{i,j}))\right)
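As a quick numerical check of this approximation, the sketch below evaluates the quadrature sum for a hypothetical log-hazard with a known closed-form survival function. The 5-point Gauss-Lobatto nodes and weights are the ones listed in Part II; beta0, lambda, surv_quad and surv_exact are names invented for the example.

```r
## 5-point Gauss-Lobatto rule on [-1, 1] (nodes and weights as in Part II)
gl5 <- list(x = c(-1, -sqrt(3 / 7), 0, sqrt(3 / 7), 1),
            w = c(1 / 10, 49 / 90, 32 / 45, 49 / 90, 1 / 10))

## hypothetical log-hazard: constant beta0 plus a linear deviation in time
beta0  <- log(0.001)
lambda <- function(t) 0.0005 * t

surv_quad <- function(t, rule) {
  nodes   <- 0.5 * t * (rule$x + 1)   # map nodes from [-1, 1] to [0, t]
  weights <- 0.5 * t * rule$w         # rescale the weights accordingly
  exp(-sum(weights * exp(beta0 + lambda(nodes))))
}

## exact survival function for this hazard, for comparison
surv_exact <- function(t) exp(-exp(beta0) * (exp(0.0005 * t) - 1) / 0.0005)

c(quadrature = surv_quad(1000, gl5), exact = surv_exact(1000))
```

For a hazard this smooth, five nodes already reproduce the exact value to many decimal places.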

From the above definition it is obvious that the value of the survival distribution at any given time point is a non-linear function of the PGAM estimate. Consequently, the predicted survival value, S_{pred}(t), cannot be derived in closed form; as with all non-linear PGAM estimates, a simple Monte Carlo simulation algorithm may be used to derive both the expected value of \hat{S}_{pred}(t) and its uncertainty. For the case of the survival function, the simulation steps are provided in Appendix (Section A3) of our paper. The following R function can be used to predict the survival function and an associated confidence interval at a grid of points. It accepts as arguments a) the vector of time points, b) a PGAM object for the fitted log-hazard function, c) a list with the nodes and weights of a Gauss-Lobatto rule for the integration of the predicted survival, d) the number of Monte Carlo samples to obtain and optionally e) the seed of the random number generation. Of note, the order of the quadrature used to predict the survival function is not necessarily the same as the order used to fit the log-hazard function.

## Calculate survival and confidence interval over a grid of points
## using a GAM
##         t : time at which to calculate relative risk
##        gm : gam model for the fit
##   gl.rule : GL rule (list of weights and nodes)
##        CI : CI to apply
##      Nsim : Number of replicates to draw
##      seed : RNG seed
## create the nonlinear contrast
start<-0; ## only for right cens data
## map the weights from [-1,1] to [start,t]
## expand the dataset
## linear predictor at each node
Xp <- predict(gm,newdata=df,type="lpmatrix")
## draw samples
br <- rmvn(Nsim,coef(gm),gm$Vp)
for(i in 1:Nsim){
## hazard function at the nodes
## cumulative hazard
chz1<-gl.rule$w %*% hz[1:L,]


The function makes use of another function, Survdataset, that expands internally the vector of time points into a survival dataset. This dataset is used to obtain predictions of the log-hazard function by calling the predict function from the mgcv package.

## Function that expands a prediction dataset
## so that a GL rule may be applied
## Used in num integration when generating measures of effect
## GL  : Gauss Lobatto rule
## data: survival data
##   fu: column number containing fu info

## append artificial ID in the set
## Change the final indicator to what
## was observed, map node positions,
## weights from [-1,1] back to the
## study time
## now merge the remaining covariate info


The ability to draw samples from the multivariate normal distribution corresponding to the model estimates and its covariance matrix is provided by another function, rmvn:

## function that draws multivariate normal random variates with
## a given mean vector and covariance matrix
##    n : number of samples to draw
##   mu : mean vector
##  sig : covariance matrix
rmvn <- function(n,mu,sig) { ## MVN random deviates
L <- mroot(sig);m <- ncol(L); ## mroot (matrix square root) is provided by mgcv
t(mu + L%*%matrix(rnorm(m*n),m,n))
}
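The Monte Carlo step can also be tried in isolation, without a fitted model. The sketch below assumes a hypothetical two-coefficient log-hazard (an intercept plus a linear term in time) with a made-up covariance matrix Vp, and draws the multivariate normal samples with a Cholesky factor from base R rather than with mroot.

```r
set.seed(42)
## hypothetical estimates: log h(t) = b[1] + b[2] * t, with covariance Vp
b_hat <- c(log(0.001), 0.0005)
Vp    <- matrix(c(0.04, -2e-5, -2e-5, 4e-8), 2, 2)

## 5-point Gauss-Lobatto rule mapped from [-1, 1] to [0, t_pred]
t_pred  <- 1000
x       <- c(-1, -sqrt(3 / 7), 0, sqrt(3 / 7), 1)
w       <- c(1 / 10, 49 / 90, 32 / 45, 49 / 90, 1 / 10)
nodes   <- 0.5 * t_pred * (x + 1)
weights <- 0.5 * t_pred * w
Xp      <- cbind(1, nodes)            # linear predictor matrix at the nodes

## draw coefficient vectors from N(b_hat, Vp); rows are samples
Nsim <- 1000
br   <- matrix(rnorm(Nsim * 2), Nsim, 2) %*% chol(Vp)
br   <- sweep(br, 2, b_hat, "+")

## survival for each draw: exp(-sum of weights * hazard at the nodes)
S_sim <- apply(br, 1, function(b) exp(-sum(weights * exp(drop(Xp %*% b)))))

c(S   = mean(S_sim),
  LCI = unname(quantile(S_sim, 0.025)),
  UCI = unname(quantile(S_sim, 0.975)))
```

The mean of the simulated values serves as the point estimate and the empirical quantiles as a pointwise 95% interval, mirroring the roles of Nsim and CI in the function arguments described earlier.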

To illustrate the use of these functions we revisit the PBC example from the 2nd part of this blog series. Firstly, let’s obtain a few Gauss-Lobatto lists of weights/nodes for the integration of the survival function:

## Obtain a few Gauss Lobatto rules to integrate the survival
## distribution

Subsequently, we fit the log-hazard rate to the coarsely (5 nodes) and more finely discretized (using a 10 point Gauss Lobatto rule) versions of the PBC dataset, created in Part 2. The third command obtains the Kaplan Meier estimate in the PBC dataset.



We obtained survival probability estimates for the 6 combinations of time discretization for fitting (either a 5 or 10th order Lobatto rule) and prediction (a 5th, 10th or 20th order rule):


In all cases 1000 Monte Carlo samples were obtained for the calculation of survival probability estimates and their pointwise 95% confidence intervals. We can plot these against the Kaplan Meier curve estimates:

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL5)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL10)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL20)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL5)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL10)")

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL20)")

Survival probability estimates: Kaplan Meier curve (black) v.s. the PGAM estimates for different orders of Gauss Lobatto (GL) quadrature

Overall, there is close agreement between the Kaplan Meier estimate and the PGAM estimates, despite the different function spaces in which the corresponding estimators “live”: the space of all piecewise constant functions (KM) vs. that of the smooth functions with bounded, continuous second derivatives (PGAM). Furthermore, the 95% confidence interval of each estimator (dashed lines) contains the expected value of the other estimator. This suggests that there is no systematic difference between the KM and the PGAM survival estimators. This was confirmed in simulated datasets (see Fig 2 in our PLoS One paper).

Survival Analysis With Generalized Additive Models : Part III (the baseline hazard)

May 2, 2015

In the third part of the series on survival analysis with GAMs we will review the use of the baseline hazard estimates provided by this regression model. In contrast to the Cox model, the log-baseline hazard is estimated along with other quantities (e.g. the log hazard ratios) by the Poisson GAM (PGAM) as:

log(h(t_{i,j})) = \beta_{0}+\lambda(t_{i,j})+\boldsymbol{x\beta} = \lambda_{I}(t_{i,j})+\boldsymbol{x\beta}

In the aforementioned expression, the baseline hazard is equivalently modeled as a time-varying deviation (\lambda(t) ) from a constant (the intercept \beta_{0}) , or as a time-varying function (\lambda_{I}(t) ). In the latter case, the constant is absorbed into the smooth term. The choice between these equivalent forms is dictated by the application at hand; in particular, the intercept may be switched on or off by centering the smooth terms appearing in the call to the gam function. Hence, in the PGAM formulation the log-baseline hazard is yet another covariate that one estimates by a smooth function; other covariates may modify this hazard in a proportional fashion by additively shifting the log-baseline hazard (\boldsymbol{x\beta}).

In the “standard” way of fitting a PGAM by mgcv, the log-baseline hazard is estimated in the constant+deviation form. Exponentiation may be used to derive the baseline hazard and its standard errors. Continuing the analysis of the Primary Biliary Cirrhosis example from the second part of the series, we may write:

plot(fGAM,main="Gauss Lobatto (5)",ylab="log-basehaz")
plot(fGAM2,main="Gauss Lobatto (10)",ylab="log-basehaz")
plot(fGAM,main="Gauss Lobatto (5)",ylab="basehaz",trans=exp)
plot(fGAM2,main="Gauss Lobatto (10)",ylab="basehaz",trans=exp)

Log Baseline (top row) and Baseline (second row) hazard function in the PBC dataset for two different discretizations of the data. In all these cases, the baseline hazard (or its log) are given as time varying deviations from a constant (the value of the log-hazard where the confidence interval vanishes)

There is no substantial difference in the estimates obtained by the coarse (Gauss Lobatto (5)) and finer (Gauss Lobatto (10)) discretization. Note that as a result of fitting the log-hazard as constant + time-varying deviation, the standard error of the curve vanishes at ~1050: the value of the log-hazard at that instant in events per unit time is provided by the intercept term.

Estimation of the log-baseline hazard allows the PGAM to function as a parametric, smooth alternative to the Kaplan Meier estimator. This will be examined in the fourth part of this series.


Survival Analysis With Generalized Additive Models: Part II (time discretization, hazard rate integration and calculation of hazard ratios)

May 2, 2015

In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible.

Consider a set of M individual observations at times \mathcal{F}=\left\{ F_{i}\right\} _{i=1}^{M} , with censoring indicators \mathcal{D}=\left\{ \delta_{i}\right\} _{i=1}^{M} assuming the value of 0 if the corresponding observation was censored and 1 otherwise. Under the assumption of non-informative censoring, the likelihood of the sample is given by:

L=\prod_{i=1}^{M}f(F_{i})^{\delta_{i}}S(F_{i})^{1-\delta_{i}}=    \prod_{i=1}^{M}h(F_{i})^{\delta_{i}}\exp\left(-\int_{0}^{F_{i}}h(t)\mathrm{\, d}t\right)

where h(t) is the hazard function. By using an interpolatory quadrature rule, one may substitute the integral with a weighted sum evaluated at a distinct number of nodes.
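Concretely, with N_{i} nodes for the i-th observation (the notation used for the survival function in Part IV), the approximate likelihood can be written as follows. This display is a reconstruction from the definitions in the surrounding text rather than a quotation from the paper:

```latex
L \approx \prod_{i=1}^{M}\prod_{j=1}^{N_{i}} h(t_{i,j})^{d_{i,j}}\exp\left(-w_{i,j}\,h(t_{i,j})\right)
```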


where t_{i,j}, w_{i,j}  are the nodes, weights of the integration rule and d_{i,j} is an indicator variable equal to 1 if the corresponding node corresponds to an event time and zero otherwise.  By including additional “pseudo-observations” at the nodes of the quadrature rule, we converted the survival likelihood to the kernel of a Poisson regression with variable exposures (weights).  Conditional on the adoption of an efficient quadrature rule, this is a highly accurate approximation:

(A) Relationship between (log) MST and the logarithm of the Maximum Hazard rate function for survival distributions with a cubic polynomial log baseline hazard function (B) Box plots of the GL error as a function of the number of nodes in the quadrature rule (C) GL error as a function of the length of the integration interval (taken to be equal to the MST for each distribution examined) for different orders of the quadrature rule (D) GL error as a function of the maximum value of the hazard rate for different orders of the quadrature rule.

Bounds of the Gauss Lobatto (GL) approximation error for the integration of survival data (MST=Mean Survival Time).

In order for the construct to work one has to ensure that the corresponding lifetimes are mapped to a node of the integration scheme. In our paper, this was accomplished by the adoption of the Gauss-Lobatto rule. The nodes and weights of the Gauss-Lobatto rule (which is defined in the interval [-1,1]) depend on the Legendre polynomials in a complex way. The following R function will calculate the weights and nodes for the N-th order Gauss Lobatto rule:

 while (max(abs(x-xold))>2.22044604925031e-16) {
 for (k in 2:N) {
 P[,k+1]=( (2*k-1)*x*P[,k]-(k-1)*P[,k-1] )/k;
 x<-xold-( x*P[,N1]-P[,N] )/( N1*P[,N1] )


which can be called to return a list of the nodes and their weights:

> GaussLobatto(5)
[1] -1.0000000 -0.6546537 0.0000000 0.6546537 1.0000000

[1] 0.1000000 0.5444444 0.7111111 0.5444444 0.1000000

[1] 4
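A self-contained sketch of such a function is given below. It uses the same ingredients as the fragment above (the Legendre three-term recurrence and a Newton update), but the name gauss_lobatto_sketch, the max_iter safeguard and the decision to return the order as a list element are assumptions made for this illustration.

```r
## Sketch: nodes and weights of the n-point Gauss-Lobatto rule on [-1, 1]
gauss_lobatto_sketch <- function(n, max_iter = 100) {
  stopifnot(n >= 2)
  N <- n - 1                           # polynomial order
  x <- cos(pi * (0:N) / N)             # Chebyshev-Gauss-Lobatto starting values
  P <- matrix(0, n, n)                 # column k + 1 holds Legendre P_k at x
  for (iter in seq_len(max_iter)) {
    xold <- x
    P[, 1] <- 1
    P[, 2] <- x
    for (k in seq_len(N - 1) + 1) {    # three-term recurrence for k = 2..N
      P[, k + 1] <- ((2 * k - 1) * x * P[, k] - (k - 1) * P[, k - 1]) / k
    }
    ## Newton update towards the Gauss-Lobatto nodes
    x <- xold - (x * P[, n] - P[, N]) / (n * P[, n])
    if (max(abs(x - xold)) <= .Machine$double.eps) break
  }
  w <- 2 / (N * n * P[, n]^2)          # weights from P_N at the nodes
  list(x = rev(x), w = rev(w), order = N)
}

gauss_lobatto_sketch(5)
```

For n = 5 the nodes are 0, plus or minus sqrt(3/7) and plus or minus 1, with weights 32/45, 49/90 and 1/10, in agreement with the output shown above.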

To prepare a survival dataset for GAM fitting, one needs to call this function to obtain a Gauss Lobatto rule of the required order. Once this has been obtained, the following R function will expand the (right-censored) dataset to include the pseudo-observations at the nodes of the quadrature rule:

## GL : Gauss Lobatto rule
## data: survival data
## fu: column number containing fu info
## d: column number with event indicator
 ## append artificial ID in the set
 ## Change the final indicator to what
 ## was observed, map node positions,
 ## weights from [-1,1] back to the
 ## study time
 gam.ev=as.numeric((gam.ev | ev)*I(stop==1)),
 ## now merge the remaining covariate info
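The expansion can be checked on a toy example with base R alone. The sketch below does not reproduce GAMSurvdataset; it builds the pseudo-observations by hand for five hypothetical patients and then fits a constant-hazard Poisson regression, for which the answer is known in closed form (events divided by total follow-up). The column names stop, gam.dur and gam.ev mirror those used in this post.

```r
## toy right-censored data (hypothetical values)
toy <- data.frame(time   = c(400, 4500, 1012, 1925, 1504),
                  status = c(1, 0, 1, 1, 0))

## 5-point Gauss-Lobatto rule on [-1, 1]
x <- c(-1, -sqrt(3 / 7), 0, sqrt(3 / 7), 1)
w <- c(1 / 10, 49 / 90, 32 / 45, 49 / 90, 1 / 10)
L <- length(x)

## one pseudo-observation per (individual, node), mapped to [0, time]
expanded <- data.frame(
  id      = rep(seq_len(nrow(toy)), each = L),
  stop    = 0.5 * rep(toy$time, each = L) * (rep(x, nrow(toy)) + 1),
  gam.dur = 0.5 * rep(toy$time, each = L) * rep(w, nrow(toy))
)
## the event indicator is switched on only at the last node of each individual
expanded$gam.ev <- as.numeric(rep(x, nrow(toy)) == 1) * rep(toy$status, each = L)

## constant-hazard Poisson regression on the pseudo-observations
fit <- glm(gam.ev ~ 1 + offset(log(gam.dur)), family = poisson, data = expanded)
c(poisson         = unname(exp(coef(fit))),
  events_per_time = sum(toy$status) / sum(toy$time))
```

The two numbers coincide because the rescaled weights of each individual add up to that individual's follow-up time, so the Poisson exposure equals the total time at risk.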

We illustrate the use of these functions on the Primary Biliary Cirrhosis dataset that comes with R:

> ## Change transplant to alive
> pbc$status[pbc$status==1]<-0
> ## Change event code of death(=2) to 1
> pbc$status[pbc$status==2]<-1
> head(pbc)
 id time status trt age sex ascites hepato spiders edema bili chol albumin copper
1 1 400 1 1 58.76523 f 1 1 1 1.0 14.5 261 2.60 156
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302 4.14 54
3 3 1012 1 1 70.07255 m 0 0 0 0.5 1.4 176 3.48 210
4 4 1925 1 1 54.74059 f 0 1 1 0.5 1.8 244 2.54 64
5 5 1504 0 2 38.10541 f 0 1 1 0.0 3.4 279 3.53 143
6 6 2503 1 2 66.25873 f 0 1 0 0.0 0.8 248 3.98 50
 alk.phos ast trig platelet protime stage
1 1718.0 137.95 172 190 12.2 4
2 7394.8 113.52 88 221 10.6 3
3 516.0 96.10 55 151 12.0 4
4 6121.8 60.63 92 183 10.3 4
5 671.0 113.15 72 136 10.9 3
6 944.0 93.00 63 NA 11.0 3
> GL<-GaussLobatto(5)
> pbcGAM<-GAMSurvdataset(GL,pbc,2,3)
> head(pbcGAM)
 id stop gam.dur t ev gam.ev start trt age sex ascites hepato spiders
1 1 0.00000 20.0000 400 1 0 0 1 58.76523 f 1 1 1
2 1 69.06927 108.8889 400 1 0 0 1 58.76523 f 1 1 1
3 1 200.00000 142.2222 400 1 0 0 1 58.76523 f 1 1 1
4 1 330.93073 108.8889 400 1 0 0 1 58.76523 f 1 1 1
5 1 400.00000 20.0000 400 1 1 0 1 58.76523 f 1 1 1
6 2 0.00000 225.0000 4500 0 0 0 1 56.44627 f 0 1 1
 edema bili chol albumin copper alk.phos ast trig platelet protime stage
1 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
2 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
3 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
4 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
5 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
6 0 1.1 302 4.14 54 7394.8 113.52 88 221 10.6 3
> dim(pbc)
[1] 418 20
> dim(pbcGAM)
[1] 2090 24

The original (pbc) dataset has been expanded to include the pseudo-observations at the nodes of the Lobatto rule. There are multiple records (5 per individual in this particular case) as can be seen by examining the data for the first patient (id=1). The corresponding times are found in the variable stop, their associated weights in the variable gam.dur and the event indicators are in the column gam.ev. Note that nodes and weights are expressed on the scale of the survival dataset, not in the scale of the Lobatto rule ([-1,1]). To fit the survival dataset one needs to load the mgcv package and fit a Poisson GAM, using a flexible (penalized spline) for the log-hazard rate function.

The following code will obtain an adjusted (for age and sex) hazard ratio using the PGAM or the Cox model:

> library(survival) ## for coxph
> library(mgcv) ## for gam
> ## Prop Hazards Modeling with PGAM
> fGAM<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM,family="poisson",scale=1,method="REML")
> ## Your Cox Model here
> f<-coxph(Surv(time,status)~trt+age+sex,data=pbc)
> summary(fGAM)

Family: poisson 
Link function: log

gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345236 0.655176 -15.790 < 2e-16 ***
trt 0.069546 0.181779 0.383 0.702 
age 0.038488 0.008968 4.292 1.77e-05 ***
sexf -0.370260 0.237726 -1.558 0.119 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.008 1.015 4.186 0.0417 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.249 Deviance explained = 2.25%
-REML = 693.66 Scale est. = 1 n = 1560
> f
coxph(formula = Surv(time, status) ~ trt + age + sex, data = pbc)

 coef exp(coef) se(coef) z p
trt 0.0626 1.065 0.182 0.344 7.3e-01
age 0.0388 1.040 0.009 4.316 1.6e-05
sexf -0.3377 0.713 0.239 -1.414 1.6e-01

Likelihood ratio test=22.5 on 3 df, p=5.05e-05 n= 312, number of events= 125 
 (106 observations deleted due to missingness)

The estimates of the log-hazard ratios for the three covariates (trt, age, and female gender) are numerically very close. Any numerical differences reflect the different assumptions made about the baseline hazard: flexible spline (PGAM) vs. piecewise exponential (Cox).

Increasing the number of nodes of the Lobatto rule does not materially affect the estimates of the PGAM:

> GL10<-GaussLobatto(10) ## 10-point rule: 312 x 10 = 3120 records enter the fit
> pbcGAM2<-GAMSurvdataset(GL10,pbc,2,3)
> fGAM2<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM2,family="poisson",scale=1,method="REML")
> summary(fGAM2)

Family: poisson 
Link function: log

gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345288 0.655177 -15.790 < 2e-16 ***
trt 0.069553 0.181780 0.383 0.702 
age 0.038487 0.008968 4.292 1.77e-05 ***
sexf -0.370340 0.237723 -1.558 0.119 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.003 1.005 4.163 0.0416 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.124 Deviance explained = 1.7%
-REML = 881.67 Scale est. = 1 n = 3120

Nevertheless, the estimates of the “baseline log-hazard” become more accurate (decreased standard errors and significance of the smooth term) as the number of nodes increases.

In simulations (see Fig 3) we show that the estimates of the hazard ratio generated by the GAM are comparable in bias, variance and coverage to those obtained by the Cox model. Even though this is an important benchmark for the proposed method, it does not provide a compelling reason for replacing the Cox model with the PGAM. In fact, the advantages of the PGAM will only become apparent once we consider contexts which depend on the baseline hazard function, or problems in which the proportionality of hazards assumption is violated. So stay tuned.

Survival Analysis With Generalized Additive Models : Part I (background and rationale)

May 1, 2015

After a really long break, I will resume my blogging activity. It is actually a full circle for me, since one of the first posts that kick-started this blog matured enough to be published in a peer-reviewed journal last week. In the next few posts I will use the R code included to demonstrate the survival fitting capabilities of Generalized Additive Models (GAMs) in real world datasets. The first post in this series will summarize the background, rationale and expected benefits to be realized by adopting GAMs for survival analysis.

In a nutshell, the basic ideas of the GAM approach to survival analysis are the following:

  1. One approximates the integral defining the survival function as a discrete sum using a quadrature rule
  2. One evaluates the likelihood at the nodes of the aforementioned quadrature rule
  3. A regression model is postulated for the log-hazard rate function
  4. As a result of 1-3 the survival regression problem is transformed into a Poisson regression one
  5. If penalized regression is used to fit the regression model, then GAM fitting software may be used for survival analysis
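
Steps 1-4 can be illustrated for a single subject with a small sketch. It is written in Python rather than R and is not the code from the PLoS paper. It swaps in a plain trapezoidal rule for the quadrature, and the function names (expand_to_poisson, poisson_loglik, survival_loglik) are hypothetical. The point is only that the survival log-likelihood, d·log h(t) minus the integrated hazard, matches a Poisson log-likelihood on pseudo-observations placed at the quadrature nodes, up to a constant that does not involve the model parameters.

```python
import math

def expand_to_poisson(time, event, n_intervals=8):
    """Expand one (time, event) record into Poisson pseudo-observations.

    Uses a trapezoidal rule on [0, time] (an assumption made for this sketch).
    Each row is (node, weight, count); only the last node, which sits at the
    observed time, carries the event indicator.
    """
    width = time / n_intervals
    rows = []
    for j in range(n_intervals + 1):
        weight = width / 2 if j in (0, n_intervals) else width
        count = event if j == n_intervals else 0
        rows.append((j * width, weight, count))
    return rows

def poisson_loglik(rows, hazard):
    """Poisson log-likelihood with mean weight * hazard(node) at each node.

    The log(count!) term is dropped; it is zero for counts of 0 or 1.
    """
    total = 0.0
    for node, weight, count in rows:
        mu = weight * hazard(node)
        total += count * math.log(mu) - mu
    return total

def survival_loglik(time, event, hazard, cumulative_hazard):
    """Exact survival log-likelihood for a right censored observation."""
    return event * math.log(hazard(time)) - cumulative_hazard(time)

# Constant hazard of 0.5 per unit time, event observed at t = 2
rows = expand_to_poisson(2.0, 1)
constant = math.log(rows[-1][1])  # event * log(last weight), parameter free
difference = (poisson_loglik(rows, lambda t: 0.5) - constant
              - survival_loglik(2.0, 1, lambda t: 0.5, lambda t: 0.5 * t))
print(abs(difference) < 1e-12)
```

For a constant hazard the trapezoidal rule is exact, so the two log-likelihoods agree to rounding error. For a hazard that varies with time the agreement improves as the number of nodes grows.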

Ideas along the lines 1-4 have been re-surfacing in the literature ever since the Proportional Hazards Model was described. The mathematical derivations justifying Steps 1-4 are straightforward to follow and are detailed in the PLoS paper. The corresponding derivations for the Cox model are also described in a previous post.

Developments such as 1-4 were important in the first 10 years of the Cox model, since there were no off-the-shelf implementations of the partial (profile) likelihood approach. This limited the practical scope of proportional hazards modeling and set off a parallel computational line of research in how one could use other statistical software libraries to fit the Cox model. In fact, the first application known to the author of a proportional hazards model for the analysis of a National Institutes of Health (NIH) randomized controlled trial used software implementing a Poisson regression to calculate the hazard ratio. The trial was the NCDS trial that examined adequacy indices for the hemodialysis prescription (the description of the software was published 6 months prior to the clinical paper). Many of these efforts were computationally demanding and died off after the late 80s, as the Cox model was implemented in the various statistical packages and semi-parametric theory took off and provided a post-hoc justification for many of the nuances implicit in the Cox model. Nevertheless, one can argue that in the era of the modern computer, no one really needs the Cox model. This technical report and the author’s work on a real world, complex dataset provide the personal background for my research on GAM approaches for survival data.

The GAM (or Poisson GAM, PGAM as called in the paper) is an extension of these old ideas (see the literature survey here and here). In particular, PGAM models the quantities that are modeled semi-parametrically (e.g. the baseline hazard) in the Cox model with parametric, flexible functions that are estimated by penalized regression. One of the first applications of penalized regression for survival analysis is the Fine and Gray spline model, which is however not a PGAM model. There are specific benefits to be realized from penalizing the Poisson regression and adopting GAMs in the context of survival analysis:

  • Parsimony: degrees of freedom are not wasted as penalization will seek the most parsimonious representation (fewer degrees of freedom) among the many possible functions that may fit the data
  • Reduction of the analyst degrees of freedom: the shape of the functional relationships between survival and outcome are learned from the data. This limits the potential of someone imposing a specific form for this relationship (e.g. linear vs. quadratic) and running away with the most convenient p-value
  • Multiple time scale modelling: one can model more than one time scale in a dataset (i.e. not just the study time). This is useful when adjusting for secular trends in an observational dataset or even in a randomized trial. In particular cluster randomized trials at the community level may be immune to secular trends
  • Non-proportional hazards modeling: when the hazards are not proportional, the Cox model is not applicable. Many oncology datasets will demonstrate a deviation from proportionality (in fact we re-analyzed such a trial in the PLoS paper). For a particular dataset, one would like to know whether the proportionality assumption is violated, and if so one would like to “adjust” for it. Such an adjustment will take the form of a time-varying hazard ratio function and these may be estimated with the PGAMs. In such a case, one can even extract an “average” hazard ratio while still estimating a time-varying deviation around it using the PGAM. However, non-proportionality should shift the analyst to:
  • Alternative measures of treatment effect: These may include relative risks, absolute risk differences or even differences in the (Restricted) Mean Survival Time. Such measures are calculated from the time varying hazard ratios using statistical simulation techniques
  • Handling of correlated outcomes: Correlated outcomes may arise from center effects, multiple events in the same individual or even cluster randomization designs. The analysis of such outcomes is facilitated by the interpretation of the PGAM as a generalized linear mixed model and the introduction of the corresponding random effects and their correlation structure into the regression
  • Center effects: A variety of modeling options are available including stratified hazards, fixed or random effects
  • Subgroup analyses
  • Time varying external covariates
  • Unconditional/Conditional/Population averaged effects: The unconditional estimate is obtained by indexing the individuals with the group they are assigned to (e.g. placebo or drug in an interventional trial). The conditional estimate is obtained by introducing covariates (e.g. gender, age) into the regression to calculate effects for individuals that share these characteristics. The population effect averages over the conditional effects over all the individuals in the trial. In epidemiology it is known as the corrected group prognosis method. This was introduced in a JAMA paper almost 15 years ago as a way to generate adjusted survival probability curves
  • Handling of right censored/left truncated/uncensored data
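
To make the “alternative measures of treatment effect” item more concrete, here is a minimal Python sketch of how a Restricted Mean Survival Time (RMST) follows from a hazard function. It is not the simulation-based procedure of the paper. It integrates a single, fixed hazard curve with the trapezoidal rule and ignores estimation uncertainty. The function rmst and the two hazard functions are hypothetical and chosen only for illustration.

```python
import math

def rmst(hazard, tau, n_steps=2000):
    """Restricted mean survival time: the area under S(t) on [0, tau].

    S(t) = exp(-cumulative hazard); both integrals use the trapezoidal rule.
    """
    dt = tau / n_steps
    cum_hazard = 0.0
    surv_prev = 1.0
    h_prev = hazard(0.0)
    area = 0.0
    for k in range(1, n_steps + 1):
        h_now = hazard(k * dt)
        cum_hazard += 0.5 * (h_prev + h_now) * dt
        surv_now = math.exp(-cum_hazard)
        area += 0.5 * (surv_prev + surv_now) * dt
        surv_prev, h_prev = surv_now, h_now
    return area

# Hypothetical example: constant control hazard and a treatment effect
# whose hazard ratio drifts from 0.5 at time 0 towards 1 (non-proportional).
def control_hazard(t):
    return 0.2

def treated_hazard(t):
    hazard_ratio = 1.0 - 0.5 * math.exp(-t)
    return control_hazard(t) * hazard_ratio

tau = 5.0
print(rmst(treated_hazard, tau) - rmst(control_hazard, tau))
```

Because the treated hazard is never above the control hazard in this example, the RMST difference is positive. With a fitted PGAM one would repeat such a calculation over draws of the model coefficients in order to obtain an interval estimate.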

These benefits follow directly from the mixed model equivalence between semi-parametric, penalized regression and Generalized Mixed Linear Models. An excellent survey may be found here, while Simon Wood’s book on the GAM implementation of the mgcv package in R contains a concise presentation of these ideas.

As it stands, the method presented has no software implementation similar to the survival package in R. Even though we provide R code to run the examples in the paper, the need for the various functions may not be entirely clear. Hence the next series of posts will go over the code and the steps required to fit the PGAM using the R programming language.

Extracting standard errors and treatment effects from medical journal tables (powered by R)

November 10, 2013

I decided to start blogging the R code used for some of my statistical posts, so I will start with the meta-analysis posts and move on to more difficult stuff.

As stated previously (here and here) the problem is to convert the reported relative risk (RR, t), 95% confidence interval (t_L, t_U) and p-value (p_v) into estimates for the log-relative risk and its associated standard error for down-stream use (usually meta-analysis). Medical journals are in the bad habit of exponentiating (and rounding) the output of statistical software so that one needs to manipulate the reported estimates in order to recover the output of the statistical software. (more…)
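
The simplest version of this conversion can be sketched in a few lines. This is a Python illustration and not the R code of the post. It assumes the reported interval is a symmetric Wald interval on the log scale and makes no attempt to correct for rounding. The function names are hypothetical.

```python
import math
from statistics import NormalDist

# Normal quantile behind a two-sided 95% interval (about 1.96)
Z975 = NormalDist().inv_cdf(0.975)

def log_rr_and_se_from_ci(rr, lower, upper):
    """Log relative risk and its standard error from RR and its 95% CI.

    Assumes the interval was built as exp(log RR +/- Z975 * SE).
    """
    log_rr = math.log(rr)
    se = (math.log(upper) - math.log(lower)) / (2 * Z975)
    return log_rr, se

def se_from_p_value(rr, p_value):
    """Standard error implied by a two-sided p-value of a Wald test.

    Not usable when p_value equals 1, because the implied z statistic is 0.
    """
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return abs(math.log(rr)) / z

# Made-up reported values: RR 0.80 with 95% CI 0.64 to 1.00
log_rr, se = log_rr_and_se_from_ci(0.80, 0.64, 1.00)
print(log_rr, se)
```

When both the confidence interval and the p-value are reported, the two standard errors can be compared; a large discrepancy points to rounding or to an interval that was not computed on the log scale.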

Page Rev Bayes – we found statistical irregularities in a randomized controlled trial

November 9, 2013

The Bayesian counterpart to the frequentist analysis of the Randomized Controlled Trial is in many aspects more straightforward than the frequentist analysis. One starts with a prior probability about the probability of a patient being assigned to each of the three arms and combines it with the (multinomial) likelihood of observing a given assignment pattern in the 240 patients enrolled in the study. Bayes theorem gives the posterior probability quantifying our belief about the magnitudes of the unknown assignment probabilities. Note that testing the strict equality is bound to lead us straight to the arms of the Lindley paradox so that a different approach is likely to be more fruitful. Specifically, we specify a maximum tolerable threshold for the difference between the maximum and the minimum probability of being assigned to the trial arms (let’s say 1-5%) and we directly calculate the probability for this difference (“probability of foul play”).

In the absence of prior evidence for (or against) foul play we use a non-informative prior in which all possible values of assignment probabilities are equally plausible. This (Dirichlet) prior corresponds to a prior state of knowledge in which three individuals were randomized and all three ended up in different treatment arms. Under this prior, the posterior distribution is itself a Dirichlet distribution with parameters equal to the number of individuals actually assigned to each arm+1. The following R code may then be used to calculate the probability of foul play, as previously defined i.e.


This probability comes down to 0.4% which is numerically close to the frequentist answer, yet with a more intuitive interpretation: based on the observed trial sizes and a numerical tolerance for the maximum tolerable difference in assignment probability the odds for “foul play” are 249:1.
Increasing the tolerance will obviously decrease these odds, but in such a case we would be willing to tolerate larger differences in assignment probabilities. Although these results are mathematically trivial (and non-controversial), the plot will become more convoluted when one proceeds to use them to make a declaration of “foul play”. For in that case, a decision needs to be made which has to consider not only the probability of the uncertain events, “foul play” vs. “not foul play”, but also the consequences for the journal, the study investigators and the scientific community at large. At this level one would need to decide whether the odds of 249:1 are high enough or not for subsequent action to be taken. But this consideration will take us to the realm of decision theory (and it is already 11 pm).
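
The posterior calculation described in this post can be sketched as a Monte Carlo draw from the Dirichlet posterior. This is a Python sketch and not the original R listing. The arm counts in the usage lines are placeholders, not the counts of the trial, so the printed value is not the 0.4% quoted above. The function name is hypothetical. Under the uniform Dirichlet prior the posterior parameters are the arm counts plus one; the function returns the posterior probability that the spread between the largest and the smallest assignment probability stays within the tolerance, and its complement is the probability that the spread exceeds it.

```python
import random

def prob_within_tolerance(arm_counts, tolerance, n_draws=20000, seed=1):
    """Posterior P(max(p) - min(p) <= tolerance) under a uniform Dirichlet prior.

    A Dirichlet draw is obtained by normalizing independent Gamma(count + 1, 1)
    variates, one per trial arm.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        gammas = [rng.gammavariate(n + 1, 1.0) for n in arm_counts]
        total = sum(gammas)
        probs = [g / total for g in gammas]
        if max(probs) - min(probs) <= tolerance:
            hits += 1
    return hits / n_draws

# Placeholder counts for three arms that sum to 240 (NOT the trial's data)
hypothetical_counts = (90, 80, 70)
within = prob_within_tolerance(hypothetical_counts, tolerance=0.05)
print(within, 1 - within)
```

Re-running with a larger tolerance can only increase the first number, which is the sense in which a more lenient tolerance weakens the case for foul play.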

The utility of frequentist statistics (in a single picture)

November 8, 2013

Ted Bunn nailed it


Time to move beyond Laplace and what prior he would have used, had he been alive today.

In the absence of real world data the effectiveness of a clinical intervention is half its efficacy (in a randomized trial)

August 14, 2013

Suppose one is approached by one’s partner with the results of a new intervention that helped X% of N carefully chosen participants in a Randomized Controlled Trial (RCT), with only minimal adverse events (seen in Y% of the N patients). The colleague, a TRUE BELIEVER – champion of innovation and defender of progress against the medical luddites of this world, wants to convince you to implement this new therapy as part of the standard protocol in your common practice. He is thinking that it should be offered to all newcomers, including patients who would not have been eligible to participate in the aforementioned trial. What would a healthy sceptic do? Champion innovation and adopt the new therapy on the spot, or defend tradition and wait? Is it possible to ground the answer in the cold, objective language of math and warm up to/cool down your partner accordingly? (more…)