Survival Analysis With Generalized Models: Part II (time discretization, hazard rate integration and calculation of hazard ratios)

In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible.

Consider a set of sM individual observations at times \mathcal{F}=\left\{ F_{i}\right\} _{i=1}^{M} , with censoring indicators \mathcal{D}=\left\{ \delta_{i}\right\} _{i=1}^{M} assuming the value of 0 if the corresponding observation was censored and 1 otherwise. Under the assumption of non-informative censoring, the likelihood of the sample is given by:

L=\prod_{i=1}^{M}f(F_{i})^{\delta_{i}}S(F_{i})^{1-\delta_{i}}=    \prod_{i=1}^{M}h(F_{i})^{\delta_{i}}\exp\left(-\int_{0}^{F_{i}}h(t)\mathrm{\, d}t\right)

where h(t) is the hazard function. By using an interpolatory quadrature rule, one may substitute the integral with a weighted sum evaluated at a distinct number of nodes.

L=\prod_{i=1}^{M}\prod_{j=1}^{N_{i}}h(t_{i})^{d_{i,j}}\exp\left(-w_{j,j}h(t_{i,j})\right)

where t_{i,j}, w_{i,j}  are the nodes, weights of the integration rule and d_{i,j} is an indicator variable equal to 1 if the corresponding node corresponds to an event time and zero otherwise.  By including additional “pseudo-observations” at the nodes of the quadrature rule, we converted the survival likelihood to the kernel of a Poisson regression with variable exposures (weights).  Conditional on the adoption of an efficient quadrature rule, this is a highly accurate approximation:

A) relationship between (log) MST and the logarithm of the Maximum Hazard rate function for survival distributions with a cubic polynomial log baseline hazard function (B) Box plots of the GL error as a function of the number of nodes in the quadrature rule (C) GL error as a function of the length of the integration interval (taken equal to be equal to the MST for each distribution examined) for different orders of the quadrature rule (D) GL error as a function of the maximum value of the hazard rate for different orders of the quadrature rule.

Bounds of the Gauss Lobatto (GL) approximation error for the integration of survival data (MST=Mean Survival Time).

In order for the construct to work one has to ensure that the corresponding lifetimes are mapped to a node of the integration scheme.  In our paper, this was accomplished by the adoption of the Gauss-Lobatto rule. The nodes and weights of the Gauss-Lobatto rule (which is defined in the interval [-1,1] depend on the Legendre polynomials in a complex way. The following R function will calculate the weights and nodes for the N-th order Gauss Lobatto rule:

GaussLobatto<-function(N)
{
 N1<-N
 N<-N-1
 x=matrix(cos(pi*(0:N)/N),ncol=1)
 x=cos(pi*(0:N)/N)
 P<-matrix(0,N1,N1)
 xold<-2
 while (max(abs(x-xold))>2.22044604925031e-16) {
 xold<-x
 P[,1]<-1 
 P[,2]<-x
 
 for (k in 2:N) {
 P[,k+1]=( (2*k-1)*x*P[,k]-(k-1)*P[,k-1] )/k;
 }
 
 x<-xold-( x*P[,N1]-P[,N] )/( N1*P[,N1] )
 
 }

 w<-2./(N*N1*P[,N1]^2);
 ret<-list(x=rev(x),w=w)
 attr(ret,"order")<-N
 ret
}

which can be called to return a list of the nodes and their weights:

> GaussLobatto(5)
$x
[1] -1.0000000 -0.6546537 0.0000000 0.6546537 1.0000000

$w
[1] 0.1000000 0.5444444 0.7111111 0.5444444 0.1000000

attr(,"order")
[1] 4

To prepare a survival dataset for GAM fitting, one needs to call this function to obtain a Gauss Lobatto rule of the required order. Once this has been obtained, the following R function will expand the (right-censored) dataset to include the pseudo-observations at the nodes of the quadrature rule:


GAMSurvdataset<-function(GL,data,fu,d)
## GL : Gauss Lobatto rule
## data: survival data
## fu: column number containing fu info
## d: column number with event indicator
 {
 ## append artificial ID in the set
 data$id<-1:nrow(data)
 Gllx<-data.frame(stop=rep(GL$x,length(data$id)),
 gam.dur=rep(GL$w,length(data$id)),
 t=rep(data[,fu],each=length(GL$x)),
 ev=rep(data[,d],each=length(GL$x)),
 id=rep(data$id,each=length(GL$x)),
 gam.ev=0,start=0)
 ## Change the final indicator to what
 ## was observed, map node positions,
 ## weights from [-1,1] back to the
 ## study time
 Gllx<-transform(Gllx,
 gam.ev=as.numeric((gam.ev | ev)*I(stop==1)),
 gam.dur=0.5*gam.dur*(t-start),
 stop=0.5*(stop*(t-start)+(t+start)))
 ## now merge the remaining covariate info
 Gllx<-merge(Gllx,data[,-c(fu,d)])
 Gllx
 }

We illustrate the use of these functions on the Primary Biliary Cirrhosis dataset that comes with R:

data(pbc)
> ## Change transplant to alive
> pbc$status[pbc$status==1]<-0
> ## Change event code of death(=2) to 1
> pbc$status[pbc$status==2]<-1
> 
> head(pbc)
 id time status trt age sex ascites hepato spiders edema bili chol albumin copper
1 1 400 1 1 58.76523 f 1 1 1 1.0 14.5 261 2.60 156
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302 4.14 54
3 3 1012 1 1 70.07255 m 0 0 0 0.5 1.4 176 3.48 210
4 4 1925 1 1 54.74059 f 0 1 1 0.5 1.8 244 2.54 64
5 5 1504 0 2 38.10541 f 0 1 1 0.0 3.4 279 3.53 143
6 6 2503 1 2 66.25873 f 0 1 0 0.0 0.8 248 3.98 50
 alk.phos ast trig platelet protime stage
1 1718.0 137.95 172 190 12.2 4
2 7394.8 113.52 88 221 10.6 3
3 516.0 96.10 55 151 12.0 4
4 6121.8 60.63 92 183 10.3 4
5 671.0 113.15 72 136 10.9 3
6 944.0 93.00 63 NA 11.0 3
> 
> GL<-GaussLobatto(5)
> pbcGAM<-GAMSurvdataset(GL,pbc,2,3)
> head(pbcGAM)
 id stop gam.dur t ev gam.ev start trt age sex ascites hepato spiders
1 1 0.00000 20.0000 400 1 0 0 1 58.76523 f 1 1 1
2 1 69.06927 108.8889 400 1 0 0 1 58.76523 f 1 1 1
3 1 200.00000 142.2222 400 1 0 0 1 58.76523 f 1 1 1
4 1 330.93073 108.8889 400 1 0 0 1 58.76523 f 1 1 1
5 1 400.00000 20.0000 400 1 1 0 1 58.76523 f 1 1 1
6 2 0.00000 225.0000 4500 0 0 0 1 56.44627 f 0 1 1
 edema bili chol albumin copper alk.phos ast trig platelet protime stage
1 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
2 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
3 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
4 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
5 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
6 0 1.1 302 4.14 54 7394.8 113.52 88 221 10.6 3
> 
> dim(pbc)
[1] 418 20
> dim(pbcGAM)
[1] 2090 24

The original (pbc) dataset has been expanded to include the pseudo-observations at the nodes of the Lobatto rule. There are multiple records (5 per individual in this particular case) as can be seen by examining the data for the first patient (id=1). The corresponding times are found in the variable stop, their associated weights in the variable gam.dur and the event indicators are in the column gam.ev. Note that nodes and weights are expressed on the scale of the survival dataset, not in the scale of the Lobatto rule ([-1,1]). To fit the survival dataset one needs to load the mgcv package and fit a Poisson GAM, using a flexible (penalized spline) for the log-hazard rate function.

The following code will obtain an adjusted (for age and sex) hazard ratio using the PGAM or the Cox model:


library(survival) ## for coxph
> library(mgcv) ## for mgcv
> 
> ## Prop Hazards Modeling with PGAM
> fGAM<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM,family="poisson",scale=1,method="REML")
> 
> ## Your Cox Model here
> f<-coxph(Surv(time,status)~trt+age+sex,data=pbc)
> 
> 
> summary(fGAM)

Family: poisson 
Link function: log

Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345236 0.655176 -15.790 < 2e-16 ***
trt 0.069546 0.181779 0.383 0.702 
age 0.038488 0.008968 4.292 1.77e-05 ***
sexf -0.370260 0.237726 -1.558 0.119 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.008 1.015 4.186 0.0417 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.249 Deviance explained = 2.25%
-REML = 693.66 Scale est. = 1 n = 1560
> f
Call:
coxph(formula = Surv(time, status) ~ trt + age + sex, data = pbc)


 coef exp(coef) se(coef) z p
trt 0.0626 1.065 0.182 0.344 7.3e-01
age 0.0388 1.040 0.009 4.316 1.6e-05
sexf -0.3377 0.713 0.239 -1.414 1.6e-01

Likelihood ratio test=22.5 on 3 df, p=5.05e-05 n= 312, number of events= 125 
 (106 observations deleted due to missingness)

The estimates for log-hazard ratio of the three covariates (trt, age, and female gender) are numerically very close. Any numerical differences reflect the different assumptions made about the baseline hazard: flexible spline (PGAM) v.s. piecewise exponential (Cox).

Increasing the number of nodes of the Lobatto rule does not materially affect the estimates of the PGAM:


GL<-GaussLobatto(10)
> pbcGAM2<-GAMSurvdataset(GL,pbc,2,3)
> fGAM2<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM2,family="poisson",scale=1,method="REML")
> 
> summary(fGAM2)

Family: poisson 
Link function: log

Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345288 0.655177 -15.790 < 2e-16 ***
trt 0.069553 0.181780 0.383 0.702 
age 0.038487 0.008968 4.292 1.77e-05 ***
sexf -0.370340 0.237723 -1.558 0.119 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.003 1.005 4.163 0.0416 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.124 Deviance explained = 1.7%
-REML = 881.67 Scale est. = 1 n = 3120

Nevertheless, the estimates of the “baseline log-hazard” become more accurate (decreased standard errors and significance of the smooth term) as the number of nodes increases.

In simulations (see Fig 3) we show that the estimates of the hazard ratio generated by the GAM are comparable in bias, variance and coverage to those obtained by the Cox model. Even though this is an importance benchmark for the proposed method, it does not provide a compelling reason for replacing  the Cox model with the PGAM. In fact, the advantages of the PGAM will only become apparent once we consider contexts which depend on the baseline hazard function, or problems in which the proportionality of hazards assumption is violated. So stay tuned.

Advertisements

8 Responses to “Survival Analysis With Generalized Models: Part II (time discretization, hazard rate integration and calculation of hazard ratios)”

  1. Distilled News | Data Analytics & R Says:

    […] Survival Analysis With Generalized Models: Part II (time discretization, hazard rate integration and… In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible. […]

  2. Linda Says:

    Hello. I’ve sent you an email with a couple of questions via LinkedIn. Let me know if that is the best way of reaching you.

    Linda.

  3. Jesus Says:

    Do you have a mistake?. where h(t) is the hazard function. By using an interpolatory quadrature rule, one may substitute the integral with a weighted sum evaluated at a distinct number of nodes.
    It is wij in L, no wjj.

  4. Jesus Says:

    Can you provide bibliography about this method?.

    Thanks for all

    Jesis

  5. Jesus Says:

    I don´t well theorical details. I can use with gam efects for age, for example.

  6. Jesus Says:

    Are Equation 1 equation of paper well?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: