In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible.

Consider a set of *s* individual observations at times , with censoring indicators assuming the value of 0 if the corresponding observation was censored and 1 otherwise. Under the assumption of non-informative censoring, the likelihood of the sample is given by:

where is the hazard function. By using an **interpolatory** quadrature rule, one may substitute the integral with a **weighted sum** evaluated at a distinct number of **nodes.**

where , are the nodes, weights of the integration rule and is an indicator variable equal to 1 if the corresponding node corresponds to an event time and zero otherwise. By including additional “pseudo-observations” at the nodes of the quadrature rule, we converted the survival likelihood to the kernel of a Poisson regression with variable exposures (weights). Conditional on the adoption of an efficient quadrature rule, this is a highly accurate approximation:

Bounds of the Gauss Lobatto (GL) approximation error for the integration of survival data (MST=Mean Survival Time).

In order for the construct to work one has to ensure that the corresponding lifetimes are mapped to a node of the integration scheme. In our paper, this was accomplished by the adoption of the **Gauss-Lobatto** rule. The nodes and weights of the Gauss-Lobatto rule (which is defined in the interval depend on the Legendre polynomials in a complex way. The following R function will calculate the weights and nodes for the N-th order Gauss Lobatto rule:

GaussLobatto<-function(N)
{
N1<-N
N<-N-1
x=matrix(cos(pi*(0:N)/N),ncol=1)
x=cos(pi*(0:N)/N)
P<-matrix(0,N1,N1)
xold<-2
while (max(abs(x-xold))>2.22044604925031e-16) {
xold<-x
P[,1]<-1
P[,2]<-x
for (k in 2:N) {
P[,k+1]=( (2*k-1)*x*P[,k]-(k-1)*P[,k-1] )/k;
}
x<-xold-( x*P[,N1]-P[,N] )/( N1*P[,N1] )
}
w<-2./(N*N1*P[,N1]^2);
ret<-list(x=rev(x),w=w)
attr(ret,"order")<-N
ret
}

which can be called to return a list of the nodes and their weights:

> GaussLobatto(5)
$x
[1] -1.0000000 -0.6546537 0.0000000 0.6546537 1.0000000
$w
[1] 0.1000000 0.5444444 0.7111111 0.5444444 0.1000000
attr(,"order")
[1] 4

To prepare a survival dataset for GAM fitting, one needs to call this function to obtain a Gauss Lobatto rule of the required order. Once this has been obtained, the following R function will expand the (right-censored) dataset to include the pseudo-observations at the nodes of the quadrature rule:

GAMSurvdataset<-function(GL,data,fu,d)
## GL : Gauss Lobatto rule
## data: survival data
## fu: column number containing fu info
## d: column number with event indicator
{
## append artificial ID in the set
data$id<-1:nrow(data)
Gllx<-data.frame(stop=rep(GL$x,length(data$id)),
gam.dur=rep(GL$w,length(data$id)),
t=rep(data[,fu],each=length(GL$x)),
ev=rep(data[,d],each=length(GL$x)),
id=rep(data$id,each=length(GL$x)),
gam.ev=0,start=0)
## Change the final indicator to what
## was observed, map node positions,
## weights from [-1,1] back to the
## study time
Gllx<-transform(Gllx,
gam.ev=as.numeric((gam.ev | ev)*I(stop==1)),
gam.dur=0.5*gam.dur*(t-start),
stop=0.5*(stop*(t-start)+(t+start)))
## now merge the remaining covariate info
Gllx<-merge(Gllx,data[,-c(fu,d)])
Gllx
}

We illustrate the use of these functions on the Primary Biliary Cirrhosis dataset that comes with R:

data(pbc)
> ## Change transplant to alive
> pbc$status[pbc$status==1]<-0
> ## Change event code of death(=2) to 1
> pbc$status[pbc$status==2]<-1
>
> head(pbc)
id time status trt age sex ascites hepato spiders edema bili chol albumin copper
1 1 400 1 1 58.76523 f 1 1 1 1.0 14.5 261 2.60 156
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302 4.14 54
3 3 1012 1 1 70.07255 m 0 0 0 0.5 1.4 176 3.48 210
4 4 1925 1 1 54.74059 f 0 1 1 0.5 1.8 244 2.54 64
5 5 1504 0 2 38.10541 f 0 1 1 0.0 3.4 279 3.53 143
6 6 2503 1 2 66.25873 f 0 1 0 0.0 0.8 248 3.98 50
alk.phos ast trig platelet protime stage
1 1718.0 137.95 172 190 12.2 4
2 7394.8 113.52 88 221 10.6 3
3 516.0 96.10 55 151 12.0 4
4 6121.8 60.63 92 183 10.3 4
5 671.0 113.15 72 136 10.9 3
6 944.0 93.00 63 NA 11.0 3
>
> GL<-GaussLobatto(5)
> pbcGAM<-GAMSurvdataset(GL,pbc,2,3)
> head(pbcGAM)
id stop gam.dur t ev gam.ev start trt age sex ascites hepato spiders
1 1 0.00000 20.0000 400 1 0 0 1 58.76523 f 1 1 1
2 1 69.06927 108.8889 400 1 0 0 1 58.76523 f 1 1 1
3 1 200.00000 142.2222 400 1 0 0 1 58.76523 f 1 1 1
4 1 330.93073 108.8889 400 1 0 0 1 58.76523 f 1 1 1
5 1 400.00000 20.0000 400 1 1 0 1 58.76523 f 1 1 1
6 2 0.00000 225.0000 4500 0 0 0 1 56.44627 f 0 1 1
edema bili chol albumin copper alk.phos ast trig platelet protime stage
1 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
2 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
3 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
4 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
5 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
6 0 1.1 302 4.14 54 7394.8 113.52 88 221 10.6 3
>
> dim(pbc)
[1] 418 20
> dim(pbcGAM)
[1] 2090 24

The original (pbc) dataset has been expanded to include the pseudo-observations at the nodes of the Lobatto rule. There are multiple records (5 per individual in this particular case) as can be seen by examining the data for the first patient (id=1). The corresponding times are found in the variable **stop, **their associated weights in the variable **gam.dur **and the event indicators are in the column **gam.ev**. Note that nodes and weights are expressed on the scale of the survival dataset, not in the scale of the Lobatto rule (). To fit the survival dataset one needs to load the mgcv package and fit a Poisson GAM, using a flexible (penalized spline) for the **log-hazard rate **function.

The following code will obtain an adjusted (for age and sex) hazard ratio using the PGAM or the Cox model:

library(survival) ## for coxph
> library(mgcv) ## for mgcv
>
> ## Prop Hazards Modeling with PGAM
> fGAM<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM,family="poisson",scale=1,method="REML")
>
> ## Your Cox Model here
> f<-coxph(Surv(time,status)~trt+age+sex,data=pbc)
>
>
> summary(fGAM)
Family: poisson
Link function: log
Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.345236 0.655176 -15.790 < 2e-16 ***
trt 0.069546 0.181779 0.383 0.702
age 0.038488 0.008968 4.292 1.77e-05 ***
sexf -0.370260 0.237726 -1.558 0.119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(stop) 1.008 1.015 4.186 0.0417 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = -0.249 Deviance explained = 2.25%
-REML = 693.66 Scale est. = 1 n = 1560
> f
Call:
coxph(formula = Surv(time, status) ~ trt + age + sex, data = pbc)
coef exp(coef) se(coef) z p
trt 0.0626 1.065 0.182 0.344 7.3e-01
age 0.0388 1.040 0.009 4.316 1.6e-05
sexf -0.3377 0.713 0.239 -1.414 1.6e-01
Likelihood ratio test=22.5 on 3 df, p=5.05e-05 n= 312, number of events= 125
(106 observations deleted due to missingness)

The estimates for log-hazard ratio of the three covariates (trt, age, and female gender) are numerically very close. Any numerical differences reflect the different assumptions made about the baseline hazard: flexible spline (PGAM) v.s. piecewise exponential (Cox).

Increasing the number of nodes of the Lobatto rule does not materially affect the estimates of the PGAM:

GL<-GaussLobatto(10)
> pbcGAM2<-GAMSurvdataset(GL,pbc,2,3)
> fGAM2<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM2,family="poisson",scale=1,method="REML")
>
> summary(fGAM2)
Family: poisson
Link function: log
Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.345288 0.655177 -15.790 < 2e-16 ***
trt 0.069553 0.181780 0.383 0.702
age 0.038487 0.008968 4.292 1.77e-05 ***
sexf -0.370340 0.237723 -1.558 0.119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(stop) 1.003 1.005 4.163 0.0416 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = -0.124 Deviance explained = 1.7%
-REML = 881.67 Scale est. = 1 n = 3120

Nevertheless, the estimates of the “baseline log-hazard” become more accurate (decreased standard errors and significance of the smooth term) as the number of nodes increases.

In simulations (see Fig 3) we show that the estimates of the hazard ratio generated by the GAM are comparable in bias, variance and coverage to those obtained by the Cox model. Even though this is an importance benchmark for the proposed method, it does not provide a compelling reason for replacing the Cox model with the PGAM. In fact, the advantages of the PGAM will only become apparent once we consider contexts which depend on the baseline hazard function, or problems in which the proportionality of hazards assumption is violated. So stay tuned.