Bayesian Linear Regression Analysis (with non-informative priors but without Monte Carlo) In R

Continuing the previous post concerning linear regression analysis with non-informative priors in R, I will show how to derive numerical summaries for the regression parameters without Monte Carlo integration. The theoretical background for this post is contained in Chapter 14 of Bayesian Data Analysis which should be consulted for more information.

The Residual Standard Deviation

The residual standard deviation \sigma is just the square root of the residual variance \sigma^2 which has a scaled inverse chi-square distribution given the data and the covariates: \sigma^2 \sim Scale-inv-\chi^2(\nu,s^2) with \nu=N-p and s^2 are the residual degrees of freedom and residual variance reported by the (frequentist) least square fit (R function lm). Possible point estimates for the residual standard deviation are the posterior mean, the mode and the median which can be obtained through (numerical) integration of the probability density function (PDF), maximization of the PDF and the quantile function respectively. The standard deviation may also be obtained via univariate numerical integration of the PDF. To obtain the latter, we apply a standard change of variables transformation to the scaled inverse chi-square PDF to obtain:

p(\sigma|y) = \frac{s^2 \nu/2}{\Gamma(\nu/2)}\times exp(-\frac{\nu s^2}{2 \sigma^2})\times \sigma^{-(n+2)}

where \Gamma(x) is the Gamma function. Comparing expressions, the Cumulative Density Function CDF of \sigma|y can be seen to be equal to the survival function of the gamma distributed random variable : 1-F_{\Gamma}(\sigma^{-2};\frac{\nu}{2},\frac{2}{\nu s^2}), where F_{\Gamma}(x;k,\theta) is the value, at x, of the CDF of the gamma variable with degrees of freedom and scale k, \theta respectively. The median (or any other quantile i.e. latex q$) can be obtained by solving the equation:

1-F_{\Gamma}(\sigma^{-2};\frac{\nu}{2},\frac{2}{\nu s^2}) = q

No closed form appears to exist for the quantiles (median, lower and upper limits of a symmetric credible interval, CrI) so that the equation above needs to be solved numerically. Since the marginal \sigma |y is a skewed distribution, the mean bounds (from above) both the median and the lower limit of thecredible (not to be confused with a confidence!) interval. Similarly, the upper limit of the CrI is bound by the corresponding limit of a Gaussian with mean and standard deviation equal to the values we previously obtained through numerical integration. These observations are used to restrict the range over which numerical optimization will be carried out, to avoid false convergence or outright failures.

The mode can be obtained by direct differentiation of the PDF and is given by the closed form expression: s\sqrt{\frac{n}{n+1}}

The Regression Coefficients

The marginal distribution of the entire coefficient vector \beta | y is a multivariate T distribution with location vector \hat{\beta} (obtained with lm) and scale matrix V_ {\beta}\sigma^2. Both these quantities may be obtained from the values returned by lm. To calculate numerical summaries for each of the components of this vector the other components need to be integrated out (marginalization). It is a basic property of the multivariate T distribution that its marginal  distributions are univariate t so that the following relation holds for each regression parameter:

\frac{\beta_i - \hat{\beta_i}}{s \sqrt{V_{\beta}[i,i]}} \sim t_{\nu}

Since this is a symmetric unimodal distribution, the mean, mode and median coincide these are all equal to the maximum likelihood estimates, while its standard deviation is available in closed form: \sqrt{\frac{n}{n-2}}. Since the quantiles univariate t are already available in R there is no need to write special code to obtain any of numerical summaries of the regression coefficients: one simply calculates the relevant quantities from the univariate t with degrees of freedom equal to \nu =n-p once, and translates/scales using the appropriate elements of the location vector and scale matrix.

The Gory Details

Two R functions are used to carry out the aforementioned tasks, i.e. one function that provides the integrand for the mean of the scaled inverse chi square distribution:

## integrand for the mean of the scaled inverse
##chisq2 with df=n, scale=t2
mn.scIX2.sqrt<-function(x,n,t2)
{
	s2<-x^2
	n.2<-n/2.0
	lx<-log(x)
	2.0*x*x*exp(-n*t2/(2.0*s2)-lgamma(n.2)+
	n.2*(log(t2)+log(n.2))-(n+2)*lx)
}

and a second (much longer!) function that receives the fitted lm object and calculates the numerical summaries for the regression coefficients and the residual standard deviation:

bayesfitAnal<-function(lmfit){
	## extract coefficients, dfs and variance
	## matrix from lm
    	QR<-lmfit$qr
    	df<-lmfit$df
    	R<-qr.R(QR) ## R component
    	coef<-lmfit$coef
    	Vb<-chol2inv(R) ## variance(unscaled)
    	s2<-(t(lmfit$residuals)%*%lmfit$residuals)
    	s2<-s2[1,1]/df
    	scale<-sqrt(s2*diag(Vb))

	## standard errors of the univariate t
	se=scale*ifelse(df>2,
	sqrt(df/(df-2)),
	ifelse(df<1,NA,Inf))
	## dataframe for returned values
    	ret<-data.frame(coef=coef,se=se,t=coef/se,
		mode=coef,median=coef,
		"CrI.2.5%"=qt(0.025,df=df),
		"CrI.97.5%"=qt(0.975,df=df))
	## CrI limits for t-distributed quantities
	ret[,6:7]<-ret[,6:7]*se+coef

	## make extra space for sigma
	ret<-rbind(ret,rep(0,7))
	rownames(ret)[3]<-"sigma"
	## mean of scale
	M1<-integrate(mn.scIX2.sqrt,n=df,
		t2=s2,lower=0,upper=Inf)$val
	## expectation of the square of the scale
	## (which is equal to the expectation of the
	## initial inverse chi square distribution
	S1<-ifelse(df<=2,NA,
		df*s2/(df-2))
	ret[3,1]<-M1	## mean
	ret[3,2]<-sqrt(S1-M1^2)	## sd
	ret[3,3]<-ret[3,1]/ret[3,2]	## t
	ret[3,4]<-sqrt(s2*df/(df+1)) ## mode
	## calculate quantiles - these take
	## place in the scale of the precision
	## median
	ret[3,5]<-uniroot(function(x) pgamma(x,
		shape=df/2,scale=2/(s2*df),
		lower.tail=FALSE)-0.5,
		lower=0,upper=1/s2)$root
	## lower limit of 95% CrI
	ret[3,6]<-uniroot(function(x) pgamma(x,
		shape=df/2,scale=2/(s2*df),
		lower.tail=FALSE)-0.025,lower=0,
		upper=1/(M1-3*ret[3,2])^2)$root
	## upper limit of 95% CrI
	ret[3,7]<-uniroot(function(x) pgamma(x,
		shape=df/2,scale=2/(s2*df),
		lower.tail=FALSE)-0.975,lower=0,
		upper=1/s2)$root
	## raise to -1/2 to change from
	## precision to standard deviations
	ret[3,5:7]<-sqrt(1/ret[3,5:7])

	ret
}

To see some action I will revisit the linear regression example introduced in the first post and compare frequentist (lm) and Bayesian estimates obtained with Monte Carlo and non-Monte Carlo approaches:

 summary(lmfit) ## Frequentist

Call:
lm(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0710 -0.4938  0.0685  0.2462  1.3690

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.2202  22.850 9.55e-15 ***
groupTrt     -0.3710     0.3114  -1.191    0.249
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared:  0.07308,   Adjusted R-squared:  0.02158
F-statistic: 1.419 on 1 and 18 DF,  p-value: 0.249

> t(apply(bf,2,Bayes.sum)) ## Bayesian (MC)
                  mean        se         t     median   CrI.2.5% CrI.97.5%
(Intercept)  5.0321144 0.2338693 21.516785  5.0326123  4.5709622 5.4947022
groupTrt    -0.3687563 0.3305785 -1.115488 -0.3706862 -1.0210569 0.2862806
sigma        0.7270983 0.1290384  5.634745  0.7095935  0.5270569 1.0299920
> bayesfitAnal(lmfit) ## Bayesian (non-MC)
                  coef        se         t       mode     median   CrI.2.5. CrI.97.5.
(Intercept)  5.0320000 0.2335761 21.543296  5.0320000  5.0320000  4.5412747 5.5227253
groupTrt    -0.3710000 0.3303265 -1.123131 -0.3710000 -0.3710000 -1.0649903 0.3229903
sigma        0.7271885 0.1295182  5.614566  0.6778158  0.7095623  0.5262021 1.0298379

The conservative, skeptical nature of the Bayesian inference (wider confidence intervals, due to proper acknowledgement of uncertainty) is evident no matter the numerical approach (Monte Carlo v.s. numerical integration) one uses for the inference. Although the numerical estimates of the regression coefficients agree (up to three decimal points) among the three numerical approaches, residual variance estimates don’t for this relatively small data set.

Monte Carlo estimates are almost as precise as the numerical integration for things like the mean and standard error of the estimated parameters, yet they lose precision for extreme quantilies. Monte Carlo is also slower, which may become an issue when fitting larger models.

That was all folks! In subsequent posts Iwill explore features of the Bayesian solution  that will (hopefully) convince you that it is a much better alternative (especially when it doesn’t bring tears to your eyes!) relative to the frequentist one. If you use any of the code in these posts, I’d appreciate if you can drop a line 🙂

Advertisements

Tags: , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: