Archive for November, 2013

The little non-informative prior that could (be informative)

November 26, 2013

Christian Robert reviewed online a paper that was critical of non-informative priors. Among the points discussed by him and other contributors (e.g. Keith O’Rourke) was the issue of induced priors, i.e. priors that arise from a transformation of the original parameters, or of the observables. I found this exchange interesting because I did something similar when revisiting an old research project that had been collecting digital dust on my hard disk. The specific problem had to do with the analysis of a biomarker measured with a qualitative technique that yields a binary classification of measurements as present or absent, in two experimental conditions (call them A and B). Ignoring some technical aspects of the study design, the goal was to calculate the odds ratio of the biomarker being expressed in condition B vs. A (the reference state, signifying absence of disease).

When writing the programs for the analysis, I defaulted to the N(0.0,1.0E-6) prior that epitomizes non-informativeness in BUGS. However, one of my co-authors asked the “What the @#$%& does this prior mean?” question. And then we stopped … and reflected on what we were about to do. You see, before the experiment started we had absolutely no prior information about the behaviour of the biomarker in either experimental state, so we did not want to commit one way or the other. In other words, Laplace’s original uniform (or Beta(1,1)) prior would have been reasonable if the expression data for A and B were to be analyzed separately. However, we wanted to analyze the data with a logistic regression model, so was the ubiquitous N(0.0,1.0E-6) the prior we were after?
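Recall that BUGS parameterizes the normal in terms of precision, so N(0.0,1.0E-6) corresponds to a standard deviation of 1000 on the logit scale. A quick simulation sketch of the prior this induces on the probability scale:

## induced prior on Pr(biomarker expressed in A) under the
## "non-informative" N(0.0,1.0E-6) (i.e. sd = 1000) intercept prior
set.seed(42)
p<-plogis(rnorm(100000,0,1000))
hist(p,breaks=50,main="Induced prior on Pr(expressed in A)",
     xlab="inverse logit of the intercept")
## virtually all of the mass piles up next to 0 and 1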

The answer is a loud NO! According to Wikipedia, the mother of all knowledge, the logit transform of a uniform variate follows the logistic distribution with location 0 and scale 1. Hence, the prior on the intercept of the logistic regression (interpretable as the log odds of the biomarker being expressed in state A) had to be a Logistic(0,1).

[Figure: histogram of logit(U(0,1)) draws with the Logistic(0,1) and moment-matched normal densities overlaid]

Surprisingly, the induced prior on the log odds ratio of B vs. A (i.e. logit(B) − logit(A)) was found (after trial and error and method-of-moments considerations) to be very well approximated by a 1:1 mixture of a logistic and a Gaussian, which clearly departs from the N(0.0,1.0E-6) prior we (almost) used:

[Figure: histogram of logit(B) − logit(A) for independent U(0,1) variates, with the logistic, normal and 1:1 mixture approximations overlaid]

Bottom line: Even informative (in the BUGS sense!) priors can be pretty non-informative in some intuitively appropriate parameterization. Conversely, one can start with a non-informative prior in a parameterization that is easier to reason about and derive the induced prior (using analytic considerations or even simulations) in the parameterization that is more appropriate for the analysis at hand.

(R code for the plots and simulations is given below)

## approximating uniforms
logit<-function(x) log(x/(1-x))
set.seed(1234)
N<-10000000
s<-runif(N,0,1);
s2<-runif(N,0,1);
y<-logit(s)
y2<-logit(s2)
m<-mean(y)
s<-sd(y)
x<-seq(-10,10,.1)
## logistic is logit of a uniform
hist(y,prob=TRUE,breaks=50,main="intercept",
     xlab="logit(A)")
lines(x,dnorm(x,m,s),col="red")  ## moment-matched normal (sd ~ pi/sqrt(3))
lines(x,dlogis(x,0,1),col="blue")
legend(-15,0.20,legend=c("Normal (moment-matched)",
      "Logistic(0,1)"),lty=1,col=c("red","blue") )

## approximating the difference of two uniforms
hist(y-y2,prob=TRUE,ylim=c(0,.25),breaks=200,
     xlim=c(-10,10),main="OR between two U(0,1)",
     xlab="logit(B)-logit(A)")
## logistic approximation
lines(x,dlogis(x,0,sqrt(2)),col="blue",lwd=2)
## normal
lines(x,dnorm(x,0,(pi)*sqrt(2/3)),col="red",lwd=2)
## mixture of a logistic and a normal approximation
lines(x,0.5*(dlogis(x,0,sqrt(2))+
     dnorm(x,0,(pi)*sqrt(2/3))),col="green",lwd=2)
## legends
NL<-expression(paste("Normal(0,",pi*sqrt(2/3),")"))
LL<-expression(paste("Logistic(0,",sqrt(2),")"))
ML<-expression(paste("0.5 Normal(0,",pi*sqrt(2/3),")+0.5 Logistic(0,",sqrt(2),")"))
legend(-6.5,0.25,legend=c(NL,LL,ML),
       lty=1,col=c("red","blue","green") )

## does it extend to more general cases?
m1<--2;m2<-2;s1<-1;s2<-2.5;
l1<-rlogis(N,m1,s1)
l2<-rlogis(N,m2,s2)
d<-l1-l2
hist(d,prob=TRUE,ylim=c(0,0.25),breaks=200)
plot(density(d))
lines(x,dlogis(x,m1-m2,sqrt(s1^2+s2^2)),col="green",lwd=2)
lines(x,dnorm(x,m1-m2,pi*sqrt((s1^2+s2^2)/3)),col="red",lwd=2)
lines(x,0.5*(dnorm(x,m1-m2,pi*sqrt((s1^2+s2^2)/3))+
             dlogis(x,m1-m2,sqrt(s1^2+s2^2))),col="blue",lwd=2)


Edit (29/11/2013):
Updated the first image due to an accidental reversal of the distribution labels

Bayesian Linear Regression Analysis (with non-informative priors but without Monte Carlo) In R

November 24, 2013

Continuing the previous post concerning linear regression analysis with non-informative priors in R, I will show how to derive numerical summaries for the regression parameters without Monte Carlo integration. The theoretical background for this post is contained in Chapter 14 of Bayesian Data Analysis which should be consulted for more information.

The Residual Standard Deviation

The residual standard deviation \sigma is just the square root of the residual variance \sigma^2, which has a scaled inverse chi-square distribution given the data and the covariates: \sigma^2 \sim Scale-inv-\chi^2(\nu,s^2), where \nu=N-p and s^2 are the residual degrees of freedom and the residual variance reported by the (frequentist) least squares fit (R function lm). Possible point estimates for the residual standard deviation are the posterior mean, mode and median, which can be obtained through (numerical) integration of the probability density function (PDF), maximization of the PDF and inversion of the quantile function, respectively. The posterior standard deviation may also be obtained via univariate numerical integration. All of these require the density of \sigma itself, which follows from a standard change-of-variables transformation of the scaled inverse chi-square PDF:

p(\sigma|y) = \frac{2 (\nu s^2/2)^{\nu/2}}{\Gamma(\nu/2)}\times \sigma^{-(\nu+1)}\times exp(-\frac{\nu s^2}{2 \sigma^2})

where \Gamma(x) is the Gamma function. Comparing expressions, the cumulative distribution function (CDF) of \sigma|y can be seen to be equal to the survival function of a gamma-distributed random variable evaluated at the precision: 1-F_{\Gamma}(\sigma^{-2};\frac{\nu}{2},\frac{2}{\nu s^2}), where F_{\Gamma}(x;k,\theta) is the value, at x, of the CDF of the gamma variable with shape k and scale \theta. The median (or any other quantile q) can be obtained by solving the equation:

1-F_{\Gamma}(\sigma^{-2};\frac{\nu}{2},\frac{2}{\nu s^2}) = q

No closed form appears to exist for the quantiles (median, lower and upper limits of a symmetric credible interval, CrI), so the equation above needs to be solved numerically. Since the marginal \sigma|y is a skewed distribution, the mean bounds (from above) both the median and the lower limit of the credible (not to be confused with a confidence!) interval. Similarly, the upper limit of the CrI is bounded by the corresponding limit of a Gaussian with mean and standard deviation equal to the values we previously obtained through numerical integration. These observations are used to restrict the range over which the numerical root-finding is carried out, to avoid false convergence or outright failures.
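As a cross-check of the uniroot solutions used in the code further down, the same relation can be inverted directly through R’s qgamma (which carries out the numerical inversion of the gamma CDF internally). A minimal sketch, where the helper sigma.q is purely illustrative and nu, s2 denote the residual degrees of freedom and residual variance:

## q-th posterior quantile of sigma via the gamma survival function
sigma.q<-function(q,nu,s2)
	1/sqrt(qgamma(q,shape=nu/2,scale=2/(nu*s2),lower.tail=FALSE))
## e.g. median and symmetric 95% CrI:
## sigma.q(c(0.5,0.025,0.975),nu,s2)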

The mode can be obtained by direct differentiation of the PDF and is given by the closed-form expression: s\sqrt{\frac{\nu}{\nu+1}}

The Regression Coefficients

The marginal distribution of the entire coefficient vector \beta | y is a multivariate t distribution with location vector \hat{\beta} (the estimate obtained with lm) and scale matrix s^2 V_{\beta}. Both these quantities may be obtained from the values returned by lm. To calculate numerical summaries for each of the components of this vector, the other components need to be integrated out (marginalization). It is a basic property of the multivariate t distribution that its marginal distributions are univariate t, so that the following relation holds for each regression parameter:

\frac{\beta_i - \hat{\beta_i}}{s \sqrt{V_{\beta}[i,i]}} \sim t_{\nu}

Since this is a symmetric unimodal distribution, the mean, mode and median coincide and are all equal to the maximum likelihood estimate, while the standard deviation of the standardized quantity above is available in closed form: \sqrt{\frac{\nu}{\nu-2}} (which is then multiplied by the scale factor s\sqrt{V_{\beta}[i,i]}). Since the quantiles of the univariate t are already available in R, there is no need to write special code to obtain any of the numerical summaries of the regression coefficients: one simply calculates the relevant quantities from the univariate t with \nu = N - p degrees of freedom once, and translates/scales them using the appropriate elements of the location vector and scale matrix.
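A minimal sketch of this translate/scale step for a single coefficient (assuming lmfit holds the lm fit and i indexes the coefficient of interest; the function in the next section packages the same computation for all coefficients and for \sigma):

## 95% credible interval for the i-th regression coefficient
nu<-lmfit$df.residual
Vb<-chol2inv(qr.R(lmfit$qr))    ## unscaled (X'X)^-1
s2<-sum(residuals(lmfit)^2)/nu  ## residual variance
i<-1
coef(lmfit)[i]+qt(c(0.025,0.975),df=nu)*sqrt(s2*Vb[i,i])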

The Gory Details

Two R functions are used to carry out the aforementioned tasks, i.e. one function that provides the integrand for the mean of the square root of a scaled inverse chi-square variate (the posterior mean of \sigma):

## integrand for the posterior mean of sigma, i.e. of the square root
## of a scaled inverse chi-square variate with df=n and scale=t2
mn.scIX2.sqrt<-function(x,n,t2)
{
	s2<-x^2
	n.2<-n/2.0
	lx<-log(x)
	2.0*x*x*exp(-n*t2/(2.0*s2)-lgamma(n.2)+
	n.2*(log(t2)+log(n.2))-(n+2)*lx)
}

and a second (much longer!) function that receives the fitted lm object and calculates the numerical summaries for the regression coefficients and the residual standard deviation:

bayesfitAnal<-function(lmfit){
	## extract coefficients, dfs and variance
	## matrix from lm
    	QR<-lmfit$qr
    	df<-lmfit$df
    	R<-qr.R(QR) ## R component
    	coef<-lmfit$coef
    	Vb<-chol2inv(R) ## variance(unscaled)
    	s2<-(t(lmfit$residuals)%*%lmfit$residuals)
    	s2<-s2[1,1]/df
    	scale<-sqrt(s2*diag(Vb))

	## standard errors of the univariate t
	se=scale*ifelse(df>2,
	sqrt(df/(df-2)),
	ifelse(df<1,NA,Inf))
	## dataframe for returned values
    	ret<-data.frame(coef=coef,se=se,t=coef/se,
		mode=coef,median=coef,
		"CrI.2.5%"=qt(0.025,df=df),
		"CrI.97.5%"=qt(0.975,df=df))
	## CrI limits for t-distributed quantities
	ret[,6:7]<-ret[,6:7]*se+coef

	## make extra space for sigma (the row index below is
	## hard-coded for the two-coefficient example that follows)
	ret<-rbind(ret,rep(0,7))
	rownames(ret)[3]<-"sigma"
	## mean of scale
	M1<-integrate(mn.scIX2.sqrt,n=df,
		t2=s2,lower=0,upper=Inf)$val
	## expectation of the square of sigma (i.e. the mean of the
	## scaled inverse chi-square distribution itself)
	S1<-ifelse(df<=2,NA,
		df*s2/(df-2))
	ret[3,1]<-M1	## mean
	ret[3,2]<-sqrt(S1-M1^2)	## sd
	ret[3,3]<-ret[3,1]/ret[3,2]	## t
	ret[3,4]<-sqrt(s2*df/(df+1)) ## mode
	## calculate quantiles - these take
	## place in the scale of the precision
	## median
	ret[3,5]<-uniroot(function(x) pgamma(x,
		shape=df/2,scale=2/(s2*df),
		lower.tail=FALSE)-0.5,
		lower=0,upper=1/s2)$root
	## lower limit of 95% CrI
	ret[3,6]<-uniroot(function(x) pgamma(x,
		shape=df/2,scale=2/(s2*df),
		lower.tail=FALSE)-0.025,lower=0,
		upper=1/(M1-3*ret[3,2])^2)$root
	## upper limit of 95% CrI
	ret[3,7]<-uniroot(function(x) pgamma(x,
		shape=df/2,scale=2/(s2*df),
		lower.tail=FALSE)-0.975,lower=0,
		upper=1/s2)$root
	## raise to -1/2 to change from
	## precision to standard deviations
	ret[3,5:7]<-sqrt(1/ret[3,5:7])

	ret
}

To see some action I will revisit the linear regression example introduced in the first post and compare frequentist (lm) and Bayesian estimates obtained with Monte Carlo and non-Monte Carlo approaches:

> summary(lmfit) ## Frequentist

Call:
lm(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0710 -0.4938  0.0685  0.2462  1.3690

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.2202  22.850 9.55e-15 ***
groupTrt     -0.3710     0.3114  -1.191    0.249
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared:  0.07308,   Adjusted R-squared:  0.02158
F-statistic: 1.419 on 1 and 18 DF,  p-value: 0.249

> t(apply(bf,2,Bayes.sum)) ## Bayesian (MC)
                  mean        se         t     median   CrI.2.5% CrI.97.5%
(Intercept)  5.0321144 0.2338693 21.516785  5.0326123  4.5709622 5.4947022
groupTrt    -0.3687563 0.3305785 -1.115488 -0.3706862 -1.0210569 0.2862806
sigma        0.7270983 0.1290384  5.634745  0.7095935  0.5270569 1.0299920
> bayesfitAnal(lmfit) ## Bayesian (non-MC)
                  coef        se         t       mode     median   CrI.2.5. CrI.97.5.
(Intercept)  5.0320000 0.2335761 21.543296  5.0320000  5.0320000  4.5412747 5.5227253
groupTrt    -0.3710000 0.3303265 -1.123131 -0.3710000 -0.3710000 -1.0649903 0.3229903
sigma        0.7271885 0.1295182  5.614566  0.6778158  0.7095623  0.5262021 1.0298379

The conservative, skeptical nature of the Bayesian inference (wider interval estimates, due to proper acknowledgement of uncertainty) is evident no matter which numerical approach (Monte Carlo vs. numerical integration) one uses for the inference. Although the numerical estimates of the regression coefficients agree (up to three decimal places) among the three approaches, the residual variance estimates do not for this relatively small data set.

Monte Carlo estimates are almost as precise as numerical integration for quantities like the mean and standard error of the estimated parameters, yet they lose precision for extreme quantiles. Monte Carlo is also slower, which may become an issue when fitting larger models.
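As a rough illustration of the tail behaviour, the following sketch samples \sigma from its known posterior in the example above (residual df of 18 and residual standard error of 0.6964 from the lm summary) and compares the Monte Carlo quantiles against the directly inverted ones:

## Monte Carlo vs. directly inverted quantiles of sigma
nu<-18; s2<-0.6964^2  ## from the lm summary above
set.seed(1)
sig.mc<-1/sqrt(rgamma(1000,shape=nu/2,scale=2/(nu*s2)))
quantile(sig.mc,c(0.025,0.5,0.975))  ## Monte Carlo (1,000 draws)
1/sqrt(qgamma(c(0.025,0.5,0.975),shape=nu/2,
	scale=2/(nu*s2),lower.tail=FALSE))  ## numerical inversion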

That was all folks! In subsequent posts I will explore features of the Bayesian solution that will (hopefully) convince you that it is a much better alternative to the frequentist one (especially when it doesn’t bring tears to your eyes!). If you use any of the code in these posts, I’d appreciate it if you dropped me a line 🙂

Kicking ass with Bayesian Statistics in R

November 22, 2013

Some excellent R posts regarding Bayesian statistics:

1. How to program the Laplace approximation in R:

http://www.r-bloggers.com/easy-laplace-approximation-of-bayesian-models-in-r/

Though Bayesian computation is heavily dominated by Monte Carlo methods, the Laplace approximation is a nice tool to deploy in cases where your MCMC fails to converge (a bare-bones sketch of the idea is given at the end of this post). Plus, it makes one appreciate Laplace’s genius.

2. A bird’s eye view of R’s Bayesian analysis facilities:

http://blog.revolutionanalytics.com/2013/11/r-and-bayesian-statistics.html

Watch this blog for a series of posts about Bayesian survival analysis with R, BUGS and stan.
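For reference, a bare-bones sketch of the idea (my own minimal version, not the code from the linked post): maximize the log posterior with optim and take the inverse of the negative Hessian at the mode as the covariance of the Gaussian approximation.

## minimal Laplace approximation: logpost returns the unnormalized
## log posterior; the result is the mode and the Gaussian covariance
laplace.approx<-function(logpost,init,...){
	fit<-optim(init,logpost,method="BFGS",
		control=list(fnscale=-1),hessian=TRUE,...)
	list(mode=fit$par,vcov=solve(-fit$hessian))
}
## e.g. a Beta(5,15)-like posterior expressed on the logit scale:
## lp<-function(eta) 5*eta-20*log(1+exp(eta))
## laplace.approx(lp,init=0)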

Bayesian linear regression analysis without tears (R)

November 17, 2013

Bayesian methods are sure to get some publicity after Valen Johnson’s PNAS paper regarding the use of Bayesian approaches to recalibrate p-value cutoffs from 0.05 to 0.005. Though the paper itself is bound to get some heat (see the discussion in Andrew Gelman’s blog and Matt Briggs’s fun-to-read deconstruction), the controversy might stimulate people to explore Bayesianism and (hopefully!) to move away from frequentist analyses. (more…)

Extracting standard errors and treatment effects from medical journal tables (powered by R)

November 10, 2013

I decided to start blogging the R code used for some of my statistical posts, so I will start with the meta-analysis posts and move on to more difficult stuff.

As stated previously (here and here), the problem is to convert the reported relative risk (RR, t), 95% confidence interval (t_L, t_U) and p-value (p_v) into estimates of the log relative risk and its associated standard error for downstream use (usually meta-analysis). Medical journals are in the bad habit of exponentiating (and rounding) the output of statistical software, so one needs to manipulate the reported estimates in order to recover that output. (more…)
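For a flavour of the back-calculation, here is a minimal sketch with hypothetical reported values (handling of rounding and of the reported p-value is covered in the linked posts):

## recover log(RR) and its standard error from a reported RR and 95% CI
## (hypothetical numbers, for illustration only)
t.RR<-1.50; t.L<-1.10; t.U<-2.05
logRR<-log(t.RR)
se<-(log(t.U)-log(t.L))/(2*qnorm(0.975))
c(logRR=logRR,se=se)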

Page Rev Bayes – we found statistical irregularities in a randomized controlled trial

November 9, 2013

The Bayesian counterpart to the frequentist analysis of the randomized controlled trial is in many respects more straightforward than the frequentist one. One starts with a prior distribution for the probabilities of a patient being assigned to each of the three arms and combines it with the (multinomial) likelihood of observing a given assignment pattern in the 240 patients enrolled in the study. Bayes’ theorem gives the posterior distribution quantifying our belief about the magnitudes of the unknown assignment probabilities. Note that testing the strict equality of the three probabilities is bound to lead us straight into the arms of the Lindley paradox, so a different approach is likely to be more fruitful. Specifically, we specify a maximum tolerable threshold for the difference between the largest and the smallest probability of being assigned to the trial arms (say 1-5%) and directly calculate the posterior probability that the assignment probabilities stay within this tolerance; its complement is the “probability of foul play”.

In the absence of prior evidence for (or against) foul play we use a non-informative prior in which all possible values of the assignment probabilities are equally plausible. This (Dirichlet(1,1,1)) prior corresponds to a prior state of knowledge in which three individuals were randomized and all three ended up in different treatment arms. Under this prior, the posterior distribution is itself a Dirichlet distribution with parameters equal to the number of individuals actually assigned to each arm plus one. The following R code may then be used to calculate the probability of a (near-)balanced assignment, whose complement gives the “probability of foul play” defined above:

library(gtools) ## provides rdirichlet (MCMCpack offers an equivalent)
event<-c(105,70,65)
set.seed(1234)
r<-rdirichlet(10000,event+1)
## posterior probability that the assignment probabilities differ
## by no more than the tolerance (here 1%)
res0<-mean(apply(r,1,function(x,tol)
I(abs(max(x)-min(x))<=tol),0.01))
res0*100
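A one-liner (using res0 from the code above) then gives the odds quoted below:

## odds against a (near-)balanced allocation, i.e. the odds of "foul play"
(1-res0)/res0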

This probability comes down to 0.4%, which is numerically close to the frequentist answer, yet with a more intuitive interpretation: based on the observed arm sizes and a tolerance for the maximum acceptable difference in assignment probabilities, the odds for “foul play” are 249:1.
Increasing the tolerance will obviously decrease these odds, but in such a case we would be willing to tolerate larger differences in assignment probabilities. Although these results are mathematically trivial (and non-controversial), matters become more convoluted when one proceeds to use them to make a declaration of “foul play”. For in that case a decision needs to be made, which has to consider not only the probabilities of the uncertain events “foul play” vs. “no foul play” but also the consequences for the journal, the study investigators and the scientific community at large. At this level one would need to decide whether odds of 249:1 are high enough for subsequent action to be taken. But this consideration takes us into the realm of decision theory (and it is already 11 pm).

The utility of frequentist statistics (in a single picture)

November 8, 2013

Ted Bunn nailed it


Time to move beyond Laplace and what prior he would have used, had he been alive today.

Failed Randomization In A Randomized Trial?

November 5, 2013

We will continue the saga of the three-arm clinical trial that is giving the editors of the prestigious journal The Spleen a run for their money. While the polls are gathering digital dust, let’s see if we can steer this discussion down a more quantitative path. To do so, we will ask (and answer) the question from a frequentist point of view; according to this approach we raise the red flag if the event under examination would be rare assuming that a hypothesis about the state of the world (the null hypothesis H_0) is true.

In this case the null hypothesis is that the investigators at Grand Fenwick Memorial did run a randomized controlled trial under a simple randomization scheme, in which each patient had an equal chance of being given one of the three interventions: GML, SL or MBL. To calculate the rarity of the observed pattern, we need to define an appropriate event and then figure out its rarity (“long-term frequency”) over many repetitions of the randomization scheme used in the trial.

Considering the number of patients in the three arms of the trial, 105/70/65, vs. the expectation of 80/80/80, it would appear that the most influential factor in determining the “rarity” of the observed pattern is the difference in size between the largest and the smallest arm of the trial. On the other hand, a difference of 5 between the second largest and the smallest arms would not appear to be worth considering, at least as a first approximation. To determine the long-term frequency of the event in a trial with 240 patients, we will use the R language to carry out a large number of these hypothetical allocations and figure out the proportion in which the difference in size between the largest and smallest arms reaches or exceeds 40:

event<-c(105,70,65)  ## observed pattern
## computes the difference in size between the largest
## and smallest arms and flags differences >= l1
frequentist2<-function(x,l1=40) {
x<-sort(x,decreasing=TRUE)
I((x[1]-x[3])>=l1)
}
set.seed(4567) ## for reproducibility
## hypothetical trials of 240 patients, equal allocation probabilities
g<-t(rmultinom(500000,sum(event),c(1,1,1)))
## flags the repetitions of the studies in which the rare
## event was observed and calculates its frequency (in %)
res3<-apply(g,1,frequentist2);mean(res3)*100
 

This number comes out to be 0.5%. In other words, 1 out of 200 randomized trials that assign patients with equal probability to three arms will generate an imbalance of this magnitude.
But is this the answer we are trying to obtain? The situation the editors of The Spleen face is to evaluate the likelihood that patients were not randomly assigned to the three interventions. This evaluation is only indirectly related to the rarity of observing a large size difference between the arms of a trial in which no one cheated. By not directly considering the hypothesis of foul play (unequal allocation probabilities in the three arms), both the investigators and their accusers will find themselves in an endless quarrel about the interpretation of this rarity: a chance finding vs. an improbable one indicative of fraud.

The probability of stacked mass murders is not so small

November 4, 2013

Christian Robert estimates the probability of the observed pattern of 4 mass murders in 4 days and finds it to be 2.8% per year, or 18% in any seven-year period!

http://xianblog.wordpress.com/2013/11/04/unusual-timing-shows-how-random-mass-murder-can-be-or-not/

Diagrams for hierarchical models

November 2, 2013

A nice alternative to BUGS-like diagrams.
It can communicate distributional assumptions and the grouping of variables, and seems a good alternative to including code in long online supplements!

http://t.co/wOyu1mSSaY