## Extracting standard errors and treatment effects from medical journal tables (powered by R)

I decided to start blogging the R code used for some of my statistical posts, so I will start with the meta-analysis posts and move on to more difficult stuff.

As stated previously (here and here) the problem is to convert the reported relative risks(RR, $t$), 95% confidence interval ($t_L, t_U$) and p-value ($p_v$) into estimates for the log-relative risk ratio and its associated standard error for down-stream use (usually meta-analysis). Medical journals are in the bad habit of exponentiating (and rounding) the output of statistical software so that one needs to manipulate the reported estimates in order to recover the output of the statistical software. The simplest approach is one based on algebraic manipulations and disregards the finite precision of the medical journal table, treating the reported figures as exact (i.e. 1.50 is exactly 1.50 not a number between 1.495-1.504):

1. First we obtain an estimate for $t$ from the reported RR: $t_1 = \log(t)$
2. Secondly we transform the confidence interval to estimates for $t, se$:
1. $t_2 = \log(\sqrt{H_U \times H_L})$
2. $se_1 = \log(\sqrt{\frac{H_U}{H_L}}) \times (q_{0.975})^{-1}$
3. Averaging over $t_1, t_2$ yields $t=0.5 \times(t_1+t_2)$
4. A second estimate for the $se$ is obtained from the p-value using the quantile function $q_N(x)$ of the normal distribution: $se_2 = \vert{ \frac{t}{q_N(p_v)}}\vert$
5. Averaging over $se_1, se_2$ yields $= se = 0.5\times(se_1+se_2)$

The following R function can be used to carry out these transformations:

algSTE<-function(X,L,U,p)
{
v1<-c(log(X),log(sqrt(U*L)))
alg.X<-mean(v1[is.finite(v1)])
v2<-c(log(sqrt(U/L))/qnorm(0.975),abs(alg.X/qnorm(p/2)))
alg.Y<-mean(v2[is.finite(v2)])
return(list("TE"=alg.X,"se"=alg.Y))
}


A potentially more accurate approach is to acknowledge the finite precision inherent in the medical journal table for all quantities of interest (treatment effect, 95% confidence intervals and p-values). In a previous post I gave a Bayesian solution to this problem which obeys the $d$ precision of the treatment effect and confidence interval and the $d_p$ precision of the p-value. The R code for the Bayesian algorithm is the following:

</pre>
## function that extracts treatment effect (usually in log scale) and
## standard error from treatment effect (usually exponentiated log-odds ratios
## or log-relative risks), associated pci% CI and p-value rounded to
## d (TE, 95% CI) and dp (p-value) significant digits.
## This is based on a statistical simulation algorithm that draws
## N samples from the posterior distribution of these parameters
## followed by a rejection step. Due to the latter, the final number
## of samples can be <<N which may influence the accuracy of the results
## The algorithm monitors the acceptance probability and increases the
## number of points samples  (up to N*Nmax) to ensure that N points are
## actually returned for both SE and Treament Effect

##   X:  exponentiated treatment effect
## L,U: lower and upper limits of the pci (usually 95%) CI
##   p: p-value
##   d: precision of X,L,U
##  dp: precision of p value
##   N: number of Monte Carlo samples

extractTSE<-function(X,L,U,p,d,dp,pci=0.95, N=1000000,Nmax=10)
{

## Compute bounds of belief functions
a<-5*(10^(-d-1));
ap<-5*(10^(-dp-1));
Xl<-X-5*(10^(-d-1));
Xu<-X+5*(10^(-d-1));
Ll<-L-5*(10^(-d-1)); ## cannot have negative ests for pos quant
Lu<-L+5*(10^(-d-1));
Ul<-U-5*(10^(-d-1));
Uu<-U+5*(10^(-d-1));
pn<-(p-5*(10^(-dp-1)))/2
pp<-(p+5*(10^(-dp-1)))/2
pn<-max(0,pn)
pp<-min(1,pp)
pl<-abs(qnorm(pn));
pu<-abs(qnorm(pp));
pci<-abs(qnorm((1-pci)/2));
ret<-list("TE"=rep(0,N),"se"=rep(0,N),"prej"=0,"Nact"=0,"N"=0)
Nest<-0
Nsample<-0 ## how many samples were they obtained?
Nget<-N
Nmax<-Nmax*N ## maximum number of estimates to obtain
while(Nest<N & Nsample <= Nmax & Nget>0) {

ret0<-simTSE(Nget,Xl,Xu,Ll,Ul,Lu,Uu,pl,pu,pci)
## take as much as needed to fill the N slots
ntake<-min((N-Nest),ret0$Nact) ret$TE[(Nest+1):(Nest+ntake)]<-ret0$TE[1:ntake] ret$se[(Nest+1):(Nest+ntake)]<-ret0$se[1:ntake] Nest<-Nest+ntake Nsample<-Nsample+Nget Nget<-min(N,floor((N-Nest)/ret0$prej)) ## only obtain as many as anticipated
}
ret$N<-Nsample;ret$Nact<-Nest;ret$prej<-ret$Nact/ret$N ## now filter out the slots of N that were never sampled ret$TE<-head(ret$TE,ret$Nact)
ret$se<-head(ret$se,ret$Nact) ret } simTSE<-function(N,Xl,Xu,Ll,Ul,Lu,Uu,pl,pu,pci) { l<-runif(N,Ll,Lu); ## belief about lower limit of CI u<-runif(N,Ul,Uu); ## belief about upper limit of CI x1<-sqrt(u*l); y1<-sqrt(u/l); x1.t<-log(x1) y1.t<-log(y1)/pci; z<-abs(x1.t/y1.t); ## select only points satisfying the point estimate constraint sel<-I(x1>=Xl)*I(x1<=Xu) if(!is.finite(pl) ) { zlsel<-rep(1,length(z)) } else { zlsel<-I(z<=pl)} if(!is.finite(pu) ) { zusel<-rep(1,length(z)) } else { zusel<-I(z>=pu)} psel<-(zusel*zlsel)==1 ## select points satisfying both criteria selAll<-(psel*sel)==1 Nact<-sum(selAll,na.rm=T) ## the reason for the next test is to remove impossible ## sampled values i.e. negative values for the sampled values ## from the joint distro of the CI or the RR itself selAll<-selAll & !is.na(selAll) return(list("TE"=x1.t[selAll],"se"=y1.t[selAll],"prej"=Nact/N,"Nact"=Nact,"N"=N)) } <pre> The implementation is based around two functions with the user only needing to call the first one. Note that for some combinations of input values, a large number of random numbers (up to Nmax*N) have to be generated in order for the Monte Carlo algorithm to retain a certain level of precision (controlled by the number of samples N). Let’s test these two approaches from within R now in a toy example:  > d<-2 ## precision TE, 95% CI (all exponentiated) to > dp<-3 ## precision of p-value > mn<-0.2209039 ## true treatment effect (log scale) > se<-0.2761299 ## true SE > > X<-round(exp(mn),d);L<-round(exp(mn-qnorm(0.975)*se),d);U<-round(exp(mn+qnorm(0.975)*se),d) > p<-round(2*(1-pnorm(abs(mn/se))),dp) > > X ## rounded/exponentiated treatment effect [1] 1.25 > L ## lower 95% CI (rounded/exponentiated) [1] 0.73 > U ## upper 95% CI (rounded/exponentiated) [1] 2.14 > p [1] 0.424 > > moko<-extractTSE(X,L,U,p,d,dp,N=10000) > mean(moko$TE) ## Bayesian TE
[1] 0.2204252
> mean(moko$se) ## Bayesian SE [1] 0.2757051 > > moko2<-algSTE(X,L,U,p) > moko2$TE ## algebraic TE (log scale)
[1] 0.2230955
> moko2\$se ## algebraic se
[1] 0.2767075

The performance of the two approaches appears similar in this particular case!

However as we show here there will be occasions which the simple(-minded) algebraic approach will fail and the (big) Bayesian guns will be required.  Posting the source code will ensure that the Bayesian procedure becomes available to everyone who needs to carry out a meta-analysis!!