Archive for the ‘Uncategorized’ Category

Sequential Fitting Strategies for Models of Short RNA Sequencing Data

June 18, 2017

After a (really long!) hiatus I am reactivating my statistical blog. The first article clarifies a point made in the manual of our recently published statistical model for short RNA sequencing data.
The background for this post, in case one wants to skip reading the manuscript (please do read it!), centers on the limitations of existing methods for the analysis of data for this very promising class of biomarkers. To overcome these limitations, our group, comprising investigators from the Division of Nephrology at the University of New Mexico and the Galas Lab at the Pacific Northwest Research Institute, introduced a novel method for the analysis of short RNA sequencing (sRNAseq) data. This method (RNASeqGAMLSS), which was derived from first-principles modeling of the short RNAseq process, was shown to have a number of desirable properties in an analysis of nearly 200 public and internal datasets:

  1. It can quantify the intrinsic, sequencing-specific bias of sRNAseq from calibration runs of synthetic, equimolar mixes of the target short RNAs (e.g. microRNAs)
  2. It can use such estimates to correct for the bias present in experimental samples of different composition and input than the calibration runs. This in turn opens the way for the use of short RNAseq measurements in personalized medicine applications (as explained here)
  3. Adapted to the problem of differential expression analysis, our method exhibited greater accuracy and higher sensitivity and specificity than six existing algorithms (DESeq2, edgeR, EBSeq, limma, DSS, voom) for the analysis of short RNAseq data.
  4. In contrast to these popular methods, which force the expression profiles to have a certain form of symmetry (equal number and magnitude of over-expressed and under-expressed sequences), our method can be used to discover global, directional changes in expression profiles that are missed by the aforementioned methods. Accounting for such a possibility may be appropriate in certain instances in which the disease process leads to a loss or gain in the number of cells of origin of the affected organ/tissue.

The proposed methodology, which is based on Generalized Additive Models for Location, Scale and Shape (GAMLSS), involves fitting simultaneous regression models for the location (mean) and the scale (dispersion) of sequence counts, using either the Negative Binomial or a particular parameterization of the Normal distribution. However, there is a price to pay for the advantages of RNASeqGAMLSS over alternatives: it comes in the form of a small (but not infinitesimal) probability[1] that the fitting algorithm will fail to execute successfully. In the manual of our method (Section 6.4) we explain that a numerically more stable way of fitting these complex models exists, and it should be adopted if one encounters numerical optimization errors with the default approach used in the Nucleic Acids Research (NAR) manuscript. The three steps of this sequential fitting strategy are as follows:

  1. One fits a Poisson count mixed model to the RNAseq data to obtain estimates of the relative mean expression of the different short RNA species in the expression profile.
  2. These estimates are used to fix the values of the mean submodel parameters of the RNASeqGAMLSS model while the dispersion parameters are estimated.
  3. Finally, one uses the values of the mean (Step 1) and dispersion (Step 2) parameters as starting values to fit the general RNASeqGAMLSS model.

In essence, one ignores the overdispersion (additional variability) of the short RNAseq data (Step 1) in order to guide the algorithm toward estimates of the dispersion parameters (Step 2). Finally, one uses the separate estimates of the mean (Step 1) and dispersion (Step 2) parameters as an initial point for the simultaneous estimation of both (Step 3). This approach works because the dispersion parameters do not affect the mean parameters, so the Poisson model of Step 1 has the same mean as the RNASeqGAMLSS model. Hence the estimates produced by Step 1 are identical (up to numerical precision) to those that would have been produced by a successful execution of the RNASeqGAMLSS optimization algorithms. Fixing these values when fitting the RNASeqGAMLSS model in Step 2 facilitates estimation of the dispersion parameters. Having very good initial guesses for these parameters virtually guarantees convergence of the third step (which is the only step in the NAR paper).
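The sequential logic can be illustrated in miniature with a single negative binomial sample. This is a toy sketch of the staging idea only, not the RNASeqGAMLSS model itself:

```r
## Toy illustration of the three-step strategy on one negative binomial sample
set.seed(42)
y <- rnbinom(500, mu = 20, size = 2)

## Step 1: a Poisson fit ignores overdispersion but still recovers the mean
mu_hat <- exp(coef(glm(y ~ 1, family = poisson)))

## Step 2: with the mean held fixed, estimate the dispersion (size) alone
nll_phi <- function(log_size)
  -sum(dnbinom(y, mu = mu_hat, size = exp(log_size), log = TRUE))
size_hat <- exp(optimize(nll_phi, interval = c(-5, 5))$minimum)

## Step 3: joint fit of both parameters, started at the staged estimates
nll <- function(p) -sum(dnbinom(y, mu = exp(p[1]), size = exp(p[2]), log = TRUE))
fit <- optim(c(log(mu_hat), log(size_hat)), nll)
exp(fit$par)  ## close to the true (mu, size) = (20, 2)
```

Because Step 3 starts where Steps 1 and 2 ended, the joint optimizer only needs to make a small, well-conditioned adjustment.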

A fully worked-out example is shown below (the data used in the NAR paper, the source code, and a manual with instructions for compiling the RNASeqGAMLSS source code and Win64 DLL libraries are all available in the BitBucket repository for this project).

First, we load the data, the C++ libraries and extract the data to the two groups we would like to compare :

library(TMB) ## the TMB framework for optimization
## load the DE C++ libraries
## Note about the form of data storage for use with this software:
## the long format should be employed when storing microRNA data for
## GAMLSS types of analyses. In this format, the data are arranged in columns:
## - column miRNA : gives the name of the miRNA
## - column reads : reads of the miRNA from a *single sample*
## - column SampleID : the sample in which the reads were recorded
## this is called long format because data from experiments are stored in
## different rows (rather than appearing as consecutive columns)
## loads the data from the 286 series
## Obtain data for GAMLSS - we will compare the two ratiometric series
datRat<-subset(dat286.long,(Series=="RatioB" | Series =="RatioA") & Amount=="100 fmoles")

u_X<-as.numeric(factor(datRat$miRNA)) ## maps readings to the identity of the miRNA
u_G<-as.numeric(factor(datRat$Series)) ## maps counts to group
y=datRat$reads ## extract the actual counts
X<-model.matrix(~Series,data=datRat) ## design matrix (ANOVA) for group comparisons

Secondly, we fit the Poisson model (Step 1), using the facilities of the lme4 R package:

## fit the parameters for the mu submodel using the poisson GLM
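The lme4 call itself is elided in the snippet above. A plausible sketch is shown below on a simulated stand-in for datRat (the real long-format data live in the BitBucket repository, and the exact model formula of the NAR analysis is given in the manual, so both the data and the formula here are assumptions):

```r
library(lme4)
## Simulated stand-in for the long-format datRat built earlier
set.seed(1)
toy <- data.frame(miRNA  = factor(rep(sprintf("miR-%02d", 1:15), times = 4)),
                  Series = factor(rep(c("RatioA", "RatioB"), each = 30)),
                  reads  = rnbinom(60, mu = 100, size = 5))
## Step 1 sketch: Poisson mixed model with a fixed group effect and a random
## intercept per miRNA capturing each species' relative mean expression
gl <- glmer(reads ~ Series + (1 | miRNA), data = toy, family = poisson)
b  <- fixef(gl)  ## overall group means, later used as the mu submodel values
```

The random intercepts (ranef(gl)) supply the per-miRNA deviations around the group means that are carried into Step 2.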

Then we extract the values of these parameters and use them to fix the values of the mean submodel (Step 2):

## initializes standard deviation of RE for the mu submodel
sigsig=rep(1,max(u_G)) ## initializes standard deviation of RE for the phi submodel
b=fixef(gl) ## initial values for the overall group means (mean submodel)
## initial values for the variation of miRNAs around their group mean (mean submodel)
## Very rough initial values for the phi submodel parameters
s_b=rep(0,ncol(X)) ## initial values for the overall group means (phi submodel)
## initial values for the variation of miRNAs around their group mean (phi submodel)
u_s= matrix(0,max(u_X),max(u_G))
## MAP list that allows us to fix some parameters to their values
## construct the AD object - note that we fix the mu parameters at their
## values while estimating the phi submodel
## parameter estimation - note errors may be generated during some iterations
## obtain the report on the parameters to extract the fitted values of the gamlss model
u_s = matrix(summary(rep,"random",p.value=FALSE)[,1],ncol=max(u_G))
dummy<-summary(rep,"fixed",p.value=FALSE)[,1] ## parameter estimates
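The MakeADFun and optimizer calls are elided in the snippet above (their exact arguments depend on the project's C++ template). The fix-then-free mechanism they rely on, TMB's map argument, can be illustrated with base R's stats4::mle, whose fixed argument plays the same role; this is a toy negative binomial model, not the RNASeqGAMLSS objective:

```r
library(stats4)
set.seed(7)
y <- rnbinom(300, mu = 10, size = 1.5)
nll <- function(logmu, logsize)
  -sum(dnbinom(y, mu = exp(logmu), size = exp(logsize), log = TRUE))
## analogue of Step 2: hold the mean fixed (cf. TMB's `map`) and fit dispersion
fit_phi <- mle(nll, start = list(logsize = 0),
               fixed = list(logmu = log(mean(y))))
## analogue of Step 3: free both parameters, starting from the staged values
fit_all <- mle(nll, start = list(logmu   = log(mean(y)),
                                 logsize = unname(coef(fit_phi)["logsize"])))
exp(coef(fit_all))  ## close to the true (mu, size) = (10, 1.5)
```

In TMB the same effect is achieved by passing map = list(parameter = factor(NA)) for each parameter block that should be held at its supplied value.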

Finally, we refit the model letting all parameters vary:

## scale objective by the magnitude of the deviance of the fitted Poisson model
## obtain the report on the parameters
## differential expression ratios, standard errors z and p-values
## rownames are the miRNAs; columns are estimates, standard error, z value and pvalue

## the final estimates with their standard errors

These steps (and the RNASeqGAMLSS code) are going to be incorporated into an upcoming Bioconductor package for the analysis of short RNA sequencing data by Dr. Lorena Pantano. Until this package becomes available, the aforementioned code snippets may be adapted very easily to one's application by changing the names of the columns corresponding to the RNA species, sample identifiers and experimental groups.

1. This is particularly likely when the underlying software implementations are not compiled against the Intel® Math Kernel Library.

Why should I be Bayesian when my model is wrong?

May 9, 2017

Excellent discussion and some nice references about Bayesian modelling in the context of model misspecification.

My 2015 blogging report

December 29, 2015

The Weibull distribution is useful but its parameterization is confusing

June 5, 2015

The Weibull distribution is a very useful generalization of the exponential distribution that frequently appears in the analysis of survival times and extreme events. Nevertheless, it is a confusing distribution to use because of the different parameterizations that one finds in the literature.
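As a concrete illustration (my own sketch, not taken from any particular reference): R's dweibull uses the shape/scale form, while many texts use shape/rate with rate = 1/scale, so it pays to check which convention a given formula assumes.

```r
## R parameterizes the Weibull density by shape a and scale b:
##   f(x) = (a/b) * (x/b)^(a-1) * exp(-(x/b)^a)
## Sources that use a "rate" lambda mean lambda = 1/b.
a <- 2; b <- 3; x <- 1.7
manual <- (a / b) * (x / b)^(a - 1) * exp(-(x / b)^a)
stopifnot(all.equal(dweibull(x, shape = a, scale = b), manual))

## sanity check of the mean formula E[X] = b * Gamma(1 + 1/a)
m <- integrate(function(t) t * dweibull(t, a, b), 0, Inf)$value
stopifnot(abs(b * gamma(1 + 1 / a) - m) < 1e-4)
```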

So be aware 🙂

Machine Learning Cheat Sheet

May 24, 2015

Simply excellent – includes a section on Bayesian vs. frequentist analyses.

Check out @freakonometrics’s Tweet:

Kicking ass with Bayesian Statistics in R

November 22, 2013

Some excellent R posts regarding Bayesian statistics:

1. How to program the Laplace approximation in R:

Though Bayesian computation is heavily dominated by Monte Carlo methods, the Laplace approximation is a nice tool to deploy in cases where your MCMC fails to converge. Plus, it makes one appreciate Laplace's genius.
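A minimal example of the idea (my own sketch, not the linked post's code): approximate the normalizing constant of a density by a Gaussian centered at its mode.

```r
## Laplace approximation: log Z ~ log f(mode) + 0.5*log(2*pi/h), where h is the
## negative second derivative of log f at the mode. For a Gaussian target the
## approximation is exact, so log Z of a proper density should come out near 0.
log_f <- function(th) dnorm(th, mean = 1, sd = 2, log = TRUE)  ## toy target
mode  <- optimize(log_f, c(-10, 10), maximum = TRUE)$maximum
eps   <- 1e-4  ## finite-difference second derivative at the mode
h     <- -(log_f(mode + eps) - 2 * log_f(mode) + log_f(mode - eps)) / eps^2
logZ  <- log_f(mode) + 0.5 * log(2 * pi / h)
logZ  ## ~0: the density already integrates to one
```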

2. A bird’s eye view of R’s Bayesian analysis facilities:

Watch this blog for a series of posts about Bayesian survival analysis with R, BUGS and stan.

Failed Randomization In A Randomized Trial?

November 5, 2013

We will continue the saga of the three-arm clinical trial that is giving the editors of the prestigious journal The Spleen a run for their money. While the polls are gathering digital dust, let's see if we can direct this discussion down a more quantitative path. To do so, we will ask (and answer) the question from a frequentist point of view: according to this approach, we raise the red flag if the event under examination is rare assuming that a hypothesis about the state of the world (the null hypothesis H_0) is true.

In this case the null hypothesis is that the investigators at Grand Fenwick Memorial did run a randomized controlled trial under a simple randomization scheme, in which each patient had an equal chance of being given one of the three interventions: GML, SL or MBL. To calculate the rarity of the observed pattern, we need to define an appropriate event and then figure out its rarity ("long-term frequency") over many repetitions of the randomization scheme used in the trial.

Considering the number of patients in the three arms of the trial, 105/70/65, vs. the expectation of 80/80/80, it would appear that the most influential factor in determining the "rarity" of the observed pattern is the difference in size between the largest and the smallest arm of the trial. On the other hand, a difference of 5 between the second-largest and the smallest arms would not appear to be worthy of consideration, at least as a first approximation. To determine the long-term frequency of the event in a trial with 240 patients, we will use the R language to carry out a large number of these hypothetical allocations and count those in which the difference in size between the largest and smallest arms is at least 40:

event<-c(105,70,65)  ## observed pattern
## computes the difference in size between the largest and smallest arms
range.diff<-function(x) max(x)-min(x)
frequentist2<-function(x,l1=40,B=100000) {
    set.seed(4567) ## for reproducibility
    ## B hypothetical trials: sum(x) patients allocated to 3 arms with equal probability
    sims<-rmultinom(B,sum(x),rep(1/3,3))
    ## flags the repetitions of the studies in which a rare
    ## event was observed and calculates the frequency (in %)
    100*mean(apply(sims,2,range.diff)>=l1)
}
frequentist2(event)

This number comes out to be 0.5%. In other words, 1 out of 200 randomized trials that assign patients with equal probability to three arms will generate an imbalance of this magnitude.
But is this the answer we are trying to obtain? The situation the editors of The Spleen face is to evaluate the likelihood that patients were not randomly assigned to the three interventions. This evaluation is only indirectly related to the rarity of observing a large size difference between the arms of a trial that did not cheat. By not considering directly the hypothesis of foul play (unequal allocation probabilities in the three arms), both the investigators and their accusers will find themselves in an endless quarrel about the interpretation of the rarity: a chance finding vs. an improbable one indicative of fraud.

The probability of stacked mass murders is not so small

November 4, 2013

Christian Robert estimates the probability of the observed pattern of 4 mass murders in 4 days and finds it to be 2.8% per year, or 18% in any seven-year period!

Diagrams for hierarchical models

November 2, 2013

A nice alternative to BUGS-like diagrams.
It can communicate distributional assumptions and groupings of variables, and seems a good alternative to the inclusion of code in long online supplements!

R Booklets for biostats, bioinfo and time series analysis

October 28, 2013

Useful booklets by the Sanger Institute on:
biomedical statistics,
bioinformatics and
time series analysis

These can get one started in R (the biomedical statistics one can be used for a medical student's/resident's/fellow's research project).