I can’t believe I am writing a follow up post about the analysis of COVID19 seroprevalence data, but it seems we now have a cottage industry of poorly executed and **analyzed** studies that make wild claims about the prevalence of asymptomatic COVID19 infections. You can read about the Santa Clara study here, which was widely criticized not only methodologically, but also left many prominent commentators wondering about the actual motives of the authors.

The same group is now reporting high seroprevalence rates in Los Angeles, but with an interesting twist: the press release came before the preprint, the data and the methodology were not made available. Hence, I decided I will not bother with these claims of the more recent study, until everything has been disclosed.

I decided to revisit the Santa Clara paper, because of Twitter commentary showing that the authors made numerous mistakes in the calculation of the uncertainty of their estimates:

We could probably work out the issues with the delta method, but I have had way too many zoom sessions for the day to think clearly. In any case, you can go back in time and read about how to compute variances of ratios in this very blog.

However, it makes more sense to show how to compute the prevalence (both point estimate and its uncertainty) with graphical (causal) models. This way, you will never have to use approximations based on Taylor series, unless of course you are trying to keep your math skills sharp.

*I will discuss two different graphical models for these calculations*: both of them can be given a Bayesian interpretation, but the first model allows feedback of the serologic survey data to influence the estimates of the assay sensitivity and specificity, while the other does not. The second model, is in fact rather rudimentary to reinterpret as a non Bayesian likelihood model, so that even people who don’t dig Bayes (“what is wrong with you?”) may feel comfortable using. In case you have not heard the term “graphical models” before you should not despair: graphical models are extremely clever ways to organize and keep track of probabilistic calculations, while also serving as communication devices that show how the different quantities under consideration influence each other.

**The data**

Similar to our previous post we will assume that we have access to **three **different datasets, shown below.

Dataset | Data | Determinants of +ve result |

Gold Standard COVID19 +ve | TP, NP | Sensitivity |

Gold Standard COVID19 -ve | TN, NN | Specificity |

Serological Survey | P, N | Prevalence, Sensitivity, Specificity |

The *first *dataset is a gold standard dataset of people who are **known** **to have had the disease**. This dataset is used to determine sensitivity as a ratio of the True Positive assay results (TP) over all the disease +ve (NP) participants. The *second *dataset is one of people **who are known never to had had the disease**. This dataset is used to determine specificity as the ratio of the True Negative (TN) individuals over all individuals without the disease (NN). Such gold standard datasets could be provided by the manufacturer, or could even be determined locally. We assume that these data are available to us, and we will not engage in debate about the disease determinations were made in the gold standard data. In any case, these datasets provide information that is needed to estimate the prevalence in the *third *dataset, i.e. the serological survey itself. *The ratio of positive tests (P) in the serological survey over all participants in the survey (N) , i.e. the survey positivity rate (p) is a mixed bag of true positives and false positives*. This rate is related to prevalence () , sensitivity and specificity through a very simple formula:

Knowing the sensitivity, specificity and the positive rate in the survey would allow one to solve the equation above for and thus estimate the prevalence. Since the sensitivity and specificity are almost never known exactly, e.g. they are estimated from samples, one would like to take into account the uncertainty about these determinants of assay performance when calculating the prevalence. This is precisely why we need statistical analyses for!

**A full Bayesian solution**

The first probabilistic model (which was used in my previous post) relates the observables (second column of the previous table) to the parameters (unobserved quantities of interest) through the following Directed Acyclical Graphical model.

When executed as a Bayesian calculation, this model will take in the data in the table and produce estimates of the prevalence, sensitivity and specificity that are dependent on both the gold standard and survey data. It should be stressed that if this model is used, then the survey data are going to feedback into the estimates of sensitivity and specificity, which will no longer be determined solely by the gold standard data. STAN code to fit this DAG is shown below. Once the program executes, point and uncertainty estimates may directly obtained from Markov Chain Monte Carlo (MCMC) output e.g. by looking at the mean and the standard deviation.

model0.R<-' functions { } data { int casesSurvey; // Cases in survey int NSurvey; // ppl sampled in survey int TPGoldStandard; // true positive in Gold Standard Sample int NPGoldStandard; // Total number of positive cases in Gold Standard Sample int TNGoldStandard; // true negative in Gold Standard Sample int NNGoldStandard; // Total number of negative cases in Gold Standard Sample } transformed data { } parameters { real<lower=0.0,upper=1.0> Sens; real<lower=0.0,upper=1.0> Spec; real<lower=0.0,upper=1.0> Prev; } transformed parameters { } model { casesSurvey~binomial(NSurvey,Prev*Sens+(1-Prev)*(1-Spec)); TPGoldStandard~binomial(NPGoldStandard,Sens); TNGoldStandard~binomial(NNGoldStandard,Spec); Prev~uniform(0.0,1.0); Spec~uniform(0.0,1.0); Spec~uniform(0.0,1.0); } generated quantities { }'

**Compartmentalizing the calculations**

Some people may object to a particular feature of the previous model: the feedback of the serological survey data back to the estimates of the sensitivity and specificity. This is an unavoidable feature of the parameterization employed and the logical consistency of Bayesian calculations on graphs. However, there is a way to reorganize the calculation and avoid the feedback from taking place. The relevant model is shown below:

The three datasets are now analyzed separately, and each dataset only influences the binomial probability of the corresponding model: sensitivity, specificity and survey positivity rate. This analysis can be run either in a Bayesian mode, or in frequentist mode (e.g. by running three logistic regressions, one for each dataset). After the analyses have concluded we sample the distributions for the three rates, and compute the prevalence according to the equation:

This is a simple Monte Carlo calculation, which can be applied to either a Bayesian posterior probability or a sampling distribution (e.g. by sampling from the normal distribution with mean equal to the parameter estimate and standard deviation equal to the standard error of the model fit). Once we have simulated many values for , we discard those that are negative or greater than one, and form means and standard deviations to generate point summaries and uncertainty estimates. The following code, shows how this model can be implemented in STAN. The *generated quantities *section samples the values of , and the values that are not compatible with the range constraints are discarded in R.

## uses simulation model1.R<-' functions { } data { int casesSurvey; // Cases in survey int NSurvey; // ppl sampled in survey int TPGoldStandard; // true positive in Gold Standard Sample int NPGoldStandard; // Total number of positive cases in Gold Standard Sample int TNGoldStandard; // true negative in Gold Standard Sample int NNGoldStandard; // Total number of negative cases in Gold Standard Sample } transformed data { } parameters { real<lower=0.0,upper=1.0> Sens; real<lower=0.0,upper=1.0> Spec; real<lower=0.0,upper=1.0> totposrate; } transformed parameters { } model { casesSurvey~binomial(NSurvey,totposrate); TPGoldStandard~binomial(NPGoldStandard,Sens); TNGoldStandard~binomial(NNGoldStandard,Spec); totposrate~uniform(0.0,1.0); Spec~uniform(0.0,1.0); Spec~uniform(0.0,1.0); } generated quantities { real Prev; Prev = (totposrate+Spec-1)/(Sens+Spec-1); }'

**Comparison between the two approaches**

How different are these two approaches? I used R and STAN to rerun the analysis of the Santa Clara survey (for simplicity, the gold standard datasets from the Stanford and manufacturer were combined together). The code is available as a word document below.

Parameter | Full Bayesian Approach | Compartmentalized Approach |

Prevalence | 0.010 (0.005) | 0.011 (0.005) |

Sensitivity | 0.838 (0.033) | 0.838 (0.033) |

Specificity | 0.993 (0.003) | 0.993 (0.004) |

The point and uncertainty estimates of all quantities are virtually indistinguishable, showing that in this particular case, feedback from the survey data does not affect the estimates of sensitivity and specificity.

Take home point: leave the delta method behind, or at least use higher order asymptotics when computing variances of ratio variables. Or if you are lazy like me, just fire your Monte Carlo samplers.